J. Software Engineering & Applications, 2010, 3, 1005-1014
doi:10.4236/jsea.2010.311118 Published Online November 2010 (http://www.SciRP.org/journal/jsea)
Copyright © 2010 SciRes. JSEA
Case-Based Reasoning for Reducing Software
Development Effort
Adam Brady1, Tim Menzies1, Oussama El-Rawas1, Ekrem Kocaguneli1, Jacky W. Keung2
1Lane Department of CS&EE, West Virginia University, Morgantown, USA; 2Department of Computing, The Hong Kong Polytech-
nic University, Hong Kong, China.
Email: adam.m.brady@gmail.com, tim@menzies.us, orawas@gmail.com, kocaguneli@gmail.com, jacky.keung@comp.polyu.edu.hk
Received June 24th, 2010; revised July 15th, 2010; accepted July 19th, 2010.
ABSTRACT
How can we best find project changes that most improve project estimates? Prior solutions to this problem required the
use of standard software process models that may not be relevant to some new project. Also, those prior solutions suf-
fered from limited verification (the only way to assess the results of those studies was to run the recommendations back
through the standard process models). Combining case-based reasoning and contrast set learning, the W system re-
quires no underlying model. Hence, it is widely applicable (since there is no need for data to conform to some software
process models). Also, W’s results can be verified (using holdout sets). For example, in the experiments reported here,
W found changes to projects that greatly reduced estimate median and variance by up to 95% and 83% (respectively).
Keywords: Software Effort Estimation, Case-Based Reasoning, Effort Estimation
1. Introduction
Existing research in effort estimations focuses mostly
on deriving estimates from past project data using (e.g.)
parametric models [1] or case-based reasoning (CBR)
[2] or genetic algorithms [3]. That research is curiously
silent on how to change a project in order to, say, re-
duce development effort. That is, that research reports
what is and not what should be changed.
Previously [4-7], we have tackled this problem using
STAR/NOVA, a suite of AI search algorithms that ex-
plored the input space of standard software process
models to find project options that most reduced the
effort estimates. That approach had some drawbacks
including 1) the dependency of the data to be in the
format of the standard process models, 2) the imple-
mentation complexity of the Monte Carlo simulator
and the AI search engines, and 3) the lack of an inde-
pendent verification module (the only way to assess the
results of those studies was to run the recommenda-
tions back through the standard process models).
This paper describes “W”, a simpler, yet more general
solution to the problem of finding project changes that
most improve project estimates. Combining case-based
reasoning and contrast set learning, W requires no un-
derlying parametric model. Hence, W can be applied to
more data sets since there is no requirement for the
data sets to be in the format required by the standard
software process models. For example, this paper ex-
plores five data sets with W, three of which cannot be
processed using STAR/NOVA.
W is simpler to implement and easier to use than
STAR/NOVA, requiring hundreds of lines of the
scripting language AWK rather than thousands of lines
of C++/LISP. For example, we describe below a case
study that finds a loophole in Brooks’ Law (“adding
manpower [sic] to a late project makes it later”). Using
W, that study took three days which included the time
to build W, from scratch. An analogous study, based on
state-of-the-art model-based AI techniques, took two
years.
Also, the accuracy of STAR/NOVA is only as good
as the underlying model (the USC COCOMO suite [1]).
W’s results, on the other hand, are verified using
hold-out sets on real-world data. In the experiments
reported below, we show that W can find changes to
projects that greatly reduce estimate median and vari-
ance by up to 95% and 83%, respectively.
Finally, the difference between W, which finds what
to change in order to improve an estimate, and a stan-
dard case-based effort estimator, which only generates
estimates, is very small. Based on this experiment, we
advise augmenting standard CBR tools with modules
like the planning sub-systems in W.
The rest of this paper is structured as follows. After
some background notes on effort estimation and
STAR/NOVA, we describe the general framework for
case-based reasoning. The W extension to CBR is then
described (contrast set learning over the local neigh-
borhood), using a small example. This is followed by
fourteen case studies, one with Brooks’ Law, and thir-
teen others. Our conclusion will discuss when W is
preferred over STAR/NOVA.
2. Background
2.1. Why Study Effort Estimation?
Generating (and regenerating) project effort estimates is
an important and continuous process for project manag-
ers. Not only do good estimates allow the better use of
resources, but by reviewing and improving their estima-
tion process, a software company can learn and improve
from their past experience.
Sadly, we often get estimates wrong. Consider NASA's Checkout Launch Control System, which was
canceled when the initial estimate of $200 million was
overrun by an additional $200 M [8]. This case is not
unique, despite the significant effort put into designing
more accurate estimation models. It has been reported
that many predictions are wrong by a factor of four or
more [9,10]. It is standard practice to use models to estimate software effort. Many software process models have emerged for that task, yet no single standardized model is widely used by the software engineering industry. There are several reasons for this, including generality, data islands, and instability. Software models may not be general, so it can be inappropriate to apply a model learned in one environment to another. Also, many companies prefer to keep cost-related data confidential. This data-island effect has contributed to the fragmentation of the field, with companies preferring to build private models rather than using publicly available ones. The resulting multiplicity of software effort models has led to a scarcity of the publicly available, local data needed for model-based effort estimation.
Without sufficient data to build, audit, and tune models,
the predictions generated by these models may be highly
unstable. Baker [11] reports a study that learned values
for the (a, b) (linear, exponential) constants in Boehm's
COCOMO software process model [9]. The study was
repeated 100 times, each time selecting from a random
sample of 90% of the project data. The learned values for
(a, b) exhibited an alarming variance:
(2.2 ≤ a ≤ 9.18) ∧ (0.88 ≤ b ≤ 1.09)    (1)
Such large variations make it hard to understand the ef-
fects of changing project options. Suppose some pro-
posed change doubles productivity, but a moves from 9
to 4.5. The improvement resulting from that change
would be obscured by the tuning variance.
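To make the issue concrete, the following sketch (illustrative Python, with hypothetical (a, b) values drawn from the ranges of Equation (1)) shows how tuning drift can swamp a genuine productivity improvement. It assumes the basic COCOMO form effort = a * KLOC^b * EM, where EM stands for the product of the effort multipliers.

def cocomo_effort(kloc, a, b, em=1.0):
    """COCOMO-style estimate: effort = a * KLOC^b * (product of effort multipliers)."""
    return a * (kloc ** b) * em

kloc = 100.0
before = cocomo_effort(kloc, a=9.0, b=1.0, em=1.0)   # one plausible tuning
after = cocomo_effort(kloc, a=4.5, b=1.0, em=0.5)    # productivity doubled, but 'a' also drifted
print(before, after)  # the apparent 4x gain mixes tuning drift with the real 2x improvement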
2.2. STAR and NOVA
The SBSE literature inspired us to try simulated anneal-
ing to search the what-ifs associated with Equation 1.
This led to the STAR system [4,6]. NOVA was a gen-
eralization of STAR that included simulated annealing
and other search engines [5,7].
STAR/NOVA handled model variance by finding
conclusions that were stable across the space of possible
tunings. This analysis assumed that, for a mature effort
estimation model, the range of possible tunings was
known (this is the case for models like COCOMO). For
such models, it is possible for the AI search engines to
find conclusions that hold across the space of all tunings.
STAR and NOVA constrain project options P but not
the tuning options T. Hence, their recommendations con-
tain subsets of the project options P that most improve the
score, despite variations in the tunings T. This approach meant we could reuse the COCOMO models without requiring local tuning data. The following briefly presents the operation of STAR and NOVA:
1) SAMPLE: We sample across the ranges of all the
attributes in the model n1 times. Often we sample ran-
domly across the range. Some heuristics allow us to
concentrate more on the extremes of the range.
2) DISCRETIZE: The data seen in the n1 samples is then discretized into D = 10 equal-frequency bins (a small sketch of this and the next step follows this list).
3) CLASSIFY: The top n% projects are classified as
best or rest.
4) RANK: The ranges are then ranked in increasing
order using Support-Based Bayesian Ranking.
5) PRUNE: STAR runs n2 experiments with the mod-
els where the top ranked ranges 1…X ranges are pre-set
and the remaining ranges can be selected at random.
6) REPORT: STAR returns the 1…X settings that best optimize the fitness function. These settings are
determined by iterating back from the minimum point
achieved towards the first point that is statistically simi-
lar to the minimum point.
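The following sketch illustrates the DISCRETIZE and CLASSIFY steps (our own illustrative Python, not the STAR/NOVA source; the effort values are invented).

def equal_frequency_bins(values, d=10):
    """Return bin boundaries giving (roughly) the same number of values per bin."""
    ordered = sorted(values)
    step = max(1, len(ordered) // d)
    return [ordered[i] for i in range(step, len(ordered), step)]

def classify(projects, effort_of, best_fraction=0.1):
    """Split projects into (best, rest) by their effort scores."""
    ranked = sorted(projects, key=effort_of)
    cut = max(1, int(len(ranked) * best_fraction))
    return ranked[:cut], ranked[cut:]

efforts = [6, 8, 12, 30, 45, 60, 98, 252, 683, 980]
print(equal_frequency_bins(efforts, d=5))                        # 4 boundaries for 5 bins
print(classify(efforts, effort_of=lambda e: e, best_fraction=0.2))  # (best, rest)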
In practice, the STAR/NOVA approach was very effective.
Figure 1 shows the large effort reductions found by
STAR in three out of four cases presented at ASE’09. It
is insightful to reflect on STAR/NOVA's failure to
find large reductions in the fourth case study (nasa93
osp2). In this project, management had already fixed
most of the project options. STAR/NOVA failed, in this
case, since there was very little left to try and change.
Study            NOVA
nasa93 flight    72%
nasa93 ground    73%
nasa93 osp       42%
nasa93 osp2       5%
Figure 1. Effort estimate improvements found by NOVA, from [12].
This fourth case study led to one of the lessons learned from STAR/NOVA: apply project option exploration tools
as early as possible in the lifecycle of a project. Or, to
say that more succinctly: if you fix everything, there is
nothing left to fix [13].
While a successful prototype, STAR/NOVA has cer-
tain drawbacks:
Model dependency: STAR/NOVA requires a model
to calculate (e.g.) estimated effort. In order to do so,
we had to use some software process models to
generate the estimates. Hence, the conclusions
reached by STAR/NOVA are only as good as this
model. That is, if a client doubts the relevance of
those models, then the conclusions will also be
doubted.
Data Dependency: STAR/NOVA’s AI algorithms
explored an underlying software process model.
Hence, it could only process project data in a format
compatible with the underlying model. In practice,
this limits the scope of the tool.
Inflexibility: It proved to be trickier than we
thought to code up the process models, in a manner
suitable for Monte Carlo simulation. By our count,
STAR/NOVA’s models required 22 design deci-
sions to handle certain special cases. Lacking guid-
ance from the literature, we just had to apply “en-
gineering judgment” to make those decisions. While
we think we made the right decisions, we cannot
rigorously justify them.
Performance: Our stochastic approach conducted
several tens of thousands of iterations to explore the
search space, with several effort estimates needing
calculation with each iteration. This resulted in a
performance disadvantage.
Size and Maintainability: Due to all the above fac-
tors, our code base proved difficult to maintain.
While there was nothing in principle against applying
our techniques to other software effort models, we be-
lieve that the limiting factor on disseminating our tech-
nique is the complexity of our implementation. As partial
evidence for this, we note that in the three years since we
first reported our technique [6]:
We have only coded one set of software process
models (COCOMO), which inherently limited the
scope of our study.
No other research group has applied these tech-
niques.
Therefore, rather than elaborate a complex code base, we now explore a different option, based on Case-Based Reasoning (CBR). This new approach has no model restrictions (since it does not use a model) and can accommodate a wide range of data sets (since there are no restrictions on the variables that can be processed).
3. Case-Based Reasoning (CBR)
Case based reasoning is a method of machine learning
that seeks to emulate human recollection and adaptation
of past experiences in order to find solutions to current
problems. That is, as humans we tend to base our deci-
sions not on complex reductive analysis, but on an in-
stantaneous survey of past experiences [14]; i.e., we
don’t think, we remember. CBR is purely based on this
direct adaptation of previous cases based on the similar-
ity of those cases with the current situation. A CBR-based system has no dedicated world-model logic; rather, that model is expressed through the available past cases in the case cache. This cache is continuously updated and appended with additional cases.
Aamodt & Plaza [15] describe a 4-step general CBR
cycle, which consists of:
1) Retrieve: Find the most similar cases to the target
problem.
2) Reuse: Adapt our actions conducted for the past
cases to solve the new problem.
3) Revise: Revise the proposed solution for the new
problem and verify it against the case base.
4) Retain: Retain the parts of current experience in the
case base for future problem solving.
Having verified the results from our chosen adapted ac-
tion on the new case, the new case is added to the avail-
able case base. The last step allows CBR to effectively
learn from new experiences. In this manner, a CBR sys-
tem is able to automatically maintain itself. As discussed
below, W supports retrieve, reuse, and revise (as well as retain, if the user chooses to collect the new cases).
This 4-stage cyclical CBR process is sometimes re-
ferred to as the R4 model [16]. Shepperd [16] considered
the new problem as a case that comprises two parts.
There is a description part and a solution part forming
the basic data structure of the system. The description
part is normally a vector of features that describe the
case state at the point at which the problem is posed. The
solution part describes the solution for the specific prob-
lem (the problem description part).
The similarity between the target case and each case in the case base is determined by a similarity measure. Different methods of measuring similarity have been proposed for different measurement contexts. A similarity measure quantifies the closeness (or distance) between two objects in an n-dimensional space; the result is usually presented in a distance (similarity) matrix identifying the similarity among all cases in the dataset. Although other distance metrics are available for different purposes, the Euclidean distance metric is probably the most commonly used in CBR.
Irrespective of the similarity measure used, the objective is to rank the cases in the case base by their similarity to the target case and utilize the known solutions of the nearest k cases. The value of k has been the subject of debate [17,2]: Shepperd [2] and Mendes [18] argue for k = 3, while Li [3] proposes k = 5.
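A minimal sketch of such a retrieval step, assuming numeric attributes normalized to 0..1 and plain Euclidean distance (illustrative Python, not any particular CBR tool):

import math

def normalize(cases):
    """Rescale every attribute column of the case base to the range 0..1."""
    lo = [min(col) for col in zip(*cases)]
    hi = [max(col) for col in zip(*cases)]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in cases]

def k_nearest(target, cases, k=3):
    """Indices of the k cases closest to the target (Euclidean distance)."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sorted(range(len(cases)), key=lambda i: dist(target, cases[i]))[:k]

cases = normalize([[1, 2, 100], [3, 2, 400], [5, 4, 900], [2, 3, 150]])
print(k_nearest(cases[0], cases, k=3))  # neighbours of the first (already normalized) case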
Once the actual value of the target case is available it
can be reviewed and retained in the case-base for future
reference. Stored cases must be maintained over time to
prevent information irrelevancy and inconsistency. This
is a typical case of incremental learning in an organiza-
tion utilizing the techniques of CBR.
Observe that these 4 general CBR application steps
(retrieve, reuse, revise, retain) do not include any explicit
model based calculations; rather we are relying on our
past experience, expressed through the case base, to es-
timate any model calculations based on the similarity to
the cases being used. This has two advantages:
1) It allows us to operate independently of the models
being used. For example, our prior report to this confer-
ence [13] ran over two data sets. This study, based on
CBR, uses twice as many data sets.
2) This improves our performance, since data retrieval
can be more efficient than calculation, especially given
that many thousands of iterations of calculation were
needed with our traditional modeling based tool. As evi-
dence of this, despite the use of a slower language, W’s
AWK code runs faster than the C++/LISP used in
STAR/NOVA. It takes just minutes to conduct 20 trials
over 13 data sets with W. A similar trial, conducted with
NOVA or STAR, can take hours to run.
4. From CBR to W
A standard CBR algorithm reports the median class val-
ue of some local neighborhood. The W algorithm treats
the local neighborhood in a slightly different manner:
The local neighborhood is divided into best and
rest.
A contrast set is learned that most separates the re-
gions (contrast sets contain attribute ranges that are
common in one region, but rare in the other).
W then searches for a subset of the contrast set that
best selects for (e.g.) the region with lower effort
estimates.
The rest of this section details the above process.
4.1. Finding Contrast Sets
Once a contrast set learner is available, it is a simple
matter to add W to CBR. W finds contrast sets using a
greedy search, where candidate contrast sets are ranked
by the log of the odds ratios. Let some attribute range x
appear at frequency N1 and N2 in two regions of size R1
and R2. Let the R1 region be the preferred goal and R2 be
some undesired goal. The log of the odds ratio, or LOR, is:
LOR(x) = log((N1/R1) / (N2/R2))    (2)
Note that when LOR(x) = 0, then x occurs with the
same probability in each region (such ranges are therefore not useful for selecting one region or another). On the
other hand, when LOR(x) > 0, x is more common in the
preferred region than otherwise. These LOR-positive
ranges are candidate members of the contrast set that
selects for the desired outcome.
It turns out that, for many data sets, the LOR values
for all the ranges contain a small number of very large
values (strong contrasts) and a large number of very
small values (weak contrasts). The reasons for this dis-
tribution do not concern us here (and if the reader is in-
terested in this master-variable effect, they are referred to
[19,20]). What is relevant is that the LOR can be used to
rank candidate members of a contrast set. W computes
the LORs for all ranges, and then conducts experiments applying the top i ranked ranges.
For more on LOR, and their use for multi-dimensional
data, see [21].
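A small sketch of Equation (2), with invented counts (the epsilon guard is our own addition to handle ranges that are absent from one region):

import math

def lor(n1, r1, n2, r2):
    """log((N1/R1) / (N2/R2)); positive scores favour the preferred (best) region."""
    eps = 1e-6  # guard against ranges that never appear in one region
    return math.log(((n1 + eps) / r1) / ((n2 + eps) / r2))

# e.g. a range seen in 4 of the 5 best cases but only 3 of the 15 rest cases:
print(lor(4, 5, 3, 15))   # > 0: a strong candidate member of the contrast set
print(lor(1, 5, 3, 15))   # ~ 0: equally common in both regions, so uninformative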
4.2. The Algorithm
CBR systems input a query q and a set of cases. They
return the subset of cases C that is relevant to the query.
In the case of W:
Each case Ci is an historical record of one software
project, plus the development effort required for
that project. Within the case, the project is de-
scribed by a set of attributes which we assume have
been discretized into a small number of discrete
values (e.g. analyst capability {1, 2, 3, 4, 5} de-
noting very low, low, nominal, high, very high re-
spectively).
Each query q is a set of constraints describing the
particulars of a project. For example, if we were in-
terested in a schedule over-run for complex, high-reliability projects that have only minimal access to tools, then those constraints can be expressed in the
syntax of Figure 2.
W seeks q' (a change to the original query) that finds
another set of cases C' such that the median effort values
in C' are less than that of C (the cases found by q). W
finds q' by first dividing the data into two-thirds training
and one-third testing. Retrieve and reuse are applied to
the training set. Revising is then applied to the test set.
1) Retrieve: The initial query q is used to find the N
training cases nearest to q using a Euclidean distance
measure where all the attribute values are normalized
from 0 to 1.
2) Reuse (adapt): The N cases are sorted by effort and
divided into the K1 best cases (with lowest efforts) and
K2 rest cases. For this study, we used K1 = 5, K2 = 15.
Then we seek the contrast sets that select for the K1 best
cases with lowest estimates. All the attribute ranges that
the user has marked as “controllable” are scored and
sorted by LOR. This sorted order S defines a set of candidate q' queries that use the first i entries in S:
qi' = q ∪ S1 ∪ S2 ∪ … ∪ Si    (3)
Formally, the goal of W is to find the smallest i value such that qi' selects the cases with the lowest median estimates.
According to Figure 3, after retrieving and reusing
comes revising (this is the “verify” step). When revising
q', W prunes away irrelevant ranges as follows:
1) Set i = 0 and qi' = q.
2) Let Foundi be the test cases consistent with qi' (i.e.,
that do not contradict any of the attribute ranges in qi').
3) Let Efforti be the median efforts seen in Foundi.
@project example
@attribute ?rely 3 4 5
@attribute tool 2
@attribute cplx 4 5 6
@attribute ?time 4 5 6
Figure 2. W’s syntax for describing the input query q. Here, all the values run 1 to 6; “4 ≤ cplx ≤ 6” denotes projects with above-average complexity. Question marks denote what can be controlled; in this case, rely and time (required reliability and development time).
Figure 3. A diagram describing the steps of CBR (source:
http://www.peerscience.com/Assets/cbrcycle1.gif).
4) If Foundi is too small, then terminate (to avoid over-fitting). Following Shepperd [2], we terminated when |Foundi| < 3.
5) If i > 1 and Efforti ≥ Efforti-1, then terminate (due to no improvement).
6) Print qi' and Efforti.
7) Set i = i + 1 and qi' = qi-1' ∪ Si.
8) Go to step 2.
On termination, W recommends changing a project
according to the set q' – q. For example, in Figure 2, if q'
– q is rely = 3 then this treatment recommends that the
best way to reduce the effort for this project is to reject
rely = 4 or 5.
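The loop can be summarized in a few lines (an illustrative Python sketch, not the 300-line AWK implementation described below; the helpers consistent and effort_of are assumptions, not part of W's interface):

import statistics

def revise(q, ranked, test_cases, consistent, effort_of, min_cases=3):
    """ranked is the LOR-sorted list S of controllable ranges; consistent(q, c)
    should return True when case c does not contradict any range in candidate q."""
    best_q, prev_effort = list(q), None
    for s_i in ranked:
        candidate = best_q + [s_i]                     # q_i' = q_(i-1)' plus the next range S_i
        found = [c for c in test_cases if consistent(candidate, c)]
        if len(found) < min_cases:                     # too few matching cases: stop (over-fitting)
            break
        effort = statistics.median(effort_of(c) for c in found)
        if prev_effort is not None and effort >= prev_effort:
            break                                      # no further improvement: stop
        best_q, prev_effort = candidate, effort
    return best_q, prev_effort                         # the recommendation is best_q minus q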
One useful feature of the above loop is that it is not a
black box that offers a single “all-or-nothing” solution.
Rather it generates enough information for a user to
make their cost-benefit tradeoffs. In practice, users may
not accept all the treatments found by this loop. Rather,
for pragmatic reasons, they may only adopt the first few
Si changes seen in the first few rounds of this loop. Users
might adopt this strategy if (e.g.) they have limited man-
agement control of a project (in which case, they may
decide to apply just the most influential Si decisions).
Implementing W is simpler than the STAR/NOVA
approach:
Both NOVA and STAR contain a set of process models for predicting effort, defects, and project threats
as well as Monte Carlo routines to randomly select
values from known ranges. STAR and NOVA im-
plement simulated annealing while NOVA also im-
plements other search algorithms such as A*, LDS,
MAXWALKSAT, beam search, etc. STAR/NOVA
are 3000 and 5000 lines of C++ and LISP, respec-
tively.
W, on the other hand, is a 300 line AWK script.
Our pre-experimental suspicion was that W was too
simple and would need extensive enhancement. However,
the results shown below suggest that, at least for this task,
simplicity can suffice (but see the future work section for
planned extensions).
Note that W's verification results are more rigorous than
those of STAR/NOVA. W reports results on data that is
external to its deliberation process (i.e., on the test set).
STAR/NOVA, on the other hand, only reported the
changes to model output once certain new constraints
were added to the model input space.
5. Data
Recall that a CBR system takes as input a query q and cases
C. W has been tested using multiple queries on the data
sets of Figure 4. These queries and data sets are de-
scribed below.
5.1. Data Sets
As shown in Figure 4, our data includes:
The standard public domain COCOMO data set
(Cocomo81);
Data from NASA;
Data from the International Software Benchmarking
Standards Group (ISBSG);
The Desharnais and Maxwell data sets.
Except for ISBSG, all the data used in this study is
available at http://promisedata.org/data or from the au-
thors.
Note the skew of this data (min to median much
smaller than median to max). Such asymmetric distribu-
tions complicate model-based methods that use Gaussian
approximations to variables.
There is also much divergence in the attributes used in
our data:
While our data include effort values (measured in
terms of months or hours), no other feature is
shared by all data sets.
The COCOMO and NASA data sets all use the at-
tributes defined by Boehm [9]; e.g. analyst capabil-
ity, required software reliability, memory con-
straints, and use of software tools.
The other data sets use a variety of attributes such
as the number of data model entities, the number of
basic logical transactions, and number of distinct
business units serviced.
This attribute divergence is a significant problem for
model-based methods like STAR/NOVA since those
systems can only accept data that conforms to the space
of attributes supported by their model. For example, this study uses the five data sets listed in Figure 4. STAR/NOVA can only process two of them (coc81 and nasa93).
CBR tools like W, on the other hand, avoid these two
problems:
W makes no assumption about the distributions of
the variables.
W can be easily applied to any attribute space (caveat: as long as there are some dependent variables).
5.2. Queries
Figure 2 showed an example of the W query language:
The idiom “@attribute name range” defines the range of interest for some attribute “name”.
If “range” contains multiple values, then this repre-
sents a disjunction of possibilities.
If “range” contains one value, then this represents a
fixed decision that cannot be changed.
The idiom “?x” denotes a controllable attribute (and
W only generates contrast sets from these control-
lables). Note that the “range” defined for “?x” must contain more than one value; otherwise there is no point in making it controllable (a small reader for this syntax is sketched below).
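As noted above, a minimal reader for this syntax might look as follows (an illustrative sketch, not W's own parser):

def parse_query(lines):
    """Read the Figure 2 syntax: a leading "?" marks a controllable attribute;
    a multi-valued range is a disjunction, a single value is a fixed decision."""
    query = {"project": None, "attributes": {}}
    for line in lines:
        tokens = line.split()
        if tokens[0] == "@project":
            query["project"] = tokens[1]
        elif tokens[0] == "@attribute":
            name = tokens[1]
            query["attributes"][name.lstrip("?")] = {
                "controllable": name.startswith("?"),
                "range": [float(v) for v in tokens[2:]],
            }
    return query

example = ["@project example", "@attribute ?rely 3 4 5", "@attribute tool 2"]
print(parse_query(example))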
6. Case Studies
6.1. Case Study #1: Brooks’ Law
This section applies W to Brooks’ Law. Writing in the
1970s [22], Brooks noted that software production is a
very human-centric activity and managers need to be
aware of the human factors that increase/decrease pro-
ductivity. For example, a common practice at that time at
IBM was to solve deadline problems by allocating more
resources. In the case of programming, this meant
Dataset     Attributes  Number of cases  Content                     Units    Min  Median  Mean  Max    Skewness
coc81       17          63               NASA projects               months   6    98      683   11400  4.4
nasa93      17          93               NASA projects               months   8    252     624   8211   4.2
desharnais  12          81               Canadian software projects  hours    546  3647    5046  23940  2.0
maxwell     26          62               Finnish banking software    months   6    5189    8223  63694  3.3
isbsg       14          29               Banking projects of ISBSG   minutes  662  2355    5357  36046  2.6
Figure 4. The 328 projects used in this study come from 5 data sets.
@project brooks
@attribute apex 1
@attribute plex 1
@attribute ltex 1
@attribute kloc 0.9 2.2 3 [snip] 339 350 352 423 980
Figure 5. The brookslaw query.
Treatment            median  spread
AsIs  = nasa93       225     670
ToBe1 = nasa93 q     380     680
ToBe2 = nasa93 q'    220     290
Figure 6. Effort estimates seen in different treatments for the Brooks’ Law experiment.
adding more programmers to the team. Brooks argued
that this was an inappropriate response since, according
to Brooks’ law “adding manpower [sic] to a late software
project makes it later”. The reason for this slowdown is
two-fold:
The more people involved the greater the commu-
nication overhead. While this is certainly an issue if
all parts of the software system are accessible to all
other parts, with an intelligent module design, this
first issue can be mitigated.
The second issue is more fundamental. Software
construction is a complex activity. Newcomers to a
project suffer from inexperience in the tools, the
platform, the problem domain, etc.
The query of Figure 5 models this second issue. In
this query, all the experience attributes have been set to
their lowest value (apex, plex, ltex are analyst experience,
programmer language experience, and language and tool
experience, respectively). The remaining attributes are
all controllable and are allowed to move over their full
range. For a (dataset, query) of (nasa93, q = brookslaw),
W returns
q' = q ∪ (data = 2)    (4)
That is, W is recommending setting the database size to its lowest value. Databases are used to store program and data elements. In effect, W is recommending reining in the scope of the project. The recommendation can be
paraphrased as follows:
If the project is late, and you add more staff with less
experience, you can still finish on time if you decrease
the scope of the project.
Figure 6 shows the effects of this recommendation.
The AsIs row shows the median and spread of the effort
values in nasa93. The ToBe1 row shows the effect of
Brooks’ Law. In the subset of the data consistent with apex = plex = ltex = 1, the median effort has nearly doubled. The ToBe2 row shows the impact of W’s recommendation: the project will now finish in nearly the same time as the AsIs row, and the spread is greatly reduced.
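For readers who wish to reproduce such numbers from a case base, the following sketch shows the selection and summary steps (illustrative Python; the attribute names and the use of max minus min for the "spread" are assumptions made for illustration):

import statistics

def select(cases, constraints):
    """Keep only the cases consistent with every (attribute, allowed-values) constraint."""
    return [c for c in cases
            if all(c.get(attr) in allowed for attr, allowed in constraints.items())]

def summarize(cases, effort_key="effort"):
    """(median, spread) of the effort values in a subset of cases."""
    efforts = [c[effort_key] for c in cases]
    return statistics.median(efforts), max(efforts) - min(efforts)

# AsIs:  summarize(nasa93)
# ToBe1: summarize(select(nasa93, {"apex": {1}, "plex": {1}, "ltex": {1}}))
# ToBe2: summarize(select(nasa93, {"apex": {1}, "plex": {1}, "ltex": {1}, "data": {2}}))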
One of the reasons we are exploring W is the simplic-
ity of the implementation. In this regard, it is useful to
compare our results on Brooks’ Law to other researchers.
Brooks’ Law is a well researched effect and other re-
searchers have found special cases where the general law
does not hold. For example, using sophisticated qualita-
tive reasoning techniques, Zhang et al. [23] found their
own loopholes in Brooks’ Law. One of us (Keung)
worked on site with the Zhang team and reports that the
Brooks analysis was the main result of a two year mas-
ters graduate thesis. In contrast, writing W took three
days and the specific analysis of Brooks’ Law took
twenty minutes from first posing the question, to graph-
ing the output.
6.2. Thirteen More Case Studies
Apart from the Brooks’ Law experiment, we have tested
W on thirteen other case studies:
For the ISBSG data set, we used our recent experi-
ence to describe the constraints suitable for a
stand-alone or client server system (denoted
stdalone and clientsrv).
For the Desharnais data set, we posed queries rep-
resenting:
(s; m; l) denotes (small, medium, large) projects;
(mngr; team) denotes (manager, team) experience being low.
For the Cocomo and NASA data sets, we used our
contacts at the Jet Propulsion Laboratory to write
queries describing (a) osp (see above); (b) the sec-
ond version of that system called osp2; as well as (c)
generic flight and (d) ground systems.
Lacking direct experience with the Finnish financial
system, we could not pose specific queries to the
Maxwell dataset. Instead, we made half the attrib-
utes controllable and used that for the Maxwell
query.
Figure 7 shows the improvements seen in our 13 queries, running on the data sets of Figure 4. As shown by the last line of Figure 7, the usual improvements were (36%, 68%) for (median, spread). Note that, unlike STAR/NOVA, these are results on real-world data not used during training.
Figure 8 displays the Figure 7 results graphically.
The dashed line indicates the median of the improve-
ments for each axis. One data set had consistently worse
results than any other. The gray cells of Figure 7 indi-
cate when W failed (i.e., where the treatments increased
median development effort or spread, or both). Note that
the gray cells are found only in the Desharnais results.
On investigation, the root cause of the problem was
                           Improvement
dataset     query q      median   spread
coc81       ground        95%      83%
coc81       osp           82%      68%
nasa93      ground        77%      15%
ISBSG       stdalone      69%     100%
nasa93      flight        61%      73%
nasa93      osp           49%      48%
maxwell                   44%      76%
ISBSG       clientserv    42%      88%
desharnais  mngr-l        36%     -45%
coc81       osp2          31%      71%
nasa93      osp2          27%      42%
desharnais  mngr-s        23%      85%
desharnais  mngr-m        23%      85%
coc81       flight         0%      18%
desharnais  team-m         0%      60%
desharnais  team-s       -15%    -206%
desharnais  team-l       -15%     -93%
median                    36%      68%
Figure 7. Improvements (100 * (initial - final)/initial) for 13 queries, sorted by median improvement. Gray cells (here, the negative entries) show negative improvement.
Figure 8. Median and spread improvements for effort estimation. Dashed lines mark median values.
the granularity of the data. Whereas (e.g.) coc81 assigns
only one of six values to each attribute, Desharnais’ at-
tributes had a very wide range.
Currently, we are exploring discretization policies [24,25] to reduce attributes with large cardinality to a smaller set of values. Tentatively, we can say that discretization
solves the problem with Desharnais but we are still
studying this aspect of our system. Even counting the
negative Desharnais results, in the majority of cases W
found treatments that improved both the median and
spread of the effort estimates. Sometimes, the improve-
ments were modest: in the case of (coc81, flight), the
median did not improve (but did not get worse) while the
spread was only reduced by 18%. But sometimes the
improvements are quite dramatic.
In the case of (ISBSG, stdalone), a 100% improve-
ment in spread was seen when q' selected a group of
projects that were codeveloped and, hence, all had
the same development time.
In the case of (coc81, ground), a 95% effort im-
provement was seen when q' found that ground
systems divide into two groups (the very expensive,
and the very simple). In this case W found the fac-
tor that drove a ground system into the very simple
case.
7. Discussion
7.1. Comparisons to NOVA
Figure 9 compares the estimation improvements found by W (in this report) to the improvements reported previously (in Figure 1). This table is much shorter than Fig-
ure 7 since, due to the modeling restrictions imposed by
the software process models, NOVA cannot be applied to
all the data sets that can be processed by W. The num-
bers in Figure 9 cannot be directly compared due to the different goals of the two systems: W tries
to minimize effort while NOVA tries to minimize effort
and development time and delivered defects (we are cur-
rently extending W to handle such multiple-goal optimi-
zation tasks). Nevertheless, it is encouraging to note that
the results are similar and that the W improvements are
not always less than those found by STAR/NOVA.
Regardless of the results in Figure 9, even though we
prefer W (due to the simplicity of the analysis) there are
clear indicators of when we would still use STAR/
NOVA. W is a case-based method. If historical cases are
not available, then STAR/NOVA is the preferred method.
On the other hand, STAR/NOVA is based on the USC COCOMO suite of models. If the local business users do not endorse those models, then W is the preferred method.
Figure 9. Comparing improvements found by NOVA (from Figure 1) and W (Figure 7).
7.2. Threats to Validity
External validity is the ability to generalize results out-
side the specifications of that study [26]. To ensure the
generalizability of our results, we studied a large number
of projects. Our datasets contain a wide diversity of pro-
jects in terms of their sources, their domains and the time
period they were developed in. Our reading of the litera-
ture is that this study uses more project data, from more
sources, than numerous other papers. Table 4 of [27] lists
the total number of projects in all data sets used by other
studies. The median value of that sample is 186, which is
less much less than the 328 projects used in our study.
Internal validity questions to what extent the cause-effect
relationship between dependent and independent vari-
ables hold [28]. For example, the above results showed
reductions in the effort estimates of up to 95%; i.e., by a
factor of 20. Are such massive reductions possible?
These reductions are theoretically possible. Making maximal changes to the first factor (personnel/team capability) can affect the development effort by a factor of up to 3.53. Making maximal changes to just the first four factors could have a net effect of up to:
3.53 × 2.38 × 1.62 × 1.54 ≈ 21 > 20    (5)
By the same reasoning, making maximal changes to
all factors could have a net effect of up to eleven thou-
sand. Hence, an improvement of 95% (or even more) is
theoretically possible. As to what is pragmatically possi-
ble, that is a matter for human decision making. No
automatic tool such as STAR/NOVA/W has access to all
the personnel factors and organizational constraints
known to a human manager. Also, some projects are in-
herently expensive (e.g. the flight guidance system of a
manned spacecraft) and cutting costs by, say, reducing
the required reliability of the code is clearly not appro-
priate. Tools like W are useful for uncovering options
that a human manager might have missed, yet ultimately
the actual project changes must be a human decision.
8. Conclusions
If a manager is given an estimate for developing some
software, they may ask “how do I change that?” The
model variance problem makes it difficult to answer this
question. Our own prior solution to this problem required
an underlying process model that limited the data sets
that can be analyzed.
This paper has introduced W, a case-based reasoning
approach (augmented with a simple linear-time greedy
search for contrast sets). W provides a mechanism that allows projects to improve over time based on historical events, greatly assisting in project planning and resource allocation. W has proven to be simpler to implement and
use than our prior solutions. Further, since this new me-
thod has no model-based assumptions, it can be applied
to more data sets. When tested on 13 real-world case
studies, this approach found changes to projects that
could greatly reduce the median and variance of the ef-
fort estimate.
REFERENCES
[1] B. Boehm, E. Horowitz, R. Madachy, D. Reifer, B. K.
Clark, B. Steece, A. W. Brown, S. Chulani and C. Abts,
“Software Cost Estimation with Cocomo II,” Prentice
Hall, New Jersey, 2000.
[2] M. Shepperd and C. Schofield, “Estimating Software
Project Effort Using Analogies,” IEEE Transactions on
Software Engineering, Vol. 23, No. 11, November 1997,
pp. 736-743.
[3] Y. Li, M. Xie and T. Goh, “A Study of Project Selection
and Feature Weighting for Analogy Based Software Cost
Estimation,” Journal of Systems and Software, Vol. 82,
No. 2, February 2009, pp. 241-252.
[4] O. El-Rawas, “Software Process Control without Calibra-
tion,” Master’s Thesis, Morgantown, 2008.
[5] T. Menzies, O. El-Rawas, J. Hihn and B. Boehm, “Can
We Build Software Faster and Better and Cheaper?”
Proceedings of the 5th International Conference on Pre-
dictor Models in Software Engineering (PROMISE’09),
2009.
[6] T. Menzies, O. Elrawas, J. Hihn, M. Feather, B. Boehm
and R. Madachy, “The Business Case for Automated Soft-
ware Engineering,” Proceedings of the Twenty-second
IEEE/ACM International Conference on Automated Soft-
ware Engineering (ASE’07), 2007, pp. 303-312.
[7] T. Menzies, S. Williams, O. El-rawas, B. Boehm and J. Hihn,
“How to Avoid Drastic Software Process Change (Using
Stochastic Stability),” Proceedings of the 31st International Conference on Software Engineering (ICSE’09), 2009.
[8] K. Cowing, “Nasa to Shut down Checkout & Launch Con-
trol System,” August 26, 2002. http://www.spaceref.com
/news/viewnews.html
[9] B. Boehm, “Software Engineering Economics,” Prentice
Hall, New Jersey, 1981.
[10] C. Kemerer, “An Empirical Validation of Software Cost
Estimation Models,” Communications of the ACM, Vol.
30, No. 5, May 1987, pp. 416-429.
[11] D. Baker, “A Hybrid Approach to Expert and Mod-
el-Based Effort Estimation,” Master’s Thesis, Morgan-
town, 2007.
[12] P. Green, T. Menzies, S. Williams and O. El-waras, “Un-
derstanding the Value of Software Engineering Tech-
nologies,” Proceedings of the 2009 IEEE/ACM Interna-
tional Conference on Automated Software Engineering
(ASE’09), 2009.
[13] T. Menzies, O. Elrawas, D. Baker, J. Hihn and K. Lum,
“On the Value of Stochastic Abduction (If You Fix Eve-
rything, You Lose Fixes for Everything Else),” Interna-
tional Workshop on Living with Uncertainty (An ASE’07
Co-Located Event), 2007.
[14] R. C. Schank, “Dynamic Memory: A Theory of Remind-
ing and Learning in Computers and People,” Cambridge
University Press, New York, 1983.
[15] A. Aamodt and E. Plaza, “Case-Based Reasoning: Foun-
dational Issues, Methodological Variations and System
Approaches,” Artificial Intelligence Communications,
Vol. 7, No. 1, 1994, pp. 39-59.
[16] M. J. Shepperd, “Case-Based Reasoning and Software
Engineering,” Technical Report TR02-08, Bournemouth
University, UK, 2002.
[17] G. Kadoda, M. Cartwright, L. Chen and M. Shepperd,
“Experiences Using Case-Based Reasoning to Predict
Software Project Effort,” Keele University, Staffordshire,
2000.
[18] E. Mendes, I. D. Watson, C. Triggs, N. Mosley and S.
Counsell, “A Comparative Study of Cost Estimation Mod-
els for Web Hypermedia Applications,” Empirical Soft-
ware Engineering, Vol. 8, No. 2, June 2003, pp. 163-196.
[19] T. Menzies, D. Owen and J. Richardson, “The Strangest
Thing about Software,” IEEE Computer, Vol. 40, No. 1,
January 2007, pp. 54-60.
[20] T. Menzies and H. Singh, “Many Maybes Mean (Mostly)
the Same Thing,” In: M. Madravio, Ed., Soft Computing
in Software Engineering, Springer-Verlag, Berlin Hei-
delberg, 2003.
[21] M. Možina, J. Demšar, M. Kattan and B. Zupan, “Nomo-
grams for Visualization of Naive Bayesian Classifier,”
Proceedings of the 8th European Conference on Princi-
ples and Practice of Knowledge Discovery in Databases
(PKDD’04), Vol. 3202, 2004, pp. 337-348.
[22] F. P. Brooks, “The Mythical Man-Month,” Anniversary
Edition, Addison-Wesley, Massachusetts, 1975.
[23] H. Zhang, M. Huo, B. Kitchenham and R. Jeffery,
“Qualitative Simulation Model for Software Engineering
Process,” Proceedings of the Australian Software Engi-
neering Conference (ASWEC’06), 2006, pp. 10-400.
[24] J. Dougherty, R. Kohavi and M. Sahami, “Supervised and
Unsupervised Discretization of Continuous Features,”
International Conference on Machine Learning, 1995, pp.
194-202.
[25] Y. Yang and G. I. Webb, “A Comparative Study of Dis-
cretization Methods for Naive-Bayes Classifiers,” Pro-
ceedings of PKAW 2002: The 2002 Pacific Rim Knowl-
edge Acquisition Workshop, 2002, pp. 159-173.
[26] D. Milicic and C. Wohlin, “Distribution Patterns of Effort
Estimations,” 30th EUROMICRO Conference (EU-
ROMICRO'04), 2004.
[27] B. Kitchenham, E. Mendes and G. H. Travassos, “Cross
Versus Within-Company Cost Estimation Studies: A Sys-
tematic Review,” IEEE Transactions on Software Engi-
neering, Vol. 33, No. 5, May 2007, pp. 316-329.
[28] E. Alpaydin, “Introduction to Machine Learning,” MIT
Press, Cambridge, 2004.