Journal of Software Engineering and Applications, 2011, 4, 497-511
doi:10.4236/jsea.2011.48058 Published Online August 2011 (http://www.SciRP.org/journal/jsea)
Temporal Patterns of Software Evolution Defects:
A Comparative Analysis of Open Source and
Closed Source Projects
Uzma Raja, Joanne Elaine Hale, David Peter Hale
Department of Information Systems, Statistics, and Management Science, The University of Alabama, Tuscaloosa, Alabama, USA.
Email: uraja@cba.ua.edu, jhale@cba.ua.edu, dhale@cba.ua.edu
Received January 22nd, 2011, revised March 30th, 2011, accepted April 10th, 2011.
ABSTRACT
This study examines temporal patterns of software systems defects using the Autoregressive Integrated Moving Average
(ARIMA) approach. Defect reports from ten software application projects are analyzed; five of these projects are open
source and five are closed source from two software vendors. Across all sampled projects, the ARIMA time series mod-
eling technique provides accurate estimates of reported defects during software maintenance, with organizationally
dependent parameterization. In contrast to causal models that require extraction of source-code level metrics, this ap-
proach is based on readily available defect report data and is less computationally intensive. This approach can be used to
improve software maintenance and evolution resource allocation decisions and to identify outlier projects—that is, to
provide evidence of unexpected defect reporting patterns that may indicate troubled projects.
Keywords: Open Source Software, Software Defects, Software Maintenance, Time Series Analysis
1. Introduction
Today’s software systems are fragile [1], particularly
when new software releases are deployed [2]. The falli-
bility of software applications and their underlying op-
erating systems is seemingly inevitable [3]. As a result,
sixty to eighty percent of the typical firm’s total software
budget is allocated to software maintenance [4,5]. In ad-
dition, an entire business function and support industry
has grown up to handle the problems as they occur [6].
Operational planning within such organizations may take
several forms. Some organizations attempt to ramp up
and down maintenance staff and related resources (such
as test harnesses, software maintenance tools, and testing
environments) in response to task arrival rate fluctuations.
Other organizations respond by keeping resources stable
which results in oscillation between resource over-utili-
zation (and the resulting increased wait time for software
patches, decreased user satisfaction and business value)
and resource under-utilization (and the resulting resource
idling and increased cost).
Stark and Oman [7] provide alternative staffing and
release schedule strategies responding to user detected
software defect reports. Anchored at one extreme, a fixed
capacity staff can be assigned to respond to defect reports
as received, with upgrade releases occurring at fixed in-
tervals. At the other extreme, staff augmentation can be
used to provide resources as needed and upgrade release
times adjusted to aggregate related changes. Between
these extremes, additional strategies are implemented in
practice that provide for variable staffing, but fixed
schedule periods; or fixed staffing with variable lengths
of time between upgrades. To evaluate the potential fu-
ture benefit of any of these strategy alternatives requires
knowledge of the potential distribution pattern of the
reported defects.
The manager’s choice in resource planning approaches
is critical. In recent work, Chulani et al. [8] identified the
interval between reporting and fixing defects as the
dominant factor in user satisfaction; this dominance out-
strips even the number of defects. To maintain user sat-
isfaction, resources must be available to resolve defects
and promptly make the system operate as expected. This responsiveness is essential if information systems are to remain a vital component of business operations. These observa-
tions lead naturally to the operational planning question:
Is there a model to aid in predicting when resources
will be needed?
Secondarily, if such a predictive model for software
maintenance resources can be derived:
Is such a predictive model computationally and eco-
nomically practical?
These questions have yet to be adequately addressed,
as according to Pelayo and Dick [9] “no parametric
model has ever been developed that accurately forecasts
the number or occurrence of faults [defects] in a software
module.” To meet this research challenge, this study
seeks to develop an accurate predictive model of soft-
ware defect patterns that can be applied to the larger
problem of software maintenance resource allocation and
alignment, while using readily available defect report
data and computational resources.
2. Background
During peak shopping times, retailers increase their staff
of floor professionals and cashiers. When few truckloads
are expected to arrive, a distribution center manager
schedules fewer fork lift operators. Software mainte-
nance managers are faced with similar arrival rate fluc-
tuations that impact resource requirements. Software
maintenance managers must ensure product quality and
required service levels, while simultaneously minimizing
costs associated with defect resolution and penalties for
non-performance [10,11]. Faced with this challenge,
formal predictive models are not common in resource
planning; instead maintenance planning methods in prac-
tice continue to be largely ad hoc [12], with recent per-
sonal experience weighing heavily on practitioner pre-
dictions of change requests and staffing needs [10,12].
As a result, maintenance project managers too often ei-
ther overstaff (causing resources to idle and costs to in-
crease) or understaff (causing delays in defect resolution
and a decline in user satisfaction and business value).
Previous work is reviewed regarding predicting soft-
ware defects, where a defect is defined as a reported error
that is encountered in an operational software application.
Most prior research does not focus on
patterns of discovered defects once the application is in
use. Instead, a strong body of research exists that predicts
software defects during the development of new systems.
Both areas are explored and fall into three classes of
forecasting approaches: causal, learning and time-series.
2.1. Causal Models
Many researchers have constructed models to predict the
number of defects remaining in completed software
products or identify defect-dense modules within a sys-
tem. Ohlsson et al. [13] use principal components analy-
sis and classification trees to identify fault-prone com-
ponents. Khoshgoftaar and Lanning [14] used a neural
network technique to classify modules as high or low risk
for defects based on quality and complexity metrics in-
cluding the number of fault-correcting and enhancive
changes. El Emam and Laitenberger [15] used a Monte
Carlo simulation to evaluate the accuracy of a capture-
recapture re-inspection defect prediction model.
Khoshgoftaar et al. [16] constructed a nonlinear re-
gression model predicting the number of faults using
lines of code. Fenton and Neil [17] and Adams [18] dis-
covered that post-release defects were more likely in
modules where few defects were discovered pre-release
and that testing effectiveness significantly impacts the
post-release presence of defects. Krishnan and Kellner
[19] found that organizations that consistently followed
Capability Maturity Model (CMM) practices experienced
significantly fewer reported field defects in the resulting
software. Krishnan [20] found that higher levels of do-
main experience of the software team are associated with
a reduction in the number of field defects in the product.
However, there is no significant association between either
the language or the domain experience of the software
team and the costs incurred in developing the product.
Such causal predictive models of defects identify the
factors that impact software defects, thus serving both
predictive and explanatory roles regarding what factors
could be controlled to manage future defects. Such mod-
els are useful for software development teams, since they
can control these variables and manage the overall qual-
ity of the software system. Most of these models, however,
require access to internal characteristics of software.
Although available for decades for use in staffing and
system quality and defect modeling, these causal models
have not been widely used in practice because of the cost
and complexity of implementation [21]. Further compli-
cating their use during software maintenance, mainte-
nance practitioners have little control over the internal
characteristics commonly modeled to predict defects
(largely set at time of product release), thus rendering
such complex models of little use to maintenance man-
agers who want to manage and allocate budget, time and
resources for future defect occurrences.
2.2. Learning Techniques
A number of authors have investigated the use of ma-
chine-learning techniques for software defect prediction.
Some examples include neural networks [22], genetic
programming [23], fuzzy clustering [24], and decision
trees [25]. For example, Seliya et al. [26] proposed a
semi-supervised clustering method to detect failures in
software modules. Instead of working with the individual
modules of software, they group modules and label them
as fault prone or not fault prone.
Fenton and Neil [17] used Bayesian belief networks
(BBN) as an effective approach for defect prediction, an
approach that is gaining popularity [27]. Building on this
work, Menzies et al. [28] showed positive results using a
naïve Bayes classifier with log-filtered static code meas-
ures.
Challagulla et al. [29] used simulation to compare
software prediction using stepwise regression, rule in-
duction, case-based reasoning, and artificial neural net-
works. They concluded that stepwise regression per-
formed better with continuous target functions, while the
other machine learning approaches performed better for
discontinuous target functions. They favored case based
reasoning since it appeared to be the best all round pre-
dictor by a small margin. Song et al. [30] investigated the
above prediction models on real software data, compar-
ing them in terms of accuracy, explanatory value, and
configurability. They concluded that the explanatory
value of case-based reasoning and rule induction gives
them an advantage over neural nets, which have prob-
lems of configuration. Aljahdali et al. [31] compared
regression with neural nets for prediction of software
reliability and concluded that for most cases neural nets
provided fewer errors than regression models.
These adaptive, learning based predictive models have
been found to improve on the accuracy of traditional sta-
tistical linear causal models. However, they still fail to
meet the ease of implementation goal of this study, as
they require professionals with specialized model know-
ledge and sophisticated software not typically at the dis-
posal of a maintenance manager.
2.3. Time Series Models
Causal and learning models are both computationally
complex and require significant investments in project
data collection. In response to these challenges, the goal
of this study is to provide a method of predicting patterns
in software defects that is accurate without the cost and
complexity of more traditional predictive methods.
Time series models assume that events are correlated
over time and the impact of other factors is progressively
captured in historical archives [32]. The most commonly
used forecasting method, time series models are fre-
quently used to predict product demand [33], macro-eco-
nomic trends [34], and retail sales [35], but are yet to be
widely adopted in the software maintenance domain [36].
Within the domain of software maintenance, time se-
ries modeling has had limited use. Kemerer and Slaugh-
ter [37] used ARIMA modeling to predict monthly
changes, not reported defects, in software. Kenmei et al.
[38] and Raja et al. [36] created time series models for
defects in open source software (OSS) and found that the
ARIMA models outperform the accuracy of simple mod-
els. Each research team found that time series modeling
was a suitable and accurate method of defect prediction
for large-scale OSS projects. However neither of the lat-
ter studies investigated proprietary closed source soft-
ware applications.
Thus based on results in the literature, time series
analysis potentially provides a method of predicting pat-
terns in software defects that is accurate without the cost
and complexity of causal and learning models. It is left to
this study to determine whether the results found in OSS
projects can be replicated across open and closed source
software (CSS) applications.
3. Methods
This work builds on previous studies that discovered the
accuracy and ease of implementation of time series soft-
ware defect prediction. Specifically this research com-
pares the defect evolution patterns across a diverse set of
projects, providing the opportunity to study projects
within and across organizations. This section describes
the prediction model adopted in this study, the software
maintenance projects examined, the associated data ex-
tracted, and the analytical techniques used.
3.1. Time Series Analysis
As proposed in prior studies [32,36], time series analysis
offers promise in the field of software defect prediction.
These models are suited for representing situations char-
acterized by frequent variations, such as the pattern of
software defect occurrences. A time series is a collection
of observations made over equal intervals of time that can
be used to predict future values and to identify trends [39] .
A wide variety of time series modeling techniques are
available and their suitability depends upon the nature of
the data. A Moving Average (MA) model explains the present
as a mixture of random impulses, while an Autoregres-
sive (AR) model builds the present in terms of past val-
ues. These models are suitable for data that is stationary in
nature, i.e., its statistical properties (e.g., mean, variance,
autocorrelation) are constant over time.
For cases in which there is evidence of data being
non-stationary as opposed to stationary, Box and Jenkins
[40] introduced a corresponding generalized model. This
model is called Autoregressive Integrated Moving Aver-
age (ARIMA). The general form of ARIMA (p,d,q) is:
Y_t = β_0 + β_1 Y_{t−1} + … + β_p Y_{t−p} + ε_t − θ_1 ε_{t−1} − … − θ_q ε_{t−q}   (1)
where:
Y_t = the time series of the variable y;
β_i = the coefficient associated with Y_{t−i}, to be estimated using least squares;
ε_t = the error term, assumed to be independent, identically distributed, and sampled from a normal distribution with zero mean;
θ_i = the coefficient associated with ε_{t−i}, to be estimated using least squares.
As detailed in the following subsections, the ARIMA
modeling strategy followed in this study is comprised of
four steps: Identification, Estimation, Diagnostic Testing
and Application.
Model Identification: The first step in model identifi-
cation is often to apply a logarithmic transformation to
stabilize the variance of a series. Then the model is pa-
rameterized as ARIMA (p,d,q), where
p = order of the Autoregressive component.
d = order of the Differenced component.
q = order of the Moving Average component.
During model identification, the time series is ana-
lyzed to assess what values of the parameters p, d, and q
are most appropriate. The value of d is set taking into
account whether the series is stationary (d = 0) or non-
stationary (d > 0).
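To make this parameterization concrete, the two model forms that turn out to best fit the sampled projects (Section 4.2) can be written out from Equation (1) with one order of non-seasonal differencing (d = 1), i.e., with the model applied to the differenced series ΔY_t = Y_t − Y_{t−1}:

ARIMA (0,1,1): ΔY_t = β_0 + ε_t − θ_1 ε_{t−1}
ARIMA (2,1,0): ΔY_t = β_0 + β_1 ΔY_{t−1} + β_2 ΔY_{t−2} + ε_t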
Estimation: The original or transformed time series is
then modeled using the parameters identified in the previous step to estimate the coefficients β and θ in Equation
(1). The different candidate values of p and q are used to
compute the respective coefficients. The final model is
selected using goodness of fit tests. Where goodness of
fit is equivalent, the most parsimonious model is selected.
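As an illustration of this estimation step, the sketch below (in Python, using the statsmodels library) fits candidate ARIMA orders to a monthly defect-count series and keeps the best fitting one. The paper does not name the goodness-of-fit statistics it used, so the use of AIC here, like the function and variable names, is an assumption.

import itertools
from statsmodels.tsa.arima.model import ARIMA

def best_arima(series, d, max_p=2, max_q=2):
    """Fit candidate ARIMA(p, d, q) models to 'series' and return the best fit.

    'series' is assumed to be an equally spaced monthly defect-count series,
    already log-transformed as described in the text. Candidates are tried in
    order of increasing p + q, so ties on the fit criterion resolve in favor
    of the most parsimonious model.
    """
    candidates = sorted(itertools.product(range(max_p + 1), range(max_q + 1)),
                        key=lambda pq: pq[0] + pq[1])
    best = None
    for p, q in candidates:
        fit = ARIMA(series, order=(p, d, q)).fit()
        if best is None or fit.aic < best.aic:   # AIC is an assumed criterion
            best = fit
    return best

# Example usage (assuming 'monthly_log' holds the transformed series):
# result = best_arima(monthly_log, d=1)
# print(result.summary())   # coefficients, standard errors, t and p values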
Diagnostic Testing: The residuals are computed as the
difference of the actual and predicted values (using the
identified model). These residuals are then analyzed using
known techniques to determine the adequacy of the
model. The residuals of a good model are expected to be
small and random.
Model Application: The predictive model accuracy on
unseen data is estimated using a hold-out data sample
[41]. Using this approach, a subset of the time series data
is withheld from use in parameter estimation, and is in-
stead used to test the model’s accuracy.
3.2. Site Selection and Data Description
To study patterns in defect arrival rates, projects from a
diverse set of organizations, problem domains, teams,
and development methodologies were selected. The
closed source software data was acquired from two or-
ganizations. Organization A is a large diversified interna-
tional software consulting firm, with a mature method-
ology environment; all Organization A development
groups are currently assessed at Capability Maturity
Model Integration (CMMI) Level 3 or higher. Data for
three Organization A projects was obtained, denoted in
this study as Project A1, Project A2, and Project A3.
Organization B is a small (30 employee) privately
owned provider of financial transaction automation soft-
ware using agile methodologies. Data for two Organiza-
tion B projects was obtained, denoted in this study as
Project B1 and Project B2.
In addition to the five CSS projects (three from Or-
ganization A and two from Organization B) five projects
not included in the Raja et al. [36] study were randomly se-
lected from the list of the top twenty most active OSS
projects within the SourceForge repository. Inclusion of
these five projects provides process replication and extends the OSS sample coverage evaluated by Raja et al. [36] by more than 50%. Descrip-
tions of the OSS and CSS projects included in this study
are presented in Table 1.
Each of the ten studied projects has one or more arti-
fact repositories that store information regarding various
artifact types, e.g., defects, patches, and feature requests.
Defects of an individual project can be extracted using
the unique defect repository identifier, available in each
artifact. The defect data also includes the time of defect
submission. The data is then aggregated to compute
monthly defects for each project. Table 1 shows the start
date, number of months of available data and the total
number of defects for each of the sampled projects.
3.3. Variable Specification and Data Extraction
Time series modeling requires that data are gathered
across equally spaced time intervals. Consistent with the
commonly used resource planning interval, a monthly
count of software defects was computed for each project.
The model accuracy is sensitive to the length of historical
data available. Therefore, projects with a minimum of 50
months of data available were used in the analysis. This
also ensures that there is enough data available for hold-
out sampling and testing of the accuracy of the model
forecasts.
For OSS projects, the SourceForge.net defect-tracking
repository holds archives of defect reports for the pro-
jects hosted by that community. Organization A and B
host their own internal defect tracking repositories for
trouble resolution. In all three environments, the data
dictionary of the repository was used to identify the arti-
fact repository of defects. SQL queries were used to ex-
tract individual project defect data from the hosting ar-
chive warehouse. Further queries were used to compute
monthly statistics of the defects for each project indi-
vidually. The monthly counts of defects were computed
using the time stamps of each defect report. The resulting
dataset contained the monthly defects for all OSS and
CSS projects.
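As a minimal illustration of this aggregation step (in Python with pandas), the sketch below assumes the extracted defect reports have been exported with a project identifier and a submission timestamp; the file and column names are hypothetical, since the actual repository schemas and queries are not reproduced in this paper.

import pandas as pd

# Hypothetical export of the SQL extraction step: one row per defect report,
# with a project identifier and the submission timestamp (names are assumptions).
reports = pd.read_csv("defect_reports.csv", parse_dates=["submitted_at"])

def monthly_defects(project_id):
    """Equally spaced monthly defect counts for one project; months with no
    reports are filled with zero so the series has no gaps."""
    proj = reports.loc[reports["project_id"] == project_id]
    return proj.set_index("submitted_at").resample("MS").size().rename("defects")

series = monthly_defects("wxWidgets")
# This study only analyzes projects with at least 50 months of history,
# leaving room for a hold-out sample (see Section 4.4).
assert len(series) >= 50
print(series.tail())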
4. Analysis and Results
4.1. Model Identification
The first step in model identification is to stabilize the
Table 1. Sample description.

Open Source Projects
wxWidgets (90 months, 4843 total defects): wxWidgets is a free C++ framework that facilitates cross-platform software development, including GUIs, threads, sockets, database, file system access, etc.
Firewall Builder (93 months, 1067 total defects): Object-oriented GUI and set of compilers for various firewall platforms. Currently implemented compilers cover iptables, ipfilter, OpenBSD pf, ipfw, and Cisco PIX firewall and router access lists.
Netatalk (69 months, 347 total defects): Netatalk is a freely available open source AFP fileserver. It also provides a kernel-level implementation of the AppleTalk Protocol Suite. A *NIX/*BSD system running Netatalk is capable of serving many Macintosh clients simultaneously as an AppleShare file server (AFP), AppleTalk router, and *NIX/*BSD print server, and of accessing AppleTalk printers.
PhpWiki (99 months, 627 total defects): PhpWiki is a WikiWikiWeb clone in PHP. A WikiWikiWeb is a site where anyone can edit the pages through an HTML form. Features include multiple storage backends, dynamic hyperlinking, themes, plugin scripting, full authentication, and ACLs.
Exult (102 months, 1675 total defects): Exult is a game engine for running Ultima 7 on modern operating systems, plus a map editor and other tools for creating your own mods and games.

Closed Source Projects
Org A#1 (58 months, 3539 total defects): A1 is an n-tier web-enabled wholesale billing application using J2EE and interfacing with an Oracle database.
Org A#2 (60 months, 1214 total defects): A2 is an object-oriented service rating, pricing, and discounting application using J2EE and interfacing with an Oracle database.
Org A#3 (58 months, 377 total defects): A3 is a performance management system providing KPI dashboards and analytics for monitoring and forecasting, built using a SOA and interfacing with most industry-standard databases.
Org B#1 (54 months, 1842 total defects): B1 is a payment processing application built on the Microsoft platform that includes check scanning, image and data archival, courtesy amount recognition, and legal amount recognition.
Org B#2 (54 months, 582 total defects): B2 is a merchant capture application that allows for the remote digital capture of check and payment data at the point of presentment and the bundled transmission for deposit into multiple accounts.
means and variances by applying a logarithmic transfor-
mation to the time series data. The next step is to plot the autocorrelation factors (ACF), i.e., the correlation of the series with itself at specific lags, and the partial autocorrelation factors (PACF). For each studied
project the values of p (the autoregressive component)
and q (the moving average component) are determined
by examining the trends in the ACF and PACF plots. If
ACF plots die out (i.e., disappear gradually) and PACF
plots cut-off (i.e., disappear abruptly), this suggests that
an autoregressive model is suitable (p > 0, q = 0). If the
opposite is true, i.e. the ACF plots cut-off and PACF
plots die-out, a Moving Average model is suitable (p = 0,
q > 0). If both ACF and PACF die out, then the most
appropriate model contains both a p and q parameter (i.e.
a mixed model is called for). The ACF and the PACF
plots are shown in Figures 1-3. The differencing term is
obtained by examining if the series is stationary or not. In
most software evolution studies, a simple differencing
(i.e., d = 1) transforms the data to a near-linear series
[42].
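A sketch of this identification step is given below, continuing the hypothetical monthly series built in Section 3.3. The augmented Dickey-Fuller test used to decide on differencing is one common stationarity check and is an assumption here, as is the log1p transform (a logarithm shifted by one so that zero-defect months remain defined); the paper itself relies on visual inspection and a plain logarithmic transformation.

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller

# Stabilize the variance with a logarithmic transformation.
log_series = np.log1p(series.astype(float))

# Choose d: difference once if the (log) series does not look stationary.
p_value = adfuller(log_series)[1]          # ADF test; an assumed stationarity check
d = 0 if p_value < 0.05 else 1
work = log_series.diff().dropna() if d else log_series

# A dying-out ACF with a cutting-off PACF suggests an AR model (p > 0, q = 0);
# the opposite pattern suggests an MA model (p = 0, q > 0); both dying out
# suggests a mixed model, as described in the text.
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
plot_acf(work, lags=24, ax=axes[0])
plot_pacf(work, lags=24, ax=axes[1])
plt.show()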
4.2. Model Estimation
For all five of the studied OSS projects the best fitting
model was ARIMA (0,1,1). Though the OSS project set
used in this study did not overlap with the project set
used by Raja et al. [36], the best fitting model is consis-
tent with their findings. It can be seen that for each OSS
project the p value of the t-statistic is significant for MA1.
The plots of the residual ACF and PACF indicate that the
model provides a suitable fit and there are no significant
correlations in the residuals. The final estimates of the
model parameters are shown in Table 2 and the ACF and
PACF plots of the residuals are shown in Figure 4.
The best model for all the three projects in Organiza-
tion A was ARIMA (2,1,0). Several competing models
were evaluated, but based on fit statistics and the residual
analysis the same autoregressive model was the best fit
for all three projects. The final estimates of the model pa-
rameters for each of the sampled Organization A projects
Figure 1. ACF and PACF plots of the original time series for OSS projects.
Figure 2. ACF and PACF plots of the original time series for Organization A projects.
Figure 3. ACF and PACF plots of the original time series for Organization B projects.
Table 2. Best fitting model specifications for each project.

Project     Type       Coefficient   SE Coefficient   t       p
OSS #1      MA1        0.5761        0.0873           6.6     0
OSS #1      Constant   0.17          0.6827           0.25    0.804
OSS #2      MA1        0.5606        0.0872           6.43    0
OSS #2      Constant   –0.1085       0.3075           –0.35   0.725
OSS #3      MA1        0.5693        0.1072           5.31    0
OSS #3      Constant   –0.0854       0.178            –0.48   0.633
OSS #4      MA1        0.6991        0.0734           9.53    0
OSS #4      Constant   0.0005        0.0242           0.02    0.984
OSS #5      MA1        0.6827        0.0743           9.19    0
OSS #5      Constant   –0.0138       0.0266           –0.52   0.604
Org A #1    AR1        –0.5665       0.1245           –4.55   0
Org A #1    AR2        –0.5013       0.1338           –3.75   0
Org A #1    Constant   1.219         3.759            0.32    0.747
Org A #2    AR1        –0.3478       0.1263           –2.76   0.008
Org A #2    AR2        –0.4101       0.1275           –3.25   0.002
Org A #2    Constant   0.376         1.74             0.22    0.83
Org A #3    AR1        –0.8854       0.1089           –8.13   0
Org A #3    AR2        –0.6345       0.1089           –5.83   0
Org A #3    Constant   0.602         1.89             0.32    0.751
Org B #1    MA1        0.4163        0.128            3.25    0.002
Org B #1    Constant   –0.05461      0.05364          –1.02   0.313
Org B #2    MA1        0.3088        0.151            2.05    0.046
Org B #2    Constant   0.02648       0.07185          0.37    0.714
are shown in Table 2. The ACF and PACF plots of the
residuals are shown in Figure 5.
The best model for both of the projects from Organization B
was an ARIMA (0,1,1). Several competing models were
evaluated, but based on fit statistics and the residual
analysis the same Moving Average model was the best fit
for both the projects. The final specifications of the
model parameters for each of the sampled Organization
B projects are shown in Table 2 and the ACF and PACF
plots for Organization B are shown in Figure 6.
4.3. Diagnostic Testing
After estimating the series for all the sample projects,
they are individually tested against the competing models.
The best model is selected using the t statistics and
goodness of fit tests. The residuals are also analyzed to
ensure that autocorrelation has been removed. We used
the Ljung-Box [43] test for the residual analysis. The null
hypothesis for this test is that ACFs for lag 1 through m
are all 0. If H0 is rejected, it implies that there is signifi-
cant autocorrelation in the residuals. Failure to reject the
null hypothesis means that the correlation in the residuals
is insignificant.
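A minimal sketch of this residual check (again using statsmodels) is shown below, assuming a fitted ARIMA results object such as the one returned by the estimation sketch in Section 3.1, and using the same lags reported in Table 3.

from statsmodels.stats.diagnostic import acorr_ljungbox

# 'result' is a fitted statsmodels ARIMA results object (see the earlier sketch).
lb = acorr_ljungbox(result.resid, lags=[12, 24, 36, 48], return_df=True)
print(lb)   # one row per lag, with columns lb_stat and lb_pvalue

# Large p-values fail to reject H0 (no significant residual autocorrelation),
# i.e., the fitted model has adequately captured the structure of the series.
# model_df=p+q could be passed to adjust the degrees of freedom if desired.
print("Residuals look uncorrelated:", bool((lb["lb_pvalue"] > 0.05).all()))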
The results of the diagnostic testing for all the projects
are shown in Table 3. Across all 10 projects, diagnostic
results show that the selected model fully captures the
behavior of the series and there are no significant missing
elements in the model.
4.4. Model Application
Because the ultimate goal of the research is to develop
models that can be useful for forecasting future defects,
the accuracy of model predictions on unseen data is a
critical factor. We therefore used the hold-out cross-
validation technique for comparing model predictions
[41]. In this method, some data is withheld and not used
during parameter estimation. The selected model is then
used to generate a forecast, which is compared to the
withheld (actual) values.
We used a holdout sample of 4 months of data for each project. This length was selected in view of the amount of data available for all projects. Results indicate
the best-fit models identified in the Model Estimation
section were all stable over the sample sets’ holdout se-
ries for each of the 10 studied projects. Across all 10
sampled projects, the mean square error (MSE), mean
absolute percentage error (MAPE), and mean absolute
deviation (MAD) for the previously identified best-fit
models (ARIMA (0,1,1) for OSS and Organization B;
ARIMA (2,1,0) for Organization A) are all lower than those of competing models.
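The following sketch reproduces this hold-out evaluation for one project, continuing the earlier hypothetical series; the ARIMA (0,1,1) order shown is the best fit reported for the OSS and Organization B projects (use (2,1,0) for Organization A), and the error measures are computed on the transformed scale, which may differ from how the figures in Table 4 were derived.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

HOLD_OUT = 4   # months withheld from estimation, as in this study

train, test = log_series[:-HOLD_OUT], log_series[-HOLD_OUT:]

fit = ARIMA(train, order=(0, 1, 1)).fit()       # best fit for OSS and Org B projects
forecast = np.asarray(fit.forecast(steps=HOLD_OUT))
actual = np.asarray(test)

errors = actual - forecast
mse = np.mean(errors ** 2)                      # mean square error
mad = np.mean(np.abs(errors))                   # mean absolute deviation
mape = np.mean(np.abs(errors / actual))         # mean absolute percentage error
                                                # (undefined for zero-valued months)
print(f"MSE={mse:.3f}  MAPE={mape:.3f}  MAD={mad:.3f}")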
5. Discussions
The purpose of this study is to determine whether a time
Figure 4. ACF and PACF plots of the residuals for OSS projects.
Figure 5. ACF and PACF plots of the residuals for Organization A projects.
Figure 6. ACF and PACF plots of the residuals for Organization B projects.
Table 3. Ljung-Box fit statistics for sampled projects.

Project     Lag 12          Lag 24          Lag 36           Lag 48
OSS #1      5.4 (0.866)     16.9 (0.768)    26.8 (0.807)     32.1 (0.94)
OSS #2      9.2 (0.514)     25.4 (0.276)    32.9 (0.522)     39.7 (0.73)
OSS #3      14.3 (0.162)    36 (0.03)       32.94 (0.061)    54.1 (0.193)
OSS #4      9.1 (0.26)      19 (0.648)      25.2 (0.864)     33.9 (0.908)
OSS #5      12.4 (0.26)     21.8 (0.475)    26.4 (0.82)      41.6 (0.656)
Org A #1    7.9 (0.545)     16.6 (0.736)    26.1 (0.799)     33.5 (0.895)
Org A #2    9.8 (0.369)     21.3 (0.44)     29 (0.668)       37.9 (0.764)
Org A #3    11.7 (0.228)    21.1 (0.454)    30.3 (0.601)     39.5 (0.703)
Org B #1    6.9 (0.734)     14.1 (0.897)    26.2 (0.827)     34.7 (0.888)
Org B #2    13.6 (0.191)    21.5 (0.491)    34.6 (0.437)     44.8 (0.521)

Note: Ljung-Box chi-square statistic at each lag, with p values in parentheses.
series approach (which requires no data collection in-
vestment beyond what normally resides in most defect
tracking databases) could be used to accurately predict
patterns in software defects discovered during software
maintenance, the extent to which this approach holds
across a diverse set of projects, organizations, and main-
tenance teams, and whether variations in model parame-
ters can be identified a priori. The evidence from ten
projects is shown in Table 4, and supports the study's goals.
Across all ten projects examined in this study, reported
defects are accurately predicted using a form of the
ARIMA model (as evidenced by measures including
MSE, MAPE, MAD, and Ljung-Box).
Five of the ten projects are independently developed,
maintained, and managed open source projects. Across
all five OSS projects, the ARIMA (0,1,1) model—a first
order moving average with one order of non-seasonal
differencing—accurately predicts the number of monthly
reported defects.
Two of the ten projects are developed and maintained
by a small (30 people, 4 developers) privately held soft-
ware firm using agile methods in a single geographic
location. In this environment, the same ARIMA (0,1,1)
model—best fitting for all five open source projects—
was found to perform best. Because the same team de-
veloped and maintained both products, it is not possible
to explore cross-team differences within this organiza-
tion.
Three of the ten projects are developed by different
teams within a large international software firm using a
mature waterfall-based methodology. Two of these three
projects are maintained by (different) joint North Ameri-
can-Indian teams, and the third by a solely North Ameri-
can team dispersed across two offices in the Southeast.
In this organizational environment, more accurate results
are obtained using a competing second order auto-re-
gressive model with a constant and one order of non-
seasonal differencing i.e., ARIMA (2,1,0). This model
held for all three projects, regardless of team size or geo-
graphic scope.
These results demonstrate promise for the use of the
ARIMA model to predict software defect patterns during
maintenance. This model held across a diverse set of
organizations, teams, geographic collaboration models,
and development approaches. Comparison of the model
fit results across project and team demographics indicates
that parameters may be dependent on factors related to
organization or development approach.
5.1. Implications for Research
Within the maintenance stage, this research responds to
the challenge by Pelayo and Dick [9] for the develop-
ment of models that accurately forecast the occurrence of
defects in software. Results obtained across multiple
Table 4. Results summary.
Project Best Fitting Model MSE MAPE MAD
OSS #1 ARIMA (0,1,1) 0.099 0.049 0.049
OSS #2 ARIMA (0,1,1) 1.435 2.163 0.717
OSS #3 ARIMA (0,1,1) 0.007 0.028 0.003
OSS #4 ARIMA (0,1,1) 0.362 0.801 0.181
OSS #5 ARIMA (0,1,1) 0.636 0.798 0.318
Org A #1 ARIMA (2,1,0) 0.173 0.038 0.087
Org A #2 ARIMA (2,1,0) 0.212 0.125 0.106
Org A #3 ARIMA (2,1,0) 1.595 0.917 0.797
Org B #1 ARIMA (0,1,1) 1.015 0.630 0.507
Org B #2 ARIMA (0,1,1) 1.624 0.813 0.812
teams, organizations, and development environments
confirm that the ARIMA modeling approach accurately
predicts the pattern of software defects reported during
maintenance.
This project addresses several important research
questions and raises another: what organizational factors
impact the form of the defect reporting time series? One
form of the ARIMA model held for all OSS projects and
all projects developed and maintained by a small pri-
vately held software firm with a self-described "informal,
geek" culture located in a single office and using agile
methods. In contrast, another ARIMA model form held
for all projects developed and maintained by a large in-
ternational software firm characterized by a formal, hier-
archical culture and using a mature waterfall-based
methodology to structure the efforts of globally distrib-
uted teams. These findings suggest that the significant
factors focus on development approach and organiza-
tional culture rather than team distribution. Future re-
search is needed to further explore this idea.
Exploring the differences in patterns of reported de-
fects from a cultural perspective will build on the work
of Gregory [44] who discovered that Silicon Valley
software developers shared the same occupational sub-
culture, regardless of firm or role. Several researchers
have labeled the OSS community a hacker culture that
values and rewards pushing the boundaries of what is
considered doable [45]. This work will add to the under-
standing of the linkages, commonalities and similarities
between the CSS and OSS subcultures.
Exploring the differences in patterns of reported de-
fects from a development approach perspective will al-
low researchers to integrate the strengths and reduce the
weaknesses of the CSS and OSS development processes.
Crowston and Scozzi [46] characterize free and open
source projects as predominately self-organizing and self-
assigning, often without the formality of appointed lead-
ers or specified roles. This characterization may play an
important role in setting the pattern of software defect
reports and aid in building a more unified model of CSS
and OSS defect management.
5.2. Implications for Practice
In contrast to learning or causal predictive approaches
that require complex models that are difficult to implement (in-
cluding for example the extraction of source-code level
metrics), the ARIMA time series modeling technique
provides a computationally tractable approach that can be
used by practitioners. Commonly available statistical
packages such as Minitab™, SPSS™, and RATS™ pro-
vide this functionality, which can be implemented with
readily available professional training. In addition the
evidence from the analysis of the sampled projects indi-
cates that the resulting pattern is stable once established
as well as consistent across projects within a particular
organization. Thus, once the pattern is established from
existing defect data, project managers can begin to use
the organizationally-specific model to build staffing and
resource estimates for upcoming planning periods. Main-
tenance staff assignments, testing tool licenses and test-
ing technology environments can be adjusted to be in
alignment with predicted workloads, ensuring that ser-
vice level agreements are met and organizational re-
sources are not idle during slack demand periods.
Based on the robust nature of the results thus far, any
project that does not fit the temporal defect reporting
pattern of the other projects in the organization is a can-
didate for outlier analysis. A pattern shift may be caused
by a number of factors—from changes in user adoption
rates, to changes in business tasks supported, to changes
in the software development, evolution and maintenance
processes. In any event, a shift in defect reporting pattern
is an indicator that can trigger a root cause inquiry.
5.3. Threats to Validity
We discuss four types of threats to validity: Construct,
content, internal and external [47]. Construct validity
applies to the relationship between theory and observa-
tion and addresses the question: Do the measures quan-
tify what they are expected to? In this study, the major
threat to construct validity is the fact that the project
software defects are not classified by criticality level.
Content validity refers to the sampling adequacy of the
measurement instrument [48]. In this study, the sole
source of software defects gathered is the centrally
maintained defect tracking repository. To the extent that
other sources of reported defects exist (such as message
boards, emails, and direct communication with develop-
ers), the study’s content validity is threatened.
Internal validity is related to the extent to which infer-
ences can be made regarding cause and effect relation-
ships. As is the case with any univariate time series
model where only one variable is considered, this study
is limited in this regard. The impact of other causal vari-
ables is not included in the model.
External validity deals with the generalizability of the
study. Since there has been previous research on tempo-
ral patterns of software maintenance in OSS projects and
this study confirms the uniformity of the previously dis-
covered patterns, external validity is less a threat in that
domain. For CSS projects, because of the small conven-
ience sample size and limited range of organizations,
teams, and development environments, the generalizabil-
ity of the discoveries of this study is not certain. Analysis
of additional CSS projects and the associated defect pat-
terns will help establish generalizability.
There are other additional threats to validity as well.
The mechanism for defect reporting is homogeneous within organizations and within the SourceForge defect
repository. However, across these sets, the defect report-
ing mechanism is not uniform. Issues associated with the
process of defect reporting are not considered in this
study.
Future research can reduce the threats to validity.
Studies that include additional causal variables to control
for organizational processes and contract management
will increase the robustness of the model. Replication of
this study using other CSS and OSS projects and organi-
zations will be used to establish the external validity.
6. Conclusions
The introduction section posed two questions important
to software maintenance resource management:
Is there a model to aid in predicting when software
maintenance resources will be needed? If so, is it com-
putationally and economically practical?
This study points to an affirmative answer to both
questions.
In answer to the first question, across all ten projects
evaluated in this study the ARIMA time series modeling
technique was found to provide accurate estimates of
reported defects during software maintenance. ARIMA
model parameters were found to be organizationally de-
pendent. Future research will explore the proposition that
predictive model parameters are dependent on the organ-
izational factors of methodology formalism and organ-
izational culture.
In answer to the second question, the data and compu-
tational needs of the ARIMA models are compared to
alternative prediction techniques. Causal models require
the consistent ongoing extraction and analysis of source-
code level metrics. In addition to this shortcoming,
learning models require relatively sophisticated statistical
and computational expertise and software tools. In con-
trast to these approaches, the ARIMA time series method
is based on readily available defect report data and is less
computationally intensive. Thus, by employing the
ARIMA modeling technique to predict the arrival rates
of the inevitable software defects, maintenance managers
can begin to align their staff and technical resources to
balance the competing demands of cost minimization and
meeting service level expectations.
REFERENCES
[1] P. Van Roy, “Self Management and the Future of Soft-
ware Design,” Electronic Notes in Theoretical Computer
Science, Vol. 182, No. 29, 2007, pp. 201-217.
doi:10.1016/j.entcs.2006.12.043
[2] E. Figueiredo, N. Cacho, C. Sant’Anna, M. Monteiro, U.
Kulesza, A. Garcia, S. Soares, F. Ferrari, S. Khan, F.
Castor Filho and F. Dantas, “Evolving Software Prod-
uct Lines with Aspects: An Empirical Study on Design
Stability,” Proceedings of the 30th international Confer-
ence on Software Engineering, Leipzig, 2008, pp. 261-
270.
[3] L. Wallace and M. Keil, “Software Project Risks and
Their Effect on Outcomes,” Communications of the ACM,
Vol. 47, No. 4, 2004, pp. 68-73.
doi:10.1145/975817.975819
[4] B. P. Lientz, E. B. Swanson and G. E. Tompkins, “Char-
acteristics of Application Software Maintenance,” Com-
munications of the ACM, Vol. 21, No. 6, 1978, pp. 466-
471. doi:10.1145/359511.359522
[5] T. M. Pigoski, “Practical Software Maintenance: Best
Practices for Managing Your Software Investment,” John
Wiley & Sons, Inc., Hoboken, 1996.
[6] ISO/IEC14764, “Software Engineering—Software Life
Cycle Processes—Maintenance,” International Organiza-
tion for Standardization, 2006.
[7] G. Stark and P. Oman, “Software Maintenance Strategies:
Observations from the Field,” Journal of Software Main-
tenance, Vol. 9, No. 6, 1997, pp. 365-378.
doi:10.1002/(SICI)1096-908X(199711/12)9:6<365::AID-
SMR160>3.3.CO;2-A
[8] S. Chulani, B. Ray, P. Santhanam and R. Leszkowicz,
“Metrics for Managing Customer View of Software Qual-
ity,” Proceedings. Ninth International Software Metrics
Symposium, Sydney, 2003, pp. 189-198.
[9] L. Pelayo and S. Dick, “Applying Novel Resampling
Strategies to Software Defect Prediction,” Proceedings of
the Meeting of the North American Fuzzy Information
Processing Society, NAFIPS, San Diego, 2007, pp. 69-72.
[10] J. Asundi and S. Sarkar, “Staffing Software Maintenance
and Support Projects,” Proceedings of the 38th Annual
Hawaii International Conference on System Sciences,
HICSS’05, Big Island, 2005, p. 314b.
[11] T. Chan, “Beyond Productivity in Software Maintenance:
Factors Affecting Lead Time in Servicing Users’ Re-
quests,” Proceedings of the International Conference on
Software Maintenance, San Jose, 2000, pp. 228-237.
[12] G. Antoniol, M. Di Penta and E. Merlo, “An Automatic
Approach to Identify Class Evolution Discontinuities,”
Proceedings of the 7th International Workshop on Prin-
ciples of Software Evolution, Kyoto, 2004, pp. 31-40.
doi:10.1109/IWPSE.2004.1334766
[13] M. C. Ohlsson, A. Amschler Andrews and C. Wohlin,
“Modelling Fault-Proneness Statistically over a Sequence
of Releases: A Case Study,” Journal of Software Main-
tenance, Vol. 13, No. 3, 2001, pp. 167-199.
doi:10.1002/smr.229
[14] T. M. Khoshgoftaar and D. L. Lanning, “A Neural Net-
work Approach for Early Detection of Program Modules
Having High Risk in the Maintenance Phase,” Journal of
Systems and Software, Vol. 29, No. 2, 1995, pp. 85-91.
doi:10.1016/0164-1212(94)00130-F
[15] K. El-Emam and O. Laitenberger, “Evaluating Capture-
Recapture Models with Two Inspectors,” IEEE Transac-
tions on Software Engineering, Vol. 27, No. 9, 2001, pp.
851-864. doi:10.1109/32.950319
[16] T. M. Khoshgoftaar, B. Bhattacharyya and G. Richardson,
“Predicting Software Errors, During Development, Using
Nonlinear Regression Models: A Comparative Study,”
IEEE Transactions on Reliability, Vol. 41, No. 3, 1992,
pp. 390-395. doi:10.1109/24.159804
[17] N. E. Fenton and M. Neil, “A Critique of Software Defect
Prediction Models,” IEEE Transactions on Software En-
gineering, Vol. 25, No. 5, 1999, pp. 675-689.
doi:10.1109/32.815326
[18] E. N. Adams, “Optimizing Preventive Service of Soft-
ware Products,” IBM Journal of Research and Develop-
ment, Vol. 28, No. 1, 1984, pp. 2-14.
doi:10.1147/rd.281.0002
[19] M. S. Krishnan and M. I. Kellner, “Measuring Process
Consistency: Implications for Reducing Software De-
fects,” IEEE Transactions on Software Engineering, Vol.
25, No. 6, 1999, pp. 800-815. doi:10.1109/32.824401
[20] M. S. Krishnan, “The Role of Team Factors in Software
Cost and Quality: An Empirical Analysis,” Information
Technology & People, Vol. 11, No. 1, 1998, pp. 20-35.
doi:10.1108/09593849810204512
[21] M. Hecht and J. Handal, “A Discrete-Event Simulator for
Predicting Outage Time and Costs as a Function of Main-
tenance Resources,” Reliability and Maintainability Sym-
posium Proceedings. Annual, Seattle, 2002, pp. 612-617.
[22] M. M. T. Thwin and T.-S. Quah, “Application of Neural
Networks for Software Quality Prediction Using Object-
Oriented Metrics,” Journal System and Software, Vol. 76,
No. 2, 2005, pp. 147-156. doi:10.1016/j.jss.2004.05.001
[23] A. Tsakonas and G. Dounias, “Predicting Defects in
Software Using Grammar-Guided Genetic Program-
ming,” Proceedings of the 5th Hellenic conference on Ar-
tificial Intelligence, SETN, Syros, 2008, pp. 413-418.
[24] S. Dick and A. Sadia, “Fuzzy Clustering of Open-Source
Software Quality Data: A Case Study of Mozilla,”
IJCNN ’06. International Joint Conference on Neural
Networks, Vancouver, 2006, pp. 4089-4096.
doi:10.1109/IJCNN.2006.246954
[25] T. M. Khoshgoftaar, E. B. Allen, W. D. Jones and J. I.
Hudepohl, “Classification Tree Models of Software Qual-
ity over Multiple Releases,” Proceedings. 10th Interna-
tional Symposium in Software Reliability Engineering,
IEEE Reliability Society, 1999, pp. 116-125.
[26] N. Seliya, T. M. Khoshgoftaar and S. Zhong, “Analyzing
Software Quality with Limited Fault-Proneness Defect
Data,” Ninth IEEE International Symposium on High-
Assurance Systems Engineering, HASE, Heidelberg, 2005,
pp. 89-98.
[27] H. Zhang, S. Jarzabek and B. Yang, “Quality Prediction
and Assessment for Product Lines,” Proceedings of the
15th International Conference on Advanced Information
Systems Engineering, Klagenfurt, 2003, pp. 681-695.
[28] T. Menzies, J. Greenwald and A. Frank, “Data Mining
Static Code Attributes to Learn Defect Predictors,” IEEE
Transactions on Software Engineering, Vol. 32, No. 1,
2007, pp. 2-13. doi:10.1109/TSE.2007.256941
[29] V. U. B. Challagulla, F. B. Bastani, I.-L. Yen and R. A.
Paul, “Empirical Assessment of Machine Learning Based
Software Defect Prediction Techniques,” IEEE Int’l
Workshop on Object-oriented Real-time Dependable Sys-
tems (WORDS), Sedona, 2005, pp. 282-303.
[30] Q. Song, M. Shepperd, M. Cartwright and C. Mair,
“Software Defect Association Mining and Defect Correc-
tion Effort Prediction,” IEEE Transactions on Software
Engineering, Vol. 32, No. 2, 2006, pp. 69-82.
doi:10.1109/TSE.2006.1599417
[31] S. H. Aljahdali, A. Sheta and D. Rine, “Prediction of
Software Reliability: A Comparison between Regression
and Neural Network Non-Parametric Models,” ACS/IEEE
International Conference on Computer Systems and Ap-
plications, Beirut, 2001, pp. 470-473.
doi:10.1109/AICCSA.2001.934046
[32] W. S. Humphrey and N. D. Singpurwalla, “Predicting
(Individual) Software Productivity,” IEEE Transactions
on Software Engineering, Vol. 17, No. 2, 1991, pp. 196-
207. doi:10.1109/32.67600
[33] A. Mital, A. Desai, A. Subramanian and A. Mital, “Prod-
uct Development: A Structured Approach to Consumer
Product Development, Design, and Manufacture,” But-
terworth-Heinemann, 2008.
[34] P. H. Franses, “Periodicity and Stochastic Trends in Eco-
nomic Time Series,” Oxford University Press, Oxford,
1996.
[35] E. Ghysels and D. R. Osborn, “The Econometric Analysis
of Seasonal Time Series,” Cambridge University Press,
Cambridge, 2001.
[36] U. Raja, D. P. Hale and J. E. Hale, “Modeling Software
Evolution Defects: A Time Series Approach,” Journal of
Software Maintenance and Evolution: Research and
Practice, Vol. 21, No. 6, 2009, pp. 49-71.
doi:10.1002/smr.398
[37] C. F. Kemerer and S. A. Slaughter, “A Longitudinal
Analysis of Software Maintenance Patterns,” Proceedings of the Eighteenth International Conference on Information Systems (ICIS), Atlanta, 1997, pp. 476-477.
[38] B. Kenmei, G. Antoniol and M. Di Penta, “Trend Analy-
sis and Issue Prediction in Large-Scale Open Source Sys-
tems,” 12th European Conference on Software Mainte-
nance and Reengineering (CSMR), 2008, pp. 73-82.
[39] C. Chatfield, “The Analysis of Time Series: An Introduc-
tion (6th Edition),” Chapman and Hall/CRC, London,
2003.
[40] G. Box and G. Jenkins, “Time Series Analysis: Forecast-
ing and Control,” Holden-Day, San Francisco, 1970.
[41] R. R. Picard and R. D. Cook, “Cross-Validation of Re-
gression Models,” Journal of the American Statistical
Association, Vol. 79, No. 387, 1984, pp. 575-583.
doi:10.2307/2288403
[42] I. Herraiz, J. M. Gonzalez-Barahona, G. Robles and D. M.
German, “On the Prediction of the Evolution of Libre
Software Projects,” IEEE International Conference on
Software Maintenance, (ICSM 2007), Paris, 2007, pp.
405-414. doi:10.1109/ICSM.2007.4362653
[43] G. M. Ljung and G. E. P. Box, “On a Measure of Lack of
Fit in Time Series Models,” Biometrika, Vol. 65, No.
2, 1978, pp. 297-303. doi:10.1093/biomet/65.2.297
[44] K. Gregory, “Native-View Paradigms: Multiple Cultures
and Culture Conflicts in Organizations,” Administrative
Science Quarterly, Vol. 28, No. 3, 1983, pp. 359-376.
doi:10.2307/2392247
[45] D. Thomas, “Hacker Culture,” University of Minnesota
Press, Minneapolis, 2003.
[46] K. Crowston and B. Scozzi, “Open Source Software Pro-
jects as Virtual Organisations: Competency Rallying for
Software Development,” IEE Proceedings Software, Vol.
149, 2002, pp. 3-17. doi:10.1049/ip-sen:20020197
[47] T. T. Dinh-Trong and J. M. Bieman, “The FreeBSD Pro-
ject: A Replication Case Study of Open Source Develop-
ment,” IEEE Transactions on Software Engineering, Vol.
31, No. 6, 2005, pp. 481-494. doi:10.1109/TSE.2005.73
[48] F. N. Kerlinger and H. B. Lee, “Foundations of Behav-
ioral Research,” 4th Edition, Wadsworth, New York,
1999.