J. Software Engineering & Applications, 2009, 2: 354-360
doi:10.4236/jsea.2009.25047 Published Online December 2009 (http://www.SciRP.org/journal/jsea)
Copyright © 2009 SciRes JSEA
Improved Genetic Programming Algorithm
Applied to Symbolic Regression and Software
Reliability Modeling
Yongqiang ZHANG1, Huifang CHENG1, Ruilan YUAN2
1School of Information and Electric Engineering, Hebei University of Engineering, Handan, China; 2College of Arts, Hebei University of Engineering, Handan, China.
Email: yqzhang@hebeu.edu.cn, hfcheng4@163.com, yuanruilan_2000@sina.com
Received June 24th, 2009; revised September 7th, 2009; accepted September 16th , 2009.
ABSTRACT
The present study aims at improving the problem-solving ability of the canonical genetic programming algorithm and describes an improved genetic programming (IGP). The proposed method improves four components of the algorithm: population initialization, the reproduction operator, the crossover operator, and the mutation operator. The IGP is examined in two domains, and the results suggest that it is more effective and more efficient than the canonical algorithm in both.
Keywords: Improved Genetic Programming, Symbolic Regression, Software Reliability Model
1. Introduction
Genetic Programming (GP) is an automated method for creating a working computer program from a high-level statement of a problem [1,2]. The technique was pioneered by John Koza [3]; it enables computers to solve problems without being explicitly programmed and is based on the idea of genetic algorithms presented by John Holland [4]. The goal is to apply the concepts of Darwinian evolution to computer program induction. These concepts are usually realized through genetic operators such as selection, crossover, mutation and reproduction [5].
In genetic programming, solutions to a problem are represented as syntactic trees (or symbolic expressions), which are evolved within a population of programs towards an effective solution to the specific problem according to Darwinian principles. The flexibility and expressiveness of the computer program representation, combined with the powerful capability of evolutionary search, make GP a promising method for solving a great variety of problems [6].
1.1 Goals
To summarize, the learning/evolutionary process of the canonical GP algorithm has at least the following problems: the scale of the population is usually quite large and the convergence of the algorithm is very slow [7]; the evolved programs are usually too big and contain a large amount of redundancy; programs in the initial population are generated purely at random; and setting the algorithm parameters depends on experience. To overcome these problems, the overall goal of this paper is to investigate a new approach to improve the efficiency of the GP algorithm. Specifically, we investigate whether the new approach outperforms the canonical GP in terms of data-fitting performance and training time in the evolutionary process.
1.2 Structure
The rest of the paper is organized as follows. Section 2
gives a brief overview of the canonical GP algorithm.
Section 3 describes four developments to improve GP
performance. Section 4 describes the experiment design
and configurations. Section 5 presents the applications. Section 6
draws the conclusions and gives future directions.
2. The Canonical GP Algorithm
Genetic Programming starts with an initial population of
randomly created programs composed of functions and
terminals appropriate to the problem domain [1,2]. Then
all programs in the population are evaluated in terms of how well they perform in the particular problem environment. This evaluation is called the fitness measure. According to the principle of survival of the fittest, genetic operations are used to create a new offspring population of programs from the current population [8].
The parameters of genetic programming, however, must be set in advance; these include the probabilities of reproduction, crossover, and mutation. Over many generations, the exact or a near-optimal solution is expected to be found [9].
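For concreteness, the canonical loop just described can be outlined as follows. This is only an illustrative Python sketch (the experiments in this paper were implemented in MATLAB); random_program, fitness, select, crossover and mutate are placeholders for the components discussed in Section 3, and fitness is treated here as an error to be minimized.

```python
import random

def canonical_gp(pop_size, max_generations, p_cross, p_mut,
                 random_program, fitness, select, crossover, mutate,
                 target_error=1e-4):
    """Sketch of the canonical GP loop; the helper callables stand in for the
    components discussed in Section 3. Fitness is an error (smaller is better)."""
    # 1. Random initial population built from problem-specific functions and terminals.
    population = [random_program() for _ in range(pop_size)]
    best = min(population, key=fitness)
    for _ in range(max_generations):
        if fitness(best) <= target_error:        # close enough to the desired solution
            break
        scores = [fitness(p) for p in population]
        # 2. Build the offspring population using the pre-set operator probabilities
        #    (the remaining probability mass corresponds to plain reproduction).
        offspring = []
        while len(offspring) < pop_size:
            r = random.random()
            if r < p_cross:
                offspring.extend(crossover(select(population, scores),
                                           select(population, scores)))
            elif r < p_cross + p_mut:
                offspring.append(mutate(select(population, scores)))
            else:                                # reproduction: copy the selected program
                offspring.append(select(population, scores))
        population = offspring[:pop_size]
        best = min(population + [best], key=fitness)
    return best
```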
3. Algorithm Optimization
3.1 Problems of Canonical GP Algorithm
Problems of canonical GP algorithm are listed as follows.
Firstly, the quality of the randomly created by initial
programs can not be guaranteed against bad. Secondly,
when the number of programs in the population is bigger,
the convergence of the algorithm can be very slow, and
one or two programs will be chosen randomly for repro-
duction, mutation or crossover during the evolutionary
process. If the selected programs are too big, the evolu-
tion for the rest of the leaning process will be slow. More
importantly, the good building blocks in these big pro-
grams will have a much greater chance to be destroyed
than in the small ones, which could lead to poor solutions
by the evolutionary process.
3.2 Improvement
The first improvement is to create the initial population of programs according to a quality criterion. The concrete method is as follows:
Step 1: set the scale of the population to M and randomly produce an initial population with a larger scale MM (MM > M).
Step 2: retain the M programs with the better fitness to form the initial population.
In this initialization process, the choice of MM (the larger scale) has a direct impact on the convergence rate. If there is little difference between MM and M, the created initial population will be similar to a purely random one. If the difference between MM and M is large, the initialization will require a long computing time. We set MM : M = 4 : 3.
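A minimal sketch of this fitness-based initialization, assuming a random_program generator and a fitness function that returns an error to be minimized (both placeholders, not the paper's MATLAB code):

```python
def initialize_population(M, random_program, fitness, ratio=4/3):
    """Generate MM = ratio * M random programs and keep the M fittest (MM : M = 4 : 3)."""
    MM = int(round(ratio * M))
    candidates = [random_program() for _ in range(MM)]
    # Keep the M programs with the better fitness (here: smaller error).
    candidates.sort(key=fitness)
    return candidates[:M]
```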
In genetic programming, reproduction copies better programs to the next generation and thereby embodies survival of the fittest [2]. Different reproduction methods have different selection intensities. Generally speaking, the reproduction operator provides a selection scheme that makes it easy for better programs in the current population to enter the next generation. In the canonical roulette wheel selection method, an individual is selected according to the proportion of its own fitness to the total fitness of all programs in the population [10]. Programs with low fitness scores therefore have a low probability of being reproduced, while programs which perform particularly well have a very high probability of being selected. But roulette wheel selection has its deficiencies. On the one hand, it does not necessarily select all of the better programs: some programs with high fitness scores may not be duplicated into the next generation, and some inferior programs may be selected instead. On the other hand, programs with good building blocks but low fitness scores may not be duplicated. This easily gives rise to premature convergence, in which the evolution converges to a local optimum rather than the global optimum.
In order to avoid the defects of the roulette wheel selection method while retaining its advantages, a new reproduction operator is proposed as follows. The best programs of the current population are placed in a first-category gene pool and copied directly to the next generation; the remaining programs of the current population are selected into the next generation by roulette wheel selection.
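This improved reproduction operator can be sketched as follows. The size of the first-category gene pool (elite_size) is left as a parameter, and fitness scores are assumed to be non-negative with larger values being better, as roulette wheel selection requires.

```python
import random

def roulette_select(population, scores):
    """Select one program with probability proportional to its fitness score."""
    total = sum(scores)
    if total == 0:                       # degenerate case: fall back to a uniform choice
        return random.choice(population)
    r = random.uniform(0, total)
    acc = 0.0
    for prog, s in zip(population, scores):
        acc += s
        if acc >= r:
            return prog
    return population[-1]

def reproduce(population, scores, elite_size):
    """Elitist + roulette wheel reproduction: copy the first-category gene pool
    (the elite_size best programs) directly; fill the rest by roulette wheel
    selection over the remaining programs."""
    ranked = sorted(zip(scores, range(len(population))), reverse=True)
    elite = [population[i] for _, i in ranked[:elite_size]]
    rest_idx = [i for _, i in ranked[elite_size:]]
    rest_pop = [population[i] for i in rest_idx]
    rest_scores = [scores[i] for i in rest_idx]
    next_gen = list(elite)
    while len(next_gen) < len(population):
        next_gen.append(roulette_select(rest_pop, rest_scores))
    return next_gen
```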
Crossover combines the genetic material of two parents by swapping a part of one parent with a part of the other [2]. With a tree-based representation, replacing a node means replacing the whole branch, which adds greater effectiveness to the crossover operator: the expressions resulting from crossover can be greatly different from their parents. It is therefore necessary to use a pre-selection mechanism for crossover: the new programs replace their parents in the next generation only when their fitness scores are higher than those of the parents; otherwise the crossover operator is carried out again. Because of the structural similarity between the accepted offspring and the parents they replace, the genetic material of the two parents is exchanged between programs of similar structure. This effectively maintains the diversity of the population during evolution, and the algorithm becomes more likely to find the best individual in the whole population.
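A sketch of this pre-selection mechanism is given below; the subtree_crossover helper and the bound on retries are illustrative assumptions, and fitness is again a score where larger is better.

```python
def preselection_crossover(parent1, parent2, subtree_crossover, fitness, max_retries=10):
    """Offspring replace their parents only if they are fitter; otherwise retry crossover."""
    for _ in range(max_retries):
        child1, child2 = subtree_crossover(parent1, parent2)
        # Accept the offspring only when they improve on the parents they would replace.
        if fitness(child1) > fitness(parent1) and fitness(child2) > fitness(parent2):
            return child1, child2
    # If no improving offspring were found, keep the parents in the next generation.
    return parent1, parent2
```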
Mutation affects a single individual in the population. Performing the mutation operator helps maintain the diversity of the population and avoid premature convergence [1]. The mutation probability is quite small, generally between 0 and 0.05. The mutation operator randomly selects a point in the tree and replaces the existing sub-tree at that point with a new randomly generated sub-tree. If the replaced node is a function, a function with the same number of arguments is selected from the function set; if the replaced node is a terminal, a terminal is selected to form the new leaf node [2]. The fitness of the selected tree is thus changed: a worse individual may well become better, and a better individual may well become worse, after mutation. Two ways of performing the mutation operator are adopted. In the initial stage the operator replaces a whole sub-tree of the selected individual, that is, it removes a random sub-tree and puts a new sub-tree in the same
place. In the later stage, when programs tend to be good, the operator replaces just the node's information.
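The two-stage mutation can be sketched as follows; the node structure, the small illustrative function and terminal sets, and the stage threshold (half of the maximum number of generations) are assumptions made only for illustration.

```python
import random

FUNCTIONS = {'+': 2, '-': 2, '*': 2, '/': 2, 'sin': 1, 'cos': 1}   # symbol -> arity (illustrative)
TERMINALS = ['x', '1.0']

class Node:
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

def random_subtree(max_depth):
    """Grow a small random sub-tree (placeholder for the paper's generator)."""
    if max_depth <= 1 or random.random() < 0.3:
        return Node(random.choice(TERMINALS))
    f = random.choice(list(FUNCTIONS))
    return Node(f, [random_subtree(max_depth - 1) for _ in range(FUNCTIONS[f])])

def all_nodes(tree):
    yield tree
    for child in tree.children:
        yield from all_nodes(child)

def mutate(tree, generation, max_generations, max_depth=7):
    """Early stage: replace a whole random sub-tree. Late stage: change only the
    node's information, keeping the structure (same arity for functions)."""
    node = random.choice(list(all_nodes(tree)))
    if generation < 0.5 * max_generations:          # initial stage (threshold assumed)
        new = random_subtree(max_depth)
        node.value, node.children = new.value, new.children
    else:                                           # later stage: point mutation only
        if node.children:                           # function node: same number of arguments
            same_arity = [f for f, a in FUNCTIONS.items() if a == len(node.children)]
            node.value = random.choice(same_arity)
        else:                                       # terminal node
            node.value = random.choice(TERMINALS)
    return tree
```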
4. Experiments Design and Configurations
4.1 Program Representation and Generation,
and Genetic Operators
In this Improved Genetic Programming (IGP), we use tree structures to represent genetic programs. The ramped half-and-half method [1] is used to generate the programs of the initial population, restricted by an initial maximum depth; programs are likewise restricted by a maximum depth in the mutation and crossover operators. A sketch of this initialization is given below.
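The sketch assumes a small illustrative function set with known arities and uses the initial-max-depth of 5 from Table 1; in the "full" method every branch is grown to the depth limit, while in the "grow" method functions and terminals are mixed freely [1].

```python
import random

FUNCTIONS = {'+': 2, '-': 2, '*': 2, 'sin': 1, 'cos': 1}    # symbol -> arity (illustrative)
TERMINALS = ['x', '1.0']

def gen_tree(depth, method):
    """'full': functions until the depth limit; 'grow': functions or terminals at each step."""
    if depth <= 1 or (method == 'grow' and random.random() < 0.5):
        return (random.choice(TERMINALS),)
    f = random.choice(list(FUNCTIONS))
    return (f,) + tuple(gen_tree(depth - 1, method) for _ in range(FUNCTIONS[f]))

def ramped_half_and_half(pop_size, max_depth=5):
    """Cycle through depths 2..max_depth, alternating 'full' and 'grow' programs."""
    population, depths = [], list(range(2, max_depth + 1))
    for i in range(pop_size):
        depth = depths[i % len(depths)]
        method = 'full' if i % 2 == 0 else 'grow'
        population.append(gen_tree(depth, method))
    return population
```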
The maximum size allowed for a program is set as one of the principal parameters of a GP run: it limits the depth of the program tree in the crossover and mutation operators, which controls redundant expressions in individuals to a certain degree and can greatly reduce computing time.
A pre-selection mechanism and a changing mutation intensity (in the mutation operator) are used during the learning process. The elitist and roulette wheel selection mechanisms are used in the crossover and reproduction operators [11].
More details of the program representation and generation methods and of the genetic operators are given in Section 3.
4.2 Function and Terminal Sets
In the function set, five standard arithmetic operators and seven mathematical functions are used to form the non-terminal nodes:

FuncSet = {+, −, .*, ./, .^, sin, cos, tan, cot, exp, log, sqrt}.

The + and − operators have their usual meanings, addition and subtraction, while .*, ./ and .^ denote element-wise multiplication, division and exponentiation applied to the corresponding matrix elements, as in MATLAB notation. The programs are developed in MATLAB 7.0.
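As an illustrative counterpart of this function set (the paper's implementation is in MATLAB 7.0), the operators can be mapped onto NumPy's element-wise operations, which behave like MATLAB's .*, ./ and .^; the prefix-tuple tree encoding used below is an assumption made for the sketch.

```python
import numpy as np

FUNC_SET = {
    '+': np.add, '-': np.subtract,
    '.*': np.multiply, './': np.divide, '.^': np.power,   # element-wise, as in MATLAB
    'sin': np.sin, 'cos': np.cos, 'tan': np.tan,
    'cot': lambda v: 1.0 / np.tan(v),
    'exp': np.exp, 'log': np.log, 'sqrt': np.sqrt,
}

def evaluate(tree, x):
    """Evaluate a prefix-tuple expression tree, e.g. ('+', ('.^', ('x',), ('2',)), ('x',)),
    over a vector of sample points x."""
    op = tree[0]
    if op == 'x':
        return x
    if op not in FUNC_SET:            # numeric constant terminal
        return np.full_like(x, float(op))
    args = [evaluate(sub, x) for sub in tree[1:]]
    return FUNC_SET[op](*args)

# Example: y = x.^4 + x.^3 + x.^2 + x (Function 1) evaluated on 10 sample points.
x = np.linspace(1, 10, 10)
expr = ('+', ('+', ('.^', ('x',), ('4',)), ('.^', ('x',), ('3',))),
        ('+', ('.^', ('x',), ('2',)), ('x',)))
y = evaluate(expr, x)
```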
Table 1. Basic parameters of Genetic Programming

Parameter                   Value            Parameter                   Value
population-size             800              internal-node-rate          0.9
initial population method   half-and-half    initial-max-depth           5
crossover-rate              0.70             max-depth after crossover   7
mutation-rate               0.05             max-depth after mutation    7
reproduction-rate           0.05             max-generations             100
4.3 Parameters and Termination Criteria
In the canonical GP the probabilities of selection, crossover and mutation are set in advance based on experience, and the evolution proceeds with these predetermined probabilities from start to end. In the IGP, all control parameters of the genetic programming algorithm are optimized and combined by means of the orthogonal experiment method. An orthogonal experiment is a method for designing multi-factor, multi-level experiments; it curbs the blindness of parameter setting in GP and yields scientifically meaningful experimental results from only a few test runs. The basic parameter values used in this approach are shown in Table 1.
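To make the orthogonal-experiment idea concrete, the sketch below tunes three parameters at three levels each using the standard L9(3^4) orthogonal array (only its first three columns are used); the chosen levels and the run_igp evaluation routine are illustrative assumptions rather than the paper's actual design.

```python
# Standard L9(3^4) orthogonal array (level indices 0..2); columns = factors.
L9 = [
    [0, 0, 0, 0], [0, 1, 1, 1], [0, 2, 2, 2],
    [1, 0, 1, 2], [1, 1, 2, 0], [1, 2, 0, 1],
    [2, 0, 2, 1], [2, 1, 0, 2], [2, 2, 1, 0],
]

# Illustrative factor levels (not the paper's actual design).
LEVELS = {
    'crossover-rate':  [0.60, 0.70, 0.80],
    'mutation-rate':   [0.01, 0.05, 0.10],
    'population-size': [400, 800, 1200],
}

def orthogonal_tuning(run_igp):
    """Run one IGP trial per row of the orthogonal array and return the best setting.
    run_igp(params) is assumed to return an error measure (smaller is better)."""
    factors = list(LEVELS)
    results = []
    for row in L9:
        params = {f: LEVELS[f][row[i]] for i, f in enumerate(factors)}
        results.append((run_igp(params), params))
    return min(results, key=lambda r: r[0])[1]
```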
The evolutionary process runs for a fixed number of generations (max-generations) unless it finds an exact solution or a program close enough to the desired solution.
5. Application
5.1 Applied to Symbolic Regression
The IGP has been tested on two examples of considerable complexity and compared with the original algorithm under the same platform and precision.

Function 1:
y = x^4 + x^3 + x^2 + x    (1)

Function 2:
y = sin 3x + sin 2x + sin x    (2)

To reduce the influence of random factors on the IGP evolution process as far as possible, the experiments are repeated 50 times for all cases and the average results are shown in Figure 1 ~ Figure 4. These results compare the canonical GP and the IGP on the same instances.
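As a concrete illustration of the experimental setup, the two target functions and a sum-of-squared-error fitness can be written as follows; the sampling range and the number of sample points are assumptions (Figures 3 and 4 plot x from 1 to 10).

```python
import numpy as np

def function1(x):
    return x**4 + x**3 + x**2 + x                        # Equation (1)

def function2(x):
    return np.sin(3 * x) + np.sin(2 * x) + np.sin(x)     # Equation (2)

# Training samples (range and density assumed; Figures 3-4 plot x in [1, 10]).
x_train = np.linspace(1, 10, 50)

def regression_error(candidate, target, x=x_train):
    """Fitness used for symbolic regression: smaller error means a fitter program."""
    return float(np.sum((candidate(x) - target(x)) ** 2))

# Example: a perfect candidate for Function 1 has zero error.
assert regression_error(lambda x: x**4 + x**3 + x**2 + x, function1) == 0.0
```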
Figure 1. Comparison of convergence between GP and the new approach (Function 1): convergence time and number of iteration generations of the canonical GP and of the new approach, plotted against the experiment number.
Figure 2. Comparison of convergence between GP and the new approach (Function 2): convergence time and number of iteration generations of the canonical GP and of the new approach, plotted against the experiment number.
Figure 3. Simulation result of GP and the new approach (Function 1), with g = x^4 + x^3 + x^2 + x as the instance: actual curve and fitting curves of GP and of the new approach (y value against x value).
Figure 4. Simulation result of GP and the new approach (Function 2), with g = sin(3x) + sin(2x) + sin(x) as the instance: actual curve and fitting curves of GP and of the new approach (y value against x value).
Fitting function expressions:
Function 1, GP fitting expression: (sin)log4 xx yxxxe =−+
Function 1, IGP fitting expression: 4 tan(cot(cot())) xx yxxxxex =+−+
Function 2, GP fitting expression: sintan costan() (cot())/ xx xxx yxxxe =−+
Function 2, IGP fitting expression: tan /log()/cossin x xx yxxexx =+
As seen from Figure 1 ~ Figure 4, the IGP proposed in this paper shows distinct improvements in the number of iteration generations, the convergence time, and the smoothness of the fit. The results demonstrate that the improved algorithm is more efficient at evolving good programs toward the best solutions in symbolic regression.
5.2 Applied to Software Reliability Modeling
In the system of software quality objectives, software reliability is the most important inherent characteristic. Software reliability models are the basis of quantitative analysis: through such a model, software reliability can be quantitatively assessed and predicted, the delivery date can be estimated, and the distribution of resources can be adjusted to determine whether the software has reached the predetermined reliability requirements.
Among the models proposed in the literature, the GP evolutionary model has its own characteristics:
• It retains all known information and lets it participate directly in the quantitative calculations, minimizing the accumulation of errors;
• Apart from the original data it introduces no artificial assumptions, so it remains maximally faithful to the given data.
Software reliability models can be classified into two classes, depending on whether they center on failure times or on failure counts. In this paper we mainly focus on failure times. When failure data are collected during the testing process, they come in the form of failure times, i.e., (t1, t2, …, tn), where ti is the time at which the ith failure is observed. We give two examples: one failure data series (Table 2) comes from a software system of the Armored Force of Engineering University, and the other (Table 3) comes from the NTDS (Naval Tactical Data System) of the U.S. Navy Fleet Computer Programming Center.
In the following tables, x is the failure sequence number, xi (Table 2) is the time between successive failures, and ti is the cumulative failure time series.
• Calculation of reliability parameters
Table 2. Failure data series [12]
x 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
xi 1 1 1 5 4 24 6 14 33 1 30 22 13 22 77 7
ti 1 2 3 8 12 36 42 56 89 90 120 142 155 177 254 261
Table 3. Failure data series (NTDS) [13]
x ti x ti x ti
1 9 11 71 21 116
2 21 12 77 22 149
3 32 13 78 23 156
4 36 14 87 24 247
5 43 15 91 25 249
6 45 16 92 26 250
7 50 17 95 27 337
8 58 18 98 28 384
9 63 19 104 29 396
10 70 20 105 30 405
Under the same platform and precision as the original algorithm, the following models with better fitness were obtained after a 100-generation evolution (x is the sequence number of the measured software failure times in Tables 2 and 3, and x > 1):
GP model (evolved expression): T_GP = f(x): cot costanlnsin x: xxxxxxxx ==⋅⋅−+
IGP model (evolved expression): T_IGP = f(x): sin x: ()cos((cos)/cot(tan)) xxxxxxxx ==++
For the data of Table 2, the cumulative failure time calculated by the GP model is t17 = 292.1355, giving a mean time between failures at t16 of MTBF = 31.1355. The cumulative failure time calculated by the IGP model is t17 = 297.0135, giving MTBF = 31.0436 at t16. The observed cumulative failure time is t17 = 300, giving an observed MTBF of 39 at t16.
NTDS GP model (evolved expression): T_NTDS = f(x): ln sin/cottan x x: xxxxexx ==+−⋅
NTDS IGP model (evolved expression): T_NTDS-IGP = f(x): 2 1.2log: xxx ==
For the NTDS data, the cumulative failure time calculated by the GP model is t27 = 315.22, giving a mean time between failures at t26 of MTBF = 65.22. The observed cumulative failure time is t27 = 337, giving an observed MTBF of 47 at t26.
To summarize, the experimental results show that the software reliability models established by the IGP have relatively good one-step predictive power.
• Failure curve
The initial failure rates calculated from the models T_GP and T_IGP are 0.59948 and 0.6447, respectively. The current failure rates of the software (t = 261) are 0.048614 and 0.0038378, respectively. The software failure curves are shown below.
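The paper does not state the exact formula behind these failure-rate curves; purely as an illustrative assumption, one common choice is to take the failure rate after the i-th test as the reciprocal of the model's predicted time to the next failure, as sketched below with a hypothetical fitted model.

```python
def failure_rate_curve(f, n_tests):
    """Illustrative failure rate after each test: 1 / (predicted time to the next failure),
    where f(x) is a fitted cumulative-failure-time model (assumed formulation)."""
    rates = []
    for i in range(1, n_tests + 1):
        dt = f(i + 1) - f(i)                 # predicted time between failures i and i+1
        rates.append(1.0 / dt if dt > 0 else float('inf'))
    return rates

# Example with a hypothetical model f(x) = 15*x + 0.5*x**2 (for illustration only).
rates = failure_rate_curve(lambda x: 15 * x + 0.5 * x ** 2, 16)
```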
Figure 5. Failure rate curve of the T_GP model (failure rate at the current time against the number of software tests).
Figure 6. Failure rate curve of the T_IGP model (failure rate at the current time against the number of software tests).
Figure 7. Failure rate curve of the T_NTDS model (failure rate at the current time against the number of software tests).
Figure 8. Failure rate curve of the T_NTDS-IGP model (failure rate at the current time against the number of software tests).
Figure 9. Simulation results of the GP and IGP models on the data from the Armored Force of Engineering University: cumulative failure time against the software testing serial number, showing the curves evolved by GP, evolved by IGP, and the real failure data series.
Figure 10. Simulation results of the GP and IGP models on the NTDS data: cumulative failure time against the software testing serial number, showing the curves evolved by GP, evolved by IGP, and the real failure data series.
The initial failure rates calculated from the models T_NTDS and T_NTDS-IGP are 0.5 and 0.24869, respectively. The current failure rates of the software (t = 405) are 0.0026775 and 0.0022268, respectively. The software failure curves are shown above.
• Model simulation
Simulation studies are presented to validate the models. The results show that the models established by the IGP are better than those established by the canonical algorithm.
According to the theoretical and experimental analysis, the IGP algorithm not only provides the same excellent fitting performance as the canonical GP, but also saves considerable convergence time compared with it.
6. Conclusions
GP is a powerful paradigm that can be used to solve dif-
ferent problems in several domains. However, its evolutionary time is quite long. The goal of this paper is to investigate ways to improve the power of the GP algorithm. We describe a fitness-based method for generating the initial population, an elitist plus roulette wheel selection mechanism for reproduction, a pre-selection mechanism for the crossover operator, a limit on the depth of the program tree in the crossover operator, and a changing mutation intensity in the mutation operator. The approach was examined on two symbolic regression problems and two software reliability modeling tasks and achieved much better results than the canonical algorithm across these different domains.
The improved genetic programming accelerates the convergence of GP and to a great extent avoids the traps of local optima. We will investigate whether the smoothness of the fit can be improved further by using additional mathematical tools.
7. Acknowledgments
The authors wish to thank the Natural Science Foundation of Hebei Province (Project No. F2008000752), which made the present work possible.
REFERENCES
[1] H. Wright, "Introduction to Genetic Programming," CS555/495 course notes, University of Montana, USA, pp. 1–2, 2002.
[2] W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone, Genetic Programming: An Introduction on the Automatic Evolution of Computer Programs and Its Applications, Morgan Kaufmann Publishers, San Francisco, CA / Dpunkt-Verlag, Heidelberg, ISBN 1-55860-510-X, 1998.
[3] J. R. Koza, Genetic Programming: Automatic Discovery of Reusable Programs, MIT Press, Cambridge, MA, 1994.
[4] J. H. Holland, Adaptation in Natural and Artificial Systems, MIT Press, 1992.
[5] R. V. Silvia and P. Aurora, "A grammar-guided genetic programming framework configured for data mining and software testing," International Journal of Software Engineering and Knowledge Engineering, Vol. 16, No. 2, pp. 245–267, 2006.
[6] W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone, Genetic Programming: An Introduction on the Automatic Evolution of Computer Programs and Its Application, Morgan Kaufmann Publishers, Inc., 1998.
[7] J. L. Bradley, M. Sudhakar, and P. Jens, "Program optimization for faster genetic programming," in Genetic Programming (GP'98), pp. 202–207, Madison, Wisconsin, July 1998.
[8] M. D. Kramer and D. Zhang, "GAPS: A genetic programming system," in Proceedings of the Twenty-Fourth Annual International Computer Software and Applications Conference, pp. 614–619, 2000.
[9] A. P. Fraser, "Genetic Programming in C++," Technical Report TR0140, Cybernetics Research Institute, University of Salford, 1994.
[10] A. Augusto, "Symbolic regression via genetic programming," in Proceedings of the VI Brazilian Symposium on Neural Networks, pp. 173–178, 2000.
[11] T. Walter, "Recombination, Selection, and the Genetic Construction of Computer Programs," Ph.D. thesis, Faculty of the Graduate School, University of Southern California, Canoga Park, CA, USA, April 1994.
[12] Y. Gong and Q. Zhou, "A software test report SRTP," Armored Force Engineering Institute, China, 1995.
[13] X. Huang, Software Reliability, Safety and Quality Assurance, Electronics Industry Press, Beijing, Vol. 10, pp. 9–20, 15–17, 86–112, 2002.