Identifying species-specific subsequences in bacteria transcription terminators-A machine learning approach

doi:10.4236/jbise.2009.23031

Paper Menu >>

Journal Menu >>

J. Biomedical Science and Engineering, 2009, 2, 184-189

Published Online June 2009 in SciRes. http://www.scirp.org/journal/jbise

JBiSE

Identifying species-specific subsequences in

bacteria transcription terminators-A

machine learning approach

Bao-Hua Gu1, Yi Sun1

1School of Computing Science, Simon Fraser University, 8888 University Drive, Burnaby, BC, Canada.

Email: {bgu, sunyi}@cs.sfu.ca

Received 24 March 2008; revised 10 March 2009; accepted 5 April 2009.

ABSTRACT

Transcription Terminators (TTs) play an impor-

tant role in bacterial RNA transcription. Some

bacteria are known to have Species-Specific

Subsequences (SSS) in their TTs, which are be-

lieved containing useful clues to bacterial evolu-

tion. The SSS can be identified using biological

methods which, however, tend to be costly and

time-consuming due to the vast number of sub-

sequences to experiment on. In this paper, we

study the problem from a computational per-

spective and propose a computing method to

identify the SSS. Given DNA sequences of a tar-

get species, some of which a re known to contain

a TT while others not, our method uses machine

learning techniques and is done in three steps.

First, we find a ll frequent subseq uences from th e

given sequences, and show that this can be effi-

ciently done using generalized suffix trees. Sec-

ond, we use these subsequences as features to

characterize the original DNA sequences and

train a classification model us ing Support Vector

Machines (SVM) , one of the currently most effec-

tive machine learning techniques. Using the pa-

rameters of the re sult ing SVM model, w e define a

measure called subsequence specificity to rank

the frequent subsequences, and output the one

with the highest rank as the SSS. Our experi-

ments show that the SSS found by the proposed

method are very close to those determined by

biological experiments. This suggests that our

method, though purely computational, can help

efficiently locate the SSS by effectiv ely narro wing

down the search space.

1. INTRODUCTION

Bacterial genomes are organized into units of expression

that are bounded by sites where transcription of DNA

into RNA is initiated and terminated. The former site is

called Transcription Promoter, which determine what

region and which strand of DNA will be transcribed into

RNA. The later site is called Transcription Terminator

(TT), which is a stop signal to terminate the transcription

process [1].

Although the mechanism of transcription termination

in bacteria is rho-independent, or intrinsic, the TTs in

different bacteria have a common hairpin-like RNA

structure as shown in Figure 1: typically a stem-loop

structure with dyadic stempairing high in guanine (G)

and cytosine (C) residue content, followed by a uracil-

rich (U) stretch of sequence proximal to the termination

site [2].

This particular RNA structure is encoded in the certain

region of the corresponding DNA sequence. It is well

known that the DNA sequences of some bacteria have

species-specific subsequences (SSS) for the TTs. That is,

the subsequence appears significantly more frequently in

certain species than in others. For example, for bacte-

rium Neisseria Meningitidis, the SSS is GCCGTCTGAA,

while for bacterium Haemophilus Influenzae, the SSS is

AAGTGCGGT [3]. Having found many DNA sequences

of TTs of many bacteria, biologists are wondering

whether these bacteria also have such SSS. The biologi-

cal motivation behind is that the SSS might provide bet-

ter understanding of the terminator evolution and how it

functions in genetic exchange between pathogens.

Of course, the SSS can be determined by biological

experiments. However, such experiments tend to be

costly and time-consuming because every subsequence

has to be examined and the number of such subse-

quences can be prohibitively large due to the length of

the DNA sequences. To overcome the difficulty, biolo-

gists typically apply domain knowledge to narrow down

the search space. However, the process of obtaining such

knowledge itself would also be time-consuming and

highly expensive if they are not available yet. Therefore,

it is desirable to study alternative ways to effectively

identify the SSS from the given DNA TT sequences.

B. H. GU et al. / J. Biomedical Science and Engineering 2 (2009) 184-189 185

Figure 1. The stem-loop structure of bacteria transcrip-

tion terminators.

In this paper we consider the problem from computa-

tional perspective. We apply machine learning tech-

niques and propose a method to compute the SSS. Our

method has three steps. In the first step, we finds out all

frequent subsequences that occur more than a given

threshold (or minimum support) by using generalized

suffix tree. In the second step, we take the frequent sub-

sequences as features to vectorize the DNA sequences

and train an SVM classifier. In the third step, we define a

measure of subsequence specificity and output the most

significant features based on the measure as the SSS.

Our experimental results show that the SSS found by our

method are very close to the known SSS for some well

known bacteria. This demonstrates that the proposed

method is effective in identifying the SSS.

Our main contribution here is to introduce a machine

learning approach to identifying the SSS.

We have seen successful applications of machine

learning and data mining approaches to other biological

problems. However, to our best knowledge, this work is

the first that applies machine learning techniques to this

problem. An advantage of our method is that it does not

require biological domain knowledge although such

knowledge would further improve the accuracy. Besides,

our method does not make any assumption specific to

this particular problem, and thus is potentially applicable

to other similar problems.

Moreover, we also contribute to computing commu-

nity by trying out new ways of using classification tech-

niques. Traditionally, a classifier is used to predict class

labels for future unseen examples. But in this paper, we

use the classifier in two new ways: the tested precision

of the trained classification model is used to judge

whether there would be any SSS, and if so the feature

weights contained in the model are used to rank the fea-

tures in the model. In a word, our method can be consid-

ered as a novel application of classification to computa-

tional biology.

The rest of this paper is organized as follows: Section

2 discusses related research. The detail of the proposed

method is presented in Section 3. Section 4 describes the

experimental method, results and discussions. Section 5

summarizes this paper and points out our future work.

2. RELATED WORK

Given a bacteria DNA sequence, how to locate the seg-

ment corresponding to the RNA structure of TTs has

been well studied in molecular biology and biochemistry.

Interested readers are referred to [2] for research in this

area. In this paper, we are dealing with a different prob-

lem: given DNA sequences that encode TTs of certain

bacterium, find out whether there would be any subse-

quence specific to this bacterium and if so what it is. As

far as we know, this problem has not been well studied

in the literature of machine learning and data mining.

Support Vector Machines (SVM) has been widely

studied and applied in machine leaning [4] and data

mining [5] since introduced by [6]. In recent years, it has

been applied to solve classification problems in bio-

computing. For example, it has been used in predicting

outer membrane proteins of bacteria in [7] and remote

protein homology detection in [2].

We notice that in these applications, SVM lassifiers

are mainly used in a traditional way, in which the classi-

fier is built from training sequences and then applied to

predicting the class labels of future unseen sequences. In

our method, the classifier is used in a very different way:

we use it as an indicator of whether there exist features

sufficiently specific to distinguish the positive class from

the negative class. If the precision of the classifier is

much higher than the odds of making a random guess,

we think this classification is good and reliable. Then we

let the classifier to tell us which features are more im-

portant to the classification, based on which we then

define the subsequence specificity measure to rank the

subsequences. The most significant ones are reported as

the SSS.

In our method, the DNA sequences have to be con-

verted into vectors in the feature space solvable by SVM.

The conversion is done by taking all frequent subse-

quences (frequent in terms of a predefined minimum

number of occurrences) found in DNA sequences as

features in the SVM feature space. This sort of conver-

sion is not new, and has proved to be a very effective

way to express genome/protein sequences for classifica-

tion, because it is able to capture the sequential relations

of the genome/protein sequences [7,8]. The frequent

subsequences can be efficiently found by using general-

ized suffix trees.

Suffix tree is a very useful data structure for string

matching problem, and has been well studied and widely

applied. Interested reader are referred to [9] for more details.

3. THE PROPOSED METHOD

Basically, we treat the target problem of finding SSS as a

classification problem. The intuition is that if there exists

186 B. H. GU et al. / J. Biomedical Science and Engineering 2 (2009) 184-189

any subsequence specific to one species, then we should

be able to build a good classifier to separate this species

from others. Such a classifier should have contained the

SSS as features. As SVM has the state-of-the-art per-

formance for classification, we will train the classifiers

using SVM.

In general, our method can be divided into three steps.

In the first step, it finds out all frequent subsequences

that occur more than a given threshold (called minimum

support) by using generalized suffix trees. Second, it

takes the frequent subsequences as features to vectorize

the original DNA sequences for building the SVM clas-

sifier. If the precision of the classifier is higher than the

odds of making a random guess, we think this is a good

classification. Then we define a measure of subsequence

specificity and output the most significant features based

on the measure. We will describe each step in more de-

tail in the following subsections.

3.1 Frequent Subsequences Generation

Note that SVM requires the input data to be vectorized.

In this paper, the training examples are given in the form

of DNA sequences, which are basically strings of letter

A, C, G and T, representing the four basic amino acids of

DNA. In order to convert a DNA sequence into a vector

format, we need a feature space. Here we use the fre-

quent subsequences to be the features, by which the

original sequence are transformed. We hope that the

subsequences can keep sufficient sequential information

about the original DNA sequences.

Here is a toy example: Given 3 DNA sequences, S1 =

ACGT, S2 = CGTA, S3 = GTAC. If the minimum sup-

port=2 (i.e., a subsequences must appear at least twice to

be considered as frequent), then the frequent subse-

quences are {A,C, G, T,CG,GT,AC}.

It is a nontrivial task to find all frequent subsequences

from thousands of DNA sequences, each of which may

contain more than hundreds of letters, because the total

number of frequent subsequences could be very large.

Although any methods that can output frequent subse-

quences from the original sequences would work as long

as they are computationally feasible. To make it efficient,

we make use of generalized suffix tree [9]. Suffix tree

has proved to be an effective data structure for finding

common substrings in linear time and linear space in

terms of the total number of letters in the given string.

While a generic suffix tree is built for only one string,

it can be easily generalized to contain multiple strings.

The generalized suffix tree provides an efficient way

to finding frequent subsequences in multiple DNA

sequences. Interested readers are referred to the

abovementioned references for details on (generalized)

suffix tree algorithms.

3.2 DNA Sequences Vectorization

Having obtained frequent subsequences, we take each

subsequences as a feature in the feature space, and con-

vert an original DNA sequence into a vector of the fea-

ture space.

Readers familiar with text categorization may draw an

analogy from this step to the text representation, by

viewing a DNA sequence as a document and a frequent

subsequence as a word. Similar to text categorization,

we can have different codings to represent a DNA se-

quence using the frequent subsequences. In our later

experiments, we mainly use the binary coding: if a se-

quence contains a subsequences feature, it is given value

“1” on that feature, otherwise, the feature value is “0”.

This expression, though simple, proves to be very effec-

tive. It is also possible to use other expression such as

TFIDF [10].

Here is a toy example (continued): following the

above toy example, the three sequences can be vector-

ized using the binary coding. Let the feature space be the

frequent subsequences set {A,C, G, T,AC,CG,GT}, the

sequence S1 = ACGT is converted to vector (1,1,1,1,1,1,

1), S2 = CGTA becomes (1,1,1,1,0,1,1), and the se-

quence S3 = GTAC becomes (1,1,1,1,1,0,1).

3.3 SVM Classification

SVM is based on the structural risk minimization princi-

ple from statistical learning theory introduced by V.

Vapnik [6].

In its basic form, SVM learns linear decision hyper-

plane





bxwsignxh )( , described by a weight vec-

tor w and a threshold b. The input is a set of n training

example ,)),(,),,(( 11

nnn RyxyxS  





1,1







For a linearly separable Sn, the SVM finds the hyper-

plane with maximum Euclidean distance to the closest

training examples. This distance is called the margin δ,

as depicted in Figure 2. For a nonlinearly separable Sn,

the SVM uses either a kernel function to map the origi-

nal data into another dimensional space where the data

points are linearly separable, or a hyperplane with a soft

margin the allows some degree of training error in order

to obtain a large margin. Computing the hyperplane is

Figure 2. A linear SVM for a two dimensional training set.

B. H. GU et al. / J. Biomedical Science and Engineering 2 (2009) 184-189 187

equivalent to solving an optimization problem. The re-

sulting model is actually the decision hyperplane, ex-

pressed by w and b.

Note that the resulting weights can be actually viewed

as a ranking of the features: a positive (negative) weight

means the corresponding feature values contribute to the

positive (negative) class; the larger the absolute weight

value, the more important the feature is to the classifica-

tion; weights of zero or very small values have little

contribution to classification. Later we will use this

ranking information to define what we call the measure

of subsequence specificity.

We decide to use SVM for classification because it

has proved to have a number of advantages over other

classification algorithms. First, it can easily handle very

high dimensional input space [11]. Second, it works well

when the problem contain few irrelevant features [11].

Third, it does not require feature selection as a preproc-

essing step, because it can automatically ignore irrele-

vant features during the optimization [11]. Fourth, the

weights contained in the resulting SVM model can be

used to rank the features. All these are essential to solv-

ing the target problem.

In this paper, we use the popular SVM-light imple-

mentation of SVM [12]. Note that when training SVM

classifier, one can select a kernel function suitable for

the studied problem. Motivated by the fact that text

documents are in general linearly separable [11], in our

later experiments we only consider linear SVM model

which is used by default in SVMlight.

3.4 Measure of Good Classifiers

Usually the performance of a classifier is measured by

classification accuracy, precision and/or recall. They are

defined based on a confusion matrix as shown in Table I.

The Precision of the positive class is defined as P=

TP=(TP+FP), while the Recall of the positive class is

defined as R=TP/(TP+FN). The overall Accuracy is de-

fines as A=(TP+TN)/(TP+FP+TN+FN). In our method,

the performance of SVM classifier is not measured by

the overall accuracy, because negative training se-

quences would be much more than those positives, and

the overall accuracy would be heavily affected by the

overwhelming negatives.

Besides, as our goal is to identify the specific subse-

quences in positive species, we hope to maximize the

precision (i.e., the probability of the classifier making a

correct prediction is high). Therefore, we will only use

precision as the measure of classifier’s performance. At

the same time, we will report corresponding recalls for

reference. Note that the precision of the classifier should

be higher than the odds that one makes a random guess

for the positive class label, in order to be considered as a

good classification. Otherwise, the classifier does not

make sense. For example, if a training data set contains

Table 1. Confusion matrix in classification.

Actual number of

positive examples

Actual number of

negative examples

Number of examples

classified as positiveTrue Positive (TP) False Positive (FP)

Number of examples

classified as negativeFalse Negative (FN) True Negative (TN)

100 positive examples and 100 negative examples, if one

randomly guesses for the class label of any of the 200

examples, the probability of making a correct guess is

obviously 50%. If a classifier than a random guess, then

the classifier is useless. A bad classification may suggest

that the information contains in the training examples is

probably insufficient for the classifier to distinguish the

positive examples from the negative examples.

In our method, we take this as an indication that the

positive species probably contains no SSS. In other

words, we consider a classifier is good only when its

precision is much higher than the random guess odds. In

our later experiments, the precision is obtained by test-

ing against a reserved portion of the total data, which is

unseen during the classifier is trained.

3.5 Subsequence Specificity Measure

Note that the resulting classifier will not be used as usual

to predict class label for an unseen sequence. Remember

the task here is to identify the SSS. For this purpose, we

make use of the weights of features in the SVM model,

to define what we call subsequence specificity to meas-

ure how specific a subsequence f is to the positive spe-

cies as follows:

spec(f)=svm_weight(f)×support(f,+)

×confidence(f,+).

here weight (f) is the weight of feature f obtainable from

the learned SVM model, support (f,+) is a measure of

how many sequences of positive species contain f, and

confidence(f,+) is a measure of how many sequences

that contain f belong to positive species.

The definition of the subsequence specificity is based

on the following observations and expectations:

1) An SSS should occur frequently in the positive

species. This means it should have a high support.

2) An SSS should occur more frequently in the posi-

tive species than in the negative species. This means it

should have a high confidence.

3) Having a high support and/or a high confidence

does not necessarily mean a high distinguishing power,

while this can be reliably judged by the feature weights

in SVM model.

4) SVM may give more weight on features having low

support and/or confidence. This means the weights alone

are insufficient to measure the specificity.

5) Each of the three quantities alone characterizes the

subsequence specificity from an unique angle. Combing

188 B. H. GU et al. / J. Biomedical Science and Engineering 2 (2009) 184-189

them together should be able to give a more accurate

characterization of the specificity, and thus is expected

to be a more reliable measure.

4. EVALUATION

4.1 The Data Set

The data set used in this project is provided by the Brink-

man Libratory of SFU Molecular Biology and Biochemis-

try Department. It contains known rho-independent tran-

scription terminator sequences for 42 bacteria genomes.

The source of the data set is The Institute for Genomic

Research (TIGR). Each record of the data set contains one

terminator sequence (including left tail, stem and loop, and

right tail) and the taxonomy id of the corresponding bacte-

rium. The total number sequences is 12763, and 18 species

have less than 100 terminator sequences.

Besides, we are also provided with a file containing

the taxonomy information of all bacteria species. Figure

3 shows the taxonomy tree for the 42 bacteria, obtained

from the file. The tree shows the evolution paths for the

42 bacteria. Each node in the tree represents a bacterium

species, and marked by its taxonomy id. The leaf nodes

are enclosed by a parenthesis. Note that among the 42

bacteria, 41 are at leaf nodes, except species 562, which

has species 83334 as its children. This is the only pair of

species that has a direct parent-children relationship in

the given data set.

The task is to identify those DNA subsequences that

are specific to certain species. The hope is that this

should provide a better understanding of bacteria evolu-

tion and how it plays a role in genetic exchange be-

tween pathogens. Note that for a given species, it does

not necessarily have any subsequences specific to it.

Also, when talking about the specificity, one should

consider the corresponding scope. That is, with respect

to which species a subsequence is considered specific.

For example, a subsequence can be specific to bacte-

rium A with respect to another bacterium B. Also, the

subsequence can be specific to bacterium A with respect

to all other bacteria.

Among the 42 species, we evaluate our methods on

four bacteria whose SSS for the TTs are already known

biologically. The four species are: Haemophilus influ-

enzae, Neisseria meningitidis, Pasteurella multocida,

and Pseudomonas aeruginosa [3]. As the SSS for the

four bacteria are all found in the left inverted repeat (i.e.,

the left stem) of corresponding TTs, we will also use the

left stem of each TT sequence to find the SSS. For each

of the four species, we employ the one-against-

all-others approach to build classifier by taking all its

TT sequences (the left stems) as positive sequences

while those of the other 41 species as negative se-

uences. q

4.2. Experimental Methodology

All experiments are done using 5-fold cross validation.

Averaged values over the five round experiments are

reported. The linear classifiers are trained using SVM-

light default settings. The frequent subsequences are

obtained from the positive training sequences by setting

the minimum support to be 1% of the total number of

positive sequences (and the absolute support number no

less than 2). These frequent subsequences represent sta-

tistically significant features with regard to the positive

species, while resulting in substantially lower dimen-

sions compared to the feature space of all potential sub-

sequences. The classification performance is evaluated

by the precisions, while the corresponding recalls are

reported for reference.

4.3 Experimental Results

The results are given in Table II, from which we can see

that the precisions for all the four species are signifi-

cantly higher than the respective random guess odds.

This implies that there are subsequence features that can

distinguish the positive species from all the others. We

then output the top 10 subsequences based on the sub-

sequence specificity measure. The results are given in

Ta b l e 2. In Ta ble 3, the resulting subsequences that are

the closest to the known SSS are underlined.

We can see that for species Pasteurella multocida and

Pseudomonas aeruginosa, the underlined subsequences

are exactly the same as the known SSS. For the Haemo-

philus influenzae and Neisseria memingitidis, our results

are very close to the known SSS: the underlined ones are

very close to the known SSS. In all cases, our top 10

subsequences are almost the substrings of the known

SSS. This shows that our method is rather effective in

finding SSSs. Note that the known SSS are found by

applying biological domain knowledge [3], for example,

discarding those biologically meaningless subsequences

and/or limiting the length of the subsequences, which is

not considered in our method. Therefore, it can be ex-

pected that our method can be improved by incorporat-

ing domain knowledge.

Table 2. Classification performance for the four bacteria.

Name of the

Positive

Species

# of Freq

Subseq

Classifier

Precision

Classifier

Recall

Odds of

Random

Guess

Haemophilus

influenzae 325 55.81% 14.18% 3.19%

Neisseria

meningitidis 360 99.61% 50.67% 3.62%

Pasteurella

multocida 389 48.0% 1.09% 3.73%

Pseudomonas

aeruginosa 296 67.77% 9.34% 4.28%

B. H. GU et al. / J. Biomedical Science and Engineering 2 (2009) 184-189 189

Table 3. The top 10 subsequences with the greatest specificity values for the four species.

The Subsequences found for each Bacterium

Top spec

rank Haemophilus Influenzae Neisseria meningitidis Pasteurella multocida Pseudomonas aeruginosa

1 TGCGGT GCCGTC TGCGGTC GCCCCGC

2 GTGCGGT CGTCTG GTGCGGTC CCCCGC

3 GTGCGG CGTCTGA ACCGCAC GCCCCG

4 CGCACTT CGTCTGAA CCGCAC CCCGC

5 CCGCACT GCCGTCT TGCGGT GCCCC

6 CCGCACTT CCGTCTG GTGCGGT CCCGGC

7 GCGGT CCGTCTGA GCGGTC GCCC

8 CGCACT CGTCT GGCGAA GCC

9 TGCGG CCGTCTGAA ACCGCA GGCGACC

1 0

TGCGGTT GTCTGA CGGTC GGC

actual SSS AAGTGCGGT GCCGTCTGAA GTGCGGT GCCCCGC

5. CONCLUSION

In this paper, our goal is to find the species-specific

subsequences for bacteria transcription terminators. By

treating the problem as a classification problem, we

propose a solution based on frequent subsequences and

Support Vector Machines. We first find frequent subse-

quences from the terminator DNA sequences of the posi-

tive species. We then take all such subsequences as fea-

ture space to transform the original DNA sequences into

SVM readable vectors and train SVM classifiers. The

resulting classifiers are used indicators of the existence

of the SSSs. In order to extract the target subsequences

from the SVM model, we make use of the SVM weights

on the features and define a measure called subsequence

specificity. The most significant features based on the

measure are output as the SSS. Our experiments show

that this method is effective. As a conclusion, we have

presented a novel application of classification to compu-

tational biology.

Although the proposed method is designed and evalu-

ated on DNA terminator sequences of bacteria, we be-

lieve that it is applicable to other similar biology tasks

with perhaps minor modifications. As for future work, it

is desirable to evaluate the proposed method on more

bacteria. Besides, the proposed method itself can be im-

proved in many ways. For example, to refine the speci-

ficity measure to make it more accurate, and to find bet-

ter methods to express DNA sequences for classification.

REFERENCES

[1] P. Turner, (2000) Molecular Biology.

[2] M. D. Ermolaeva, H. G. Khalak, O. White, H. O. Smith,

and S. L. Salzberg, (2000) Prediction of transcription

terminators in bacterial genomes, Journal of Molecular

Biology, 301, 27-33.

[3] T. Davidsen, E. A. Rodland, K. Lagesen, E. Seeberg, and

T. Rognes, (2004) Biased distribution of dna uptake se-

quences towards genome maintenance genes, Nucleic

Acids Research, 32(3), 1050-1058.

[4] C. J. C. Burges, (1998) A tutorial on support vector ma-

chines for pattern recognition, Knowledge Discovery and

Data Mining, 2(2).

[5] T. Joachims, (2002) Optimizing search engines using

clickthrough data, in Proceedings of the ACM Confer-

ence on Knowledge Discovery and Data Mining

(KDD-2002).

[6] V. Vapnik, (1995) The Nature of Statistical Learning

Theory. Springer.

[7] S. Rong, F. Chen, K. Wang, M. Ester, J. L. Gardy, and F.

S. L. Brinkman, (2003) Frequent-subsequence-based

prediction of outer membrane proteins, in Proceedings of

2003 ACM SIGKDD Conference.

[8] M. Deshpande and G. Karypis, (2002) Evaluation of

techniques for classifying biological sequences, in Pro-

ceedings of Pacific Asia Conference on Knowledge Dis-

covery and Data Mining (PAKDD-2002).

[9] D. Gusfield, (1997) Algorithms on strings, trees, and

sequences: computer science and computational biology,

Cambridge University Press.

[10] G. Salton and C. Buckley, Term weighting approaches in

automatic text retrieval, Information Processing and

Management, 24(5), 1988.

[11] T. Joachims, (1998) Text categorization with support

vector machines: Learning with many relevant features,

in Proceedings of the European Conference on Machine

Learning (ECML-1998).

[12] (2002) Svmlight support vector machine, web.

[13] B. Gu, (2007) Discovering species-specific transcription

terminators for bacteria, School of Computing Science,

Simon Fraser University, Tech. Rep.

[14] J. R. Quinlan, (1993) C4.5: Programs for machine learn-

ing, Morgan Kaufmann Publisher.

[15] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J.

Stone, (1984) Classification and regression trees,

Wadsworth.

[16] H. O. Lancaster, (1969) The chi-squared distribution,

John & Sons.