J. Biomedical Science and Engineering, 2009, 2, 184-189
Published Online June 2009 in SciRes. http://www.scirp.org/journal/jbise
Identifying species-specific subsequences in
bacteria transcription terminators-A
machine learning approach
Bao-Hua Gu1, Yi Sun1
1School of Computing Science, Simon Fraser University, 8888 University Drive, Burnaby, BC, Canada.
Email: {bgu, sunyi}@cs.sfu.ca
Received 24 March 2008; revised 10 March 2009; accepted 5 April 2009.
Transcription Terminators (TTs) play an impor-
tant role in bacterial RNA transcription. Some
bacteria are known to have Species-Specific
Subsequences (SSS) in their TTs, which are be-
lieved containing useful clues to bacterial evolu-
tion. The SSS can be identified using biological
methods which, however, tend to be costly and
time-consuming due to the vast number of sub-
sequences to experiment on. In this paper, we
study the problem from a computational per-
spective and propose a computing method to
identify the SSS. Given DNA sequences of a tar-
get species, some of which a re known to contain
a TT while others not, our method uses machine
learning techniques and is done in three steps.
First, we find a ll frequent subseq uences from th e
given sequences, and show that this can be effi-
ciently done using generalized suffix trees. Sec-
ond, we use these subsequences as features to
characterize the original DNA sequences and
train a classification model us ing Support Vector
Machines (SVM) , one of the currently most effec-
tive machine learning techniques. Using the pa-
rameters of the re sult ing SVM model, w e define a
measure called subsequence specificity to rank
the frequent subsequences, and output the one
with the highest rank as the SSS. Our experi-
ments show that the SSS found by the proposed
method are very close to those determined by
biological experiments. This suggests that our
method, though purely computational, can help
efficiently locate the SSS by effectiv ely narro wing
down the search space.
Bacterial genomes are organized into units of expression
that are bounded by sites where transcription of DNA
into RNA is initiated and terminated. The former site is
called Transcription Promoter, which determine what
region and which strand of DNA will be transcribed into
RNA. The later site is called Transcription Terminator
(TT), which is a stop signal to terminate the transcription
process [1].
Although the mechanism of transcription termination
in bacteria is rho-independent, or intrinsic, the TTs in
different bacteria have a common hairpin-like RNA
structure as shown in Figure 1: typically a stem-loop
structure with dyadic stempairing high in guanine (G)
and cytosine (C) residue content, followed by a uracil-
rich (U) stretch of sequence proximal to the termination
site [2].
This particular RNA structure is encoded in the certain
region of the corresponding DNA sequence. It is well
known that the DNA sequences of some bacteria have
species-specific subsequences (SSS) for the TTs. That is,
the subsequence appears significantly more frequently in
certain species than in others. For example, for bacte-
rium Neisseria Meningitidis, the SSS is GCCGTCTGAA,
while for bacterium Haemophilus Influenzae, the SSS is
AAGTGCGGT [3]. Having found many DNA sequences
of TTs of many bacteria, biologists are wondering
whether these bacteria also have such SSS. The biologi-
cal motivation behind is that the SSS might provide bet-
ter understanding of the terminator evolution and how it
functions in genetic exchange between pathogens.
Of course, the SSS can be determined by biological
experiments. However, such experiments tend to be
costly and time-consuming because every subsequence
has to be examined and the number of such subse-
quences can be prohibitively large due to the length of
the DNA sequences. To overcome the difficulty, biolo-
gists typically apply domain knowledge to narrow down
the search space. However, the process of obtaining such
knowledge itself would also be time-consuming and
highly expensive if they are not available yet. Therefore,
it is desirable to study alternative ways to effectively
identify the SSS from the given DNA TT sequences.
B. H. GU et al. / J. Biomedical Science and Engineering 2 (2009) 184-189 185
SciRes Copyright © 2009 JBiSE
Figure 1. The stem-loop structure of bacteria transcrip-
tion terminators.
In this paper we consider the problem from computa-
tional perspective. We apply machine learning tech-
niques and propose a method to compute the SSS. Our
method has three steps. In the first step, we finds out all
frequent subsequences that occur more than a given
threshold (or minimum support) by using generalized
suffix tree. In the second step, we take the frequent sub-
sequences as features to vectorize the DNA sequences
and train an SVM classifier. In the third step, we define a
measure of subsequence specificity and output the most
significant features based on the measure as the SSS.
Our experimental results show that the SSS found by our
method are very close to the known SSS for some well
known bacteria. This demonstrates that the proposed
method is effective in identifying the SSS.
Our main contribution here is to introduce a machine
learning approach to identifying the SSS.
We have seen successful applications of machine
learning and data mining approaches to other biological
problems. However, to our best knowledge, this work is
the first that applies machine learning techniques to this
problem. An advantage of our method is that it does not
require biological domain knowledge although such
knowledge would further improve the accuracy. Besides,
our method does not make any assumption specific to
this particular problem, and thus is potentially applicable
to other similar problems.
Moreover, we also contribute to computing commu-
nity by trying out new ways of using classification tech-
niques. Traditionally, a classifier is used to predict class
labels for future unseen examples. But in this paper, we
use the classifier in two new ways: the tested precision
of the trained classification model is used to judge
whether there would be any SSS, and if so the feature
weights contained in the model are used to rank the fea-
tures in the model. In a word, our method can be consid-
ered as a novel application of classification to computa-
tional biology.
The rest of this paper is organized as follows: Section
2 discusses related research. The detail of the proposed
method is presented in Section 3. Section 4 describes the
experimental method, results and discussions. Section 5
summarizes this paper and points out our future work.
Given a bacteria DNA sequence, how to locate the seg-
ment corresponding to the RNA structure of TTs has
been well studied in molecular biology and biochemistry.
Interested readers are referred to [2] for research in this
area. In this paper, we are dealing with a different prob-
lem: given DNA sequences that encode TTs of certain
bacterium, find out whether there would be any subse-
quence specific to this bacterium and if so what it is. As
far as we know, this problem has not been well studied
in the literature of machine learning and data mining.
Support Vector Machines (SVM) has been widely
studied and applied in machine leaning [4] and data
mining [5] since introduced by [6]. In recent years, it has
been applied to solve classification problems in bio-
computing. For example, it has been used in predicting
outer membrane proteins of bacteria in [7] and remote
protein homology detection in [2].
We notice that in these applications, SVM lassifiers
are mainly used in a traditional way, in which the classi-
fier is built from training sequences and then applied to
predicting the class labels of future unseen sequences. In
our method, the classifier is used in a very different way:
we use it as an indicator of whether there exist features
sufficiently specific to distinguish the positive class from
the negative class. If the precision of the classifier is
much higher than the odds of making a random guess,
we think this classification is good and reliable. Then we
let the classifier to tell us which features are more im-
portant to the classification, based on which we then
define the subsequence specificity measure to rank the
subsequences. The most significant ones are reported as
the SSS.
In our method, the DNA sequences have to be con-
verted into vectors in the feature space solvable by SVM.
The conversion is done by taking all frequent subse-
quences (frequent in terms of a predefined minimum
number of occurrences) found in DNA sequences as
features in the SVM feature space. This sort of conver-
sion is not new, and has proved to be a very effective
way to express genome/protein sequences for classifica-
tion, because it is able to capture the sequential relations
of the genome/protein sequences [7,8]. The frequent
subsequences can be efficiently found by using general-
ized suffix trees.
Suffix tree is a very useful data structure for string
matching problem, and has been well studied and widely
applied. Interested reader are referred to [9] for more details.
Basically, we treat the target problem of finding SSS as a
classification problem. The intuition is that if there exists
186 B. H. GU et al. / J. Biomedical Science and Engineering 2 (2009) 184-189
SciRes Copyright © 2009 JBiSE
any subsequence specific to one species, then we should
be able to build a good classifier to separate this species
from others. Such a classifier should have contained the
SSS as features. As SVM has the state-of-the-art per-
formance for classification, we will train the classifiers
using SVM.
In general, our method can be divided into three steps.
In the first step, it finds out all frequent subsequences
that occur more than a given threshold (called minimum
support) by using generalized suffix trees. Second, it
takes the frequent subsequences as features to vectorize
the original DNA sequences for building the SVM clas-
sifier. If the precision of the classifier is higher than the
odds of making a random guess, we think this is a good
classification. Then we define a measure of subsequence
specificity and output the most significant features based
on the measure. We will describe each step in more de-
tail in the following subsections.
3.1 Frequent Subsequences Generation
Note that SVM requires the input data to be vectorized.
In this paper, the training examples are given in the form
of DNA sequences, which are basically strings of letter
A, C, G and T, representing the four basic amino acids of
DNA. In order to convert a DNA sequence into a vector
format, we need a feature space. Here we use the fre-
quent subsequences to be the features, by which the
original sequence are transformed. We hope that the
subsequences can keep sufficient sequential information
about the original DNA sequences.
Here is a toy example: Given 3 DNA sequences, S1 =
ACGT, S2 = CGTA, S3 = GTAC. If the minimum sup-
port=2 (i.e., a subsequences must appear at least twice to
be considered as frequent), then the frequent subse-
quences are {A,C, G, T,CG,GT,AC}.
It is a nontrivial task to find all frequent subsequences
from thousands of DNA sequences, each of which may
contain more than hundreds of letters, because the total
number of frequent subsequences could be very large.
Although any methods that can output frequent subse-
quences from the original sequences would work as long
as they are computationally feasible. To make it efficient,
we make use of generalized suffix tree [9]. Suffix tree
has proved to be an effective data structure for finding
common substrings in linear time and linear space in
terms of the total number of letters in the given string.
While a generic suffix tree is built for only one string,
it can be easily generalized to contain multiple strings.
The generalized suffix tree provides an efficient way
to finding frequent subsequences in multiple DNA
sequences. Interested readers are referred to the
abovementioned references for details on (generalized)
suffix tree algorithms.
3.2 DNA Sequences Vectorization
Having obtained frequent subsequences, we take each
subsequences as a feature in the feature space, and con-
vert an original DNA sequence into a vector of the fea-
ture space.
Readers familiar with text categorization may draw an
analogy from this step to the text representation, by
viewing a DNA sequence as a document and a frequent
subsequence as a word. Similar to text categorization,
we can have different codings to represent a DNA se-
quence using the frequent subsequences. In our later
experiments, we mainly use the binary coding: if a se-
quence contains a subsequences feature, it is given value
“1” on that feature, otherwise, the feature value is “0”.
This expression, though simple, proves to be very effec-
tive. It is also possible to use other expression such as
TFIDF [10].
Here is a toy example (continued): following the
above toy example, the three sequences can be vector-
ized using the binary coding. Let the feature space be the
frequent subsequences set {A,C, G, T,AC,CG,GT}, the
sequence S1 = ACGT is converted to vector (1,1,1,1,1,1,
1), S2 = CGTA becomes (1,1,1,1,0,1,1), and the se-
quence S3 = GTAC becomes (1,1,1,1,1,0,1).
3.3 SVM Classification
SVM is based on the structural risk minimization princi-
ple from statistical learning theory introduced by V.
Vapnik [6].
In its basic form, SVM learns linear decision hyper-
bxwsignxh )( , described by a weight vec-
tor w and a threshold b. The input is a set of n training
example ,)),(,),,(( 11
nnn RyxyxS 
For a linearly separable Sn, the SVM finds the hyper-
plane with maximum Euclidean distance to the closest
training examples. This distance is called the margin δ,
as depicted in Figure 2. For a nonlinearly separable Sn,
the SVM uses either a kernel function to map the origi-
nal data into another dimensional space where the data
points are linearly separable, or a hyperplane with a soft
margin the allows some degree of training error in order
to obtain a large margin. Computing the hyperplane is
Figure 2. A linear SVM for a two dimensional training set.
B. H. GU et al. / J. Biomedical Science and Engineering 2 (2009) 184-189 187
SciRes Copyright © 2009 JBiSE
equivalent to solving an optimization problem. The re-
sulting model is actually the decision hyperplane, ex-
pressed by w and b.
Note that the resulting weights can be actually viewed
as a ranking of the features: a positive (negative) weight
means the corresponding feature values contribute to the
positive (negative) class; the larger the absolute weight
value, the more important the feature is to the classifica-
tion; weights of zero or very small values have little
contribution to classification. Later we will use this
ranking information to define what we call the measure
of subsequence specificity.
We decide to use SVM for classification because it
has proved to have a number of advantages over other
classification algorithms. First, it can easily handle very
high dimensional input space [11]. Second, it works well
when the problem contain few irrelevant features [11].
Third, it does not require feature selection as a preproc-
essing step, because it can automatically ignore irrele-
vant features during the optimization [11]. Fourth, the
weights contained in the resulting SVM model can be
used to rank the features. All these are essential to solv-
ing the target problem.
In this paper, we use the popular SVM-light imple-
mentation of SVM [12]. Note that when training SVM
classifier, one can select a kernel function suitable for
the studied problem. Motivated by the fact that text
documents are in general linearly separable [11], in our
later experiments we only consider linear SVM model
which is used by default in SVMlight.
3.4 Measure of Good Classifiers
Usually the performance of a classifier is measured by
classification accuracy, precision and/or recall. They are
defined based on a confusion matrix as shown in Table I.
The Precision of the positive class is defined as P=
TP=(TP+FP), while the Recall of the positive class is
defined as R=TP/(TP+FN). The overall Accuracy is de-
fines as A=(TP+TN)/(TP+FP+TN+FN). In our method,
the performance of SVM classifier is not measured by
the overall accuracy, because negative training se-
quences would be much more than those positives, and
the overall accuracy would be heavily affected by the
overwhelming negatives.
Besides, as our goal is to identify the specific subse-
quences in positive species, we hope to maximize the
precision (i.e., the probability of the classifier making a
correct prediction is high). Therefore, we will only use
precision as the measure of classifier’s performance. At
the same time, we will report corresponding recalls for
reference. Note that the precision of the classifier should
be higher than the odds that one makes a random guess
for the positive class label, in order to be considered as a
good classification. Otherwise, the classifier does not
make sense. For example, if a training data set contains
Table 1. Confusion matrix in classification.
Actual number of
positive examples
Actual number of
negative examples
Number of examples
classified as positiveTrue Positive (TP) False Positive (FP)
Number of examples
classified as negativeFalse Negative (FN) True Negative (TN)
100 positive examples and 100 negative examples, if one
randomly guesses for the class label of any of the 200
examples, the probability of making a correct guess is
obviously 50%. If a classifier than a random guess, then
the classifier is useless. A bad classification may suggest
that the information contains in the training examples is
probably insufficient for the classifier to distinguish the
positive examples from the negative examples.
In our method, we take this as an indication that the
positive species probably contains no SSS. In other
words, we consider a classifier is good only when its
precision is much higher than the random guess odds. In
our later experiments, the precision is obtained by test-
ing against a reserved portion of the total data, which is
unseen during the classifier is trained.
3.5 Subsequence Specificity Measure
Note that the resulting classifier will not be used as usual
to predict class label for an unseen sequence. Remember
the task here is to identify the SSS. For this purpose, we
make use of the weights of features in the SVM model,
to define what we call subsequence specificity to meas-
ure how specific a subsequence f is to the positive spe-
cies as follows:
here weight (f) is the weight of feature f obtainable from
the learned SVM model, support (f,+) is a measure of
how many sequences of positive species contain f, and
confidence(f,+) is a measure of how many sequences
that contain f belong to positive species.
The definition of the subsequence specificity is based
on the following observations and expectations:
1) An SSS should occur frequently in the positive
species. This means it should have a high support.
2) An SSS should occur more frequently in the posi-
tive species than in the negative species. This means it
should have a high confidence.
3) Having a high support and/or a high confidence
does not necessarily mean a high distinguishing power,
while this can be reliably judged by the feature weights
in SVM model.
4) SVM may give more weight on features having low
support and/or confidence. This means the weights alone
are insufficient to measure the specificity.
5) Each of the three quantities alone characterizes the
subsequence specificity from an unique angle. Combing
188 B. H. GU et al. / J. Biomedical Science and Engineering 2 (2009) 184-189
SciRes Copyright © 2009 JBiSE
them together should be able to give a more accurate
characterization of the specificity, and thus is expected
to be a more reliable measure.
4.1 The Data Set
The data set used in this project is provided by the Brink-
man Libratory of SFU Molecular Biology and Biochemis-
try Department. It contains known rho-independent tran-
scription terminator sequences for 42 bacteria genomes.
The source of the data set is The Institute for Genomic
Research (TIGR). Each record of the data set contains one
terminator sequence (including left tail, stem and loop, and
right tail) and the taxonomy id of the corresponding bacte-
rium. The total number sequences is 12763, and 18 species
have less than 100 terminator sequences.
Besides, we are also provided with a file containing
the taxonomy information of all bacteria species. Figure
3 shows the taxonomy tree for the 42 bacteria, obtained
from the file. The tree shows the evolution paths for the
42 bacteria. Each node in the tree represents a bacterium
species, and marked by its taxonomy id. The leaf nodes
are enclosed by a parenthesis. Note that among the 42
bacteria, 41 are at leaf nodes, except species 562, which
has species 83334 as its children. This is the only pair of
species that has a direct parent-children relationship in
the given data set.
The task is to identify those DNA subsequences that
are specific to certain species. The hope is that this
should provide a better understanding of bacteria evolu-
tion and how it plays a role in genetic exchange be-
tween pathogens. Note that for a given species, it does
not necessarily have any subsequences specific to it.
Also, when talking about the specificity, one should
consider the corresponding scope. That is, with respect
to which species a subsequence is considered specific.
For example, a subsequence can be specific to bacte-
rium A with respect to another bacterium B. Also, the
subsequence can be specific to bacterium A with respect
to all other bacteria.
Among the 42 species, we evaluate our methods on
four bacteria whose SSS for the TTs are already known
biologically. The four species are: Haemophilus influ-
enzae, Neisseria meningitidis, Pasteurella multocida,
and Pseudomonas aeruginosa [3]. As the SSS for the
four bacteria are all found in the left inverted repeat (i.e.,
the left stem) of corresponding TTs, we will also use the
left stem of each TT sequence to find the SSS. For each
of the four species, we employ the one-against-
all-others approach to build classifier by taking all its
TT sequences (the left stems) as positive sequences
while those of the other 41 species as negative se-
uences. q
4.2. Experimental Methodology
All experiments are done using 5-fold cross validation.
Averaged values over the five round experiments are
reported. The linear classifiers are trained using SVM-
light default settings. The frequent subsequences are
obtained from the positive training sequences by setting
the minimum support to be 1% of the total number of
positive sequences (and the absolute support number no
less than 2). These frequent subsequences represent sta-
tistically significant features with regard to the positive
species, while resulting in substantially lower dimen-
sions compared to the feature space of all potential sub-
sequences. The classification performance is evaluated
by the precisions, while the corresponding recalls are
reported for reference.
4.3 Experimental Results
The results are given in Table II, from which we can see
that the precisions for all the four species are signifi-
cantly higher than the respective random guess odds.
This implies that there are subsequence features that can
distinguish the positive species from all the others. We
then output the top 10 subsequences based on the sub-
sequence specificity measure. The results are given in
Ta b l e 2. In Ta ble 3, the resulting subsequences that are
the closest to the known SSS are underlined.
We can see that for species Pasteurella multocida and
Pseudomonas aeruginosa, the underlined subsequences
are exactly the same as the known SSS. For the Haemo-
philus influenzae and Neisseria memingitidis, our results
are very close to the known SSS: the underlined ones are
very close to the known SSS. In all cases, our top 10
subsequences are almost the substrings of the known
SSS. This shows that our method is rather effective in
finding SSSs. Note that the known SSS are found by
applying biological domain knowledge [3], for example,
discarding those biologically meaningless subsequences
and/or limiting the length of the subsequences, which is
not considered in our method. Therefore, it can be ex-
pected that our method can be improved by incorporat-
ing domain knowledge.
Table 2. Classification performance for the four bacteria.
Name of the
# of Freq
Odds of
influenzae 325 55.81% 14.18% 3.19%
meningitidis 360 99.61% 50.67% 3.62%
multocida 389 48.0% 1.09% 3.73%
aeruginosa 296 67.77% 9.34% 4.28%
B. H. GU et al. / J. Biomedical Science and Engineering 2 (2009) 184-189 189
SciRes Copyright © 2009 JBiSE
Table 3. The top 10 subsequences with the greatest specificity values for the four species.
The Subsequences found for each Bacterium
Top spec
rank Haemophilus Influenzae Neisseria meningitidis Pasteurella multocida Pseudomonas aeruginosa
1 0
In this paper, our goal is to find the species-specific
subsequences for bacteria transcription terminators. By
treating the problem as a classification problem, we
propose a solution based on frequent subsequences and
Support Vector Machines. We first find frequent subse-
quences from the terminator DNA sequences of the posi-
tive species. We then take all such subsequences as fea-
ture space to transform the original DNA sequences into
SVM readable vectors and train SVM classifiers. The
resulting classifiers are used indicators of the existence
of the SSSs. In order to extract the target subsequences
from the SVM model, we make use of the SVM weights
on the features and define a measure called subsequence
specificity. The most significant features based on the
measure are output as the SSS. Our experiments show
that this method is effective. As a conclusion, we have
presented a novel application of classification to compu-
tational biology.
Although the proposed method is designed and evalu-
ated on DNA terminator sequences of bacteria, we be-
lieve that it is applicable to other similar biology tasks
with perhaps minor modifications. As for future work, it
is desirable to evaluate the proposed method on more
bacteria. Besides, the proposed method itself can be im-
proved in many ways. For example, to refine the speci-
ficity measure to make it more accurate, and to find bet-
ter methods to express DNA sequences for classification.
[1] P. Turner, (2000) Molecular Biology.
[2] M. D. Ermolaeva, H. G. Khalak, O. White, H. O. Smith,
and S. L. Salzberg, (2000) Prediction of transcription
terminators in bacterial genomes, Journal of Molecular
Biology, 301, 27-33.
[3] T. Davidsen, E. A. Rodland, K. Lagesen, E. Seeberg, and
T. Rognes, (2004) Biased distribution of dna uptake se-
quences towards genome maintenance genes, Nucleic
Acids Research, 32(3), 1050-1058.
[4] C. J. C. Burges, (1998) A tutorial on support vector ma-
chines for pattern recognition, Knowledge Discovery and
Data Mining, 2(2).
[5] T. Joachims, (2002) Optimizing search engines using
clickthrough data, in Proceedings of the ACM Confer-
ence on Knowledge Discovery and Data Mining
[6] V. Vapnik, (1995) The Nature of Statistical Learning
Theory. Springer.
[7] S. Rong, F. Chen, K. Wang, M. Ester, J. L. Gardy, and F.
S. L. Brinkman, (2003) Frequent-subsequence-based
prediction of outer membrane proteins, in Proceedings of
2003 ACM SIGKDD Conference.
[8] M. Deshpande and G. Karypis, (2002) Evaluation of
techniques for classifying biological sequences, in Pro-
ceedings of Pacific Asia Conference on Knowledge Dis-
covery and Data Mining (PAKDD-2002).
[9] D. Gusfield, (1997) Algorithms on strings, trees, and
sequences: computer science and computational biology,
Cambridge University Press.
[10] G. Salton and C. Buckley, Term weighting approaches in
automatic text retrieval, Information Processing and
Management, 24(5), 1988.
[11] T. Joachims, (1998) Text categorization with support
vector machines: Learning with many relevant features,
in Proceedings of the European Conference on Machine
Learning (ECML-1998).
[12] (2002) Svmlight support vector machine, web.
[13] B. Gu, (2007) Discovering species-specific transcription
terminators for bacteria, School of Computing Science,
Simon Fraser University, Tech. Rep.
[14] J. R. Quinlan, (1993) C4.5: Programs for machine learn-
ing, Morgan Kaufmann Publisher.
[15] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J.
Stone, (1984) Classification and regression trees,
[16] H. O. Lancaster, (1969) The chi-squared distribution,
John & Sons.