A reduced computational load protein coding predictor using equivalent amino acid sequence of DNA string with period-3 based time and frequency domain analysis

doi:10.4236/ajmb.2011.12010


American Journal of Molecular Biology, 2011, 1, 79-86

doi:10.4236/ajmb.2011.12010 Published Online June 2011 (http://www.SciRP.org/journal/ajmb/)

Published Online July 2011 in SciRes. http://www.scirp.org/journal/AJMB


J. K. Meher1, G. N. Dash2, P. K. Meher3, M. K. Raval4

1Department of Computer Science and Engineering, Vikash College of Engineering for Women, Bargarh, Orissa, India;

2School of Physics, Sambalpur University, Orissa, India;

3Department of Embedded System, Institute for Infocomm Research, Singapore;

4PG Department of Chemistry, G.M. College, Sambalpur, Orissa, India.

E-mail: jk_meher@yahoo.co.in, gndash@ieee.org, pkmeher@ieee.org, mraval@yahoo.com

Received 12 May 2011; revised 14 June 2011; accepted 29 June 2011.

ABSTRACT

The development of efficient gene prediction algorithms is a fundamental task in genomics. In genomic signal processing, the basic step in identifying protein coding regions in DNA sequences relies on the period-3 property exhibited by nucleotides in exons. Several approaches based on signal processing tools and numerical representations have been applied to this problem in pursuit of more accurate predictions. This paper presents a new indicator sequence derived from the DNA string, called the amino acid indicator sequence, which is used with existing time-domain and frequency-domain signal processing methods to predict these regions within the billions-long DNA sequences of eukaryotic cells, and which reduces the computational load to one-third. Each triplet of bases, called a codon, instructs the cell machinery to synthesize an amino acid. The codon sequence therefore uniquely identifies an amino acid sequence, which defines a protein. Thus the protein coding region is characterized by the codons underlying the amino acid sequence, and this property is used to detect period-3 regions from the amino acid sequence. Physico-chemical properties of amino acids are used for numerical representation. Various accuracy measures, namely exonic peaks, discriminating factor, sensitivity, specificity, miss rate, wrong rate and approximate correlation, are used to demonstrate the efficacy of the proposed predictor. The proposed method is validated on various organisms using the standard datasets HMR195, Burset and Guigo, and KEGG. The simulation results show that the proposed method is an effective approach for protein coding prediction.

Keywords: Genomics; Bioinformatics; Codon; Coding

Region; Amino Acid Sequence; Fourier Transform;

Antinotch Filter; Periodicity-3; Indicator Sequence

1. INTRODUCTION

Over the past few decades, major advances in molecular biology, coupled with advances in genomic technologies, have led to an exponential growth in genomic sequence data. An important step in genome annotation is the identification of protein coding regions, which is particularly challenging in eukaryotic genomes, where protein coding regions (exons) are usually not contiguous [1]. Because no obvious sequence features separate exons from introns, distinguishing coding from noncoding regions effectively remains a difficult problem in bioinformatics. Gene prediction refers to detecting the locations of the protein-coding regions of genes in a long DNA sequence. For most prokaryotic DNA sequences, the problem is to determine which segments of the given sequence actually code for proteins. For eukaryotic DNA sequences, the problem is to determine how many exons and introns (non-coding regions) the given sequence contains and where the exact exon-intron boundaries lie [2].

For the last few decades, DNA and protein analysis has centered on string matching, either with the goal of obtaining a precise solution, e.g., with dynamic programming, or more commonly a fast solution, e.g., with heuristic techniques such as BLAST and several versions of FASTA [3]. None of the string matching methodologies, however, has led to satisfactory results. A variety of computational algorithms have been developed to predict exons. Most exon-finding algorithms are based on statistical methods, which usually use training data sets of known exon and intron sequences to compute prediction functions. For example, the GenScan algorithm [1,2] measures distinct statistical features of exons and introns within genomes and employs them for prediction via a hidden Markov model (HMM).

Signal processing techniques offer great promise for analyzing genomic data because of the digital nature of the data. Bio-molecular sequences are represented as strings of characters [4,5]; if numerical values are assigned to these characters, the resulting numerical sequences are readily amenable to digital signal processing. In recent years, signal processing approaches have attracted significant attention in genomic DNA research and have become increasingly important for elucidating genome structure, because they can identify hidden periodicities and features that conventional statistical methods do not easily reveal [6,7]. After symbolic DNA sequences are converted to numerical sequences, signal processing tools, typically the discrete Fourier transform (DFT) or digital filters, can be applied to the numerical vectors to study the sequences in the frequency domain [8]. One of the principal features of most DNA sequences is the periodic 3-nucleotide pattern, a well-known phenomenon in eukaryotic exons. DNA periodicity in exons is determined by codon usage frequencies. A great deal of recent work applies signal processing methods to DNA; in particular, the discrete Fourier transform and the antinotch filter have been applied on the basis of the period-3 property.

The DFT of a given input DNA sequence exhibits a peak at the frequency 2π/3 due to the period-3 property of the sequence [9]. A DNA sequence {x(n)} over the four bases can be represented by four binary indicator sequences xA(n), xT(n), xC(n) and xG(n), one per base. The DFT of length N of the binary sequence xA(n) is defined by

$$X_A[k] = \sum_{n=0}^{N-1} x_A(n)\, e^{-j 2\pi k n / N}, \qquad 0 \le k \le N-1 \qquad (1)$$

XT[k], XC[k] and XG[k] are obtained similarly, and the total power at frequency k is then expressed as

$$S[k] = |X_A[k]|^2 + |X_T[k]|^2 + |X_C[k]|^2 + |X_G[k]|^2 \qquad (2)$$

The spectrum S[k] exhibits a peak at k = N/3, which indicates the presence of a coding region in the gene.
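The four-indicator DFT measure of Equations (1) and (2) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the toy input string are assumptions made for the example and are not taken from the paper.

```python
import cmath

def voss_indicators(dna):
    """Return the four binary indicator sequences of a DNA string, one per base."""
    return {b: [1 if c == b else 0 for c in dna] for b in "ATCG"}

def dft_coefficient(x, k):
    """X[k] = sum_n x(n) * exp(-j*2*pi*k*n/N), as in Equation (1)."""
    N = len(x)
    return sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))

def total_power(dna, k):
    """S[k] = sum over the four bases of |X_b[k]|^2, as in Equation (2)."""
    return sum(abs(dft_coefficient(x, k)) ** 2
               for x in voss_indicators(dna).values())

# Toy input (hypothetical): a strictly period-3 string of length N = 30.
periodic = "ATG" * 10
N = len(periodic)
peak = total_power(periodic, N // 3)   # power at k = N/3
off_peak = total_power(periodic, 4)    # power at an unrelated frequency
```

For this artificial string all of the non-DC power sits at k = N/3, so `peak` is large while `off_peak` is numerically zero. Real coding regions are only statistically periodic, so the contrast is much weaker in practice.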

In digital filtering, a filter output YA(n), YT(n), YC(n) and YG(n) is computed for each of the indicator sequences xA(n), xT(n), xC(n) and xG(n), respectively. The sum of the squared magnitudes of these filter outputs is expressed as

$$Y(n) = |Y_A(n)|^2 + |Y_T(n)|^2 + |Y_C(n)|^2 + |Y_G(n)|^2 \qquad (3)$$

A plot of Y(n) has been used to extract the period-3 regions of the DNA effectively [9]. This principle has been applied with the antinotch filter and with multistage filters. The antinotch filter acts as a bandpass filter with passband centered at $\omega_0 = 2\pi/3$ and a minimum stop-band attenuation of about 13 dB; it is the power complement of the corresponding notch filter.
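As an illustration of Equation (3), the sketch below implements one standard second-order antinotch (resonator) structure, the power complement of an allpass-based notch filter, and sums the squared outputs over the four indicator sequences. The pole radius of 0.992, the function names and the toy input are assumptions for this example; the paper does not specify its filter coefficients.

```python
import math

def antinotch(x, radius=0.992, w0=2 * math.pi / 3):
    """Second-order antinotch filter with unit gain at w0 and zero gain at DC.

    Difference equation:
        y[n] = b0*(x[n] - x[n-2]) - a1*y[n-1] - a2*y[n-2]
    The pole radius is an assumed value; closer to 1 gives a narrower passband.
    """
    b0 = (1 - radius ** 2) / 2
    a1 = -(1 + radius ** 2) * math.cos(w0)
    a2 = radius ** 2
    y = []
    for n in range(len(x)):
        x2 = x[n - 2] if n >= 2 else 0.0
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(b0 * (x[n] - x2) - a1 * y1 - a2 * y2)
    return y

def period3_feature(dna):
    """Y(n) = sum over the four bases of |Y_b(n)|^2, as in Equation (3)."""
    outputs = [antinotch([1.0 if c == b else 0.0 for c in dna]) for b in "ATCG"]
    return [sum(v * v for v in column) for column in zip(*outputs)]

# Toy input (hypothetical): a period-3 stretch followed by a constant stretch.
feature = period3_feature("ATG" * 1000 + "A" * 3000)
```

Once the transient has died out, `feature` settles near 2/3 inside the period-3 stretch and decays towards zero inside the constant stretch, which is the behaviour a plot of Y(n) is meant to reveal.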

Tiwari et al. [6] used Fourier analysis to detect probable coding regions in DNA sequences by computing the amplitude profile of the period-3 spectral component, which appears as a sharp peak at frequency f = 1/3 in the power spectrum. The strength of the peak depends markedly on the gene. Anastassiou proposed a mapping technique to optimize gene prediction using Fourier analysis and introduced the color spectrogram for exon prediction [7]. Although this mapping technique produced better results than the plain DFT, it was DNA-sequence dependent and thus required the mapping scheme to be computed before processing for gene prediction. To improve on filtering through DFT computation, Vaidyanathan and Yoon [9] proposed a digital resonator (antinotch filter) to extract the period-3 components. The short-time Fourier transform (STFT) has been combined with entropy-based methods to increase its efficacy in identifying homogeneous regions [10]. Identification of protein coding regions using a modified Gabor-wavelet transform [11] has the advantage of being independent of the window length. An entropy minimization criterion for DNA sequences is discussed by Galleani and Garello [12]. Tuqan and Rushdi [13] explained the 3-periodicity in relation to codon bias using a two-stage digital filter and a multirate DSP model. Criteria for selecting the numerical values that represent genomic sequences are discussed by Akhtar et al. [14,15].

Genomic information is digital in a real sense: it is represented by sequences in which each element is one of a finite number of entities. Such sequences, like DNA and proteins, are represented by character strings in which each character is a letter of an alphabet. The first step of gene prediction in genomic signal processing is the conversion of the string space into a signal space of numbers, called the indicator sequence. The Voss binary representation [16] is the fundamental approach to numerical representation. Various other numerical representations of DNA have been adopted to construct indicator sequences for DSP methods and improve the accuracy of exon prediction, including the Z-curve [17,18], complex numbers [19], quaternions [20], Galois field assignment [21], EIIP [22,23] and paired numeric [14] representations. Another four-indicator sequence, called the relative frequency indicator sequence, incorporates coding statistics such as single-nucleotide, dinucleotide and trinucleotide biases into the algorithm to improve the selectivity and sensitivity of filter methods [24]. A real-number representation that maps A = 1.5, T = –1.5, C = 0.5 and G = –0.5, and shares the complementary property of the complex representation, is used in [14].

Despite the progress made in identifying protein coding regions by computational methods, the performance and efficiency of the prediction methods still need to be improved, and new methods are needed to raise the prediction accuracy. The existing numerical encoding methods can be classified, according to their computational overhead, into four-indicator sequences, three-indicator sequences and single-indicator sequences. A single-indicator sequence reduces the computational overhead by 75% compared with a four-indicator sequence.

This paper develops a new method to predict protein coding regions based on an amino acid indicator sequence obtained from the DNA string. It exploits the fact that exon sequences have a 3-base periodicity, whereas intron sequences lack this feature. The method computes the 3-base periodicity and the background noise of stepwise segments of the target amino acid sequence using distributions in the codon positions of the amino acid sequences. The proposed single-indicator sequence based on amino acids further reduces the computational load to one-third.

The rest of the paper is organized as follows. Section 2 presents the amino acid indicator sequence approach for identifying protein coding regions using the Fourier transform and digital filtering. Section 3 reports the results of the proposed method with accuracy measures, validated on the standard datasets HMR195, Burset and Guigo, and KEGG. Section 4 presents the conclusions.

2. PROPOSED AMINO ACID INDICATOR SEQUENCE

Each triplet of bases, called a codon, instructs the cell machinery to synthesize an amino acid. The codon sequence therefore uniquely identifies an amino acid sequence, which defines a protein. Thus the protein coding region is characterized by the codons underlying the amino acid sequence [2], and this property is used to detect period-3 regions from the amino acid sequence. The period-3 property is related to differences in the statistical distribution of codons between protein-coding and non-coding sections. This periodicity reflects correlations between residue positions along coding sequences.

Figure 1. Central Dogma of molecular biology.

Genomic signal processing extracts the genetic information contained in DNA sequences, RNA sequences and proteins. A DNA sequence is built from an alphabet of four elements, namely the molecules A, T, C and G, called nucleotides or bases. This quaternary code of DNA contains the genetic information of living organisms. Similarly, a protein is a discrete-alphabet sequence that carries genetic information and performs a large number of functions in living organisms. A protein can be represented as a sequence of amino acids. There are twenty distinct amino acids, so a protein can be regarded as a sequence defined on an alphabet of size twenty. The twenty letters used to denote the amino acids are the letters ACDEFGHIKLMNPQRSTVWY of the English alphabet. Some letters that represent amino acids are identical to letters that represent bases; for example, the A in DNA is the base adenine, whereas the A in a protein is the amino acid alanine. Each gene is responsible for the creation of a specific protein when expressed; this is called the central dogma of molecular biology [2], as shown in Figure 1.

The information for expressing a particular protein from a gene is contained in a code that is common to all life. The gene is transcribed into an mRNA molecule, which is then spliced so that it contains only the exons of the gene. Each triplet of adjacent bases of the mRNA is called a codon, and there are 64 possible codons. The mRNA is thus a sequence of codons. Each codon instructs the cell machinery to add an amino acid to the growing chain according to the genetic code. When all the codons in the mRNA are exhausted, the result is a long chain of amino acids: the protein corresponding to the original gene.

In practice, numerical values are assigned to the four letters of the DNA sequence so that signal processing operations such as Fourier transformation, digital filtering and time-frequency analysis (e.g., wavelet transforms) can be performed. Similarly, once numerical values are assigned to the twenty amino acids of a protein sequence, useful signal processing can be carried out.

Table 1. The genetic code.

| S.N. | Symbol | Amino acid | Codons |
|---|---|---|---|
| 1 | A | Alanine | GCA, GCC, GCG, GCT |
| 2 | C | Cysteine | TGC, TGT |
| 3 | D | Aspartic acid | GAC, GAT |
| 4 | E | Glutamic acid | GAA, GAG |
| 5 | F | Phenylalanine | TTC, TTT |
| 6 | G | Glycine | GGA, GGC, GGT, GGG |
| 7 | H | Histidine | CAC, CAT |
| 8 | I | Isoleucine | ATA, ATC, ATT |
| 9 | K | Lysine | AAA, AAG |
| 10 | L | Leucine | TTA, TTG, CTA, CTC, CTG, CTT |
| 11 | M | Methionine | ATG |
| 12 | N | Asparagine | AAC, AAT |
| 13 | P | Proline | CCA, CCC, CCG, CCT |
| 14 | Q | Glutamine | CAA, CAG |
| 15 | R | Arginine | AGA, AGG, CGA, CGC, CGG, CGT |
| 16 | S | Serine | AGC, AGT, TCA, TCC, TCG, TCT |
| 17 | T | Threonine | ACA, ACC, ACG, ACT |
| 18 | V | Valine | GTA, GTC, GTG, GTT |
| 19 | W | Tryptophan | TGG |
| 20 | Y | Tyrosine | TAC, TAT |

The proposed predictor is based on the analysis of the amino acid sequence. In this work the DNA sequence is converted to an amino acid sequence, i.e., the A, T, C, G language is converted to the amino acid language [14]. Each group of three nucleotide characters (a codon) is replaced by one letter of the twenty-letter amino acid alphabet. The mapping from codons to amino acids is many-to-one (Table 1). For a given DNA sequence xB(n), where B denotes the nucleotide bases, the corresponding amino acid sequence is obtained as xR(n), where R denotes the 20 amino acids. For example,

$$x_B(n) = [\text{ATGGGTCCAGCTCCAGTTTTCCCAAATTCGCGGAAGCCGGCGACACT}]$$

$$x_R(n) = [\text{MGPAPVFPNSRKPAT}]$$
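The conversion from the DNA string to the amino acid string can be sketched as below. The dictionary follows Table 1 with the standard codon assignments (GAC/GAT for aspartic acid and TAC/TAT for tyrosine). Reading starts at the first base in frame 0, and translation is assumed to stop at the first codon that is not in the table (a stop codon or an ambiguous symbol); both choices are assumptions of this example, since the paper does not state how such cases are handled.

```python
# Genetic code as in Table 1 (amino acid -> codons); stop codons are not listed there.
GENETIC_CODE = {
    "A": "GCA GCC GCG GCT", "C": "TGC TGT", "D": "GAC GAT", "E": "GAA GAG",
    "F": "TTC TTT", "G": "GGA GGC GGG GGT", "H": "CAC CAT", "I": "ATA ATC ATT",
    "K": "AAA AAG", "L": "TTA TTG CTA CTC CTG CTT", "M": "ATG", "N": "AAC AAT",
    "P": "CCA CCC CCG CCT", "Q": "CAA CAG", "R": "AGA AGG CGA CGC CGG CGT",
    "S": "AGC AGT TCA TCC TCG TCT", "T": "ACA ACC ACG ACT",
    "V": "GTA GTC GTG GTT", "W": "TGG", "Y": "TAC TAT",
}

# Invert the table: the codon -> amino acid mapping is many-to-one.
CODON_TO_AA = {codon: aa
               for aa, codons in GENETIC_CODE.items()
               for codon in codons.split()}

def translate(dna):
    """Convert a DNA string to its amino acid string, codon by codon (frame 0)."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TO_AA.get(dna[i:i + 3])
        if aa is None:  # stop codon or unknown symbol: assumed to end translation
            break
        protein.append(aa)
    return "".join(protein)

dna = "ATGGGTCCAGCTCCAGTTTTCCCAAATTCGCGGAAGCCGGCGACACT"
protein = translate(dna)
```

Applied to the example string above, this yields the 15-residue sequence given for xR(n); the two trailing bases do not form a complete codon and are ignored.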

Most relevant for the application of signal processing tools is the assignment of numerical properties to the amino acid alphabet to form the amino acid indicator sequence. There are several approaches to converting genomic information into numeric sequences using different representations. In this paper, physico-chemical properties of amino acids such as volume, charge, area, EIIP, dipole moment and alpha propensity, obtained from the HyperChem Pro 8.0 software (Hypercube Inc., USA), are used for the analysis of the proteins (Table 2). The numerical sequence that results from substituting these values is called the amino acid indicator sequence.

Each amino acid is associated with a unique alpha propensity value. The indicator sequence is obtained by substituting this value for each residue of the amino acid sequence. For the amino acid sequence MGPAPVFPNSRKPAT,

$$x_{AA}(n) = \{1.501,\ 1.058,\ 0.519,\ 1.409,\ 0.519,\ 1.694,\ 1.966,\ 0.519,\ 0.434,\ 0.774,\ 0.240,\ 0.181,\ 0.519,\ 1.409,\ 0.828\}$$
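The substitution step can be sketched in a few lines. The alpha propensity values are copied from Table 2; the function name and the choice of a plain dictionary lookup are assumptions made for illustration.

```python
# Alpha propensity values copied from Table 2.
ALPHA = {
    "A": 1.409, "R": 0.240, "N": 0.434, "D": 0.192, "C": 1.069,
    "Q": 0.333, "E": 0.175, "G": 1.058, "H": 0.558, "L": 1.702,
    "I": 1.990, "K": 0.181, "M": 1.501, "F": 1.966, "P": 0.519,
    "S": 0.774, "T": 0.828, "W": 1.314, "Y": 0.979, "V": 1.694,
}

def indicator_sequence(protein, table=ALPHA):
    """Replace each amino acid letter by its numerical property value."""
    return [table[aa] for aa in protein]

x_aa = indicator_sequence("MGPAPVFPNSRKPAT")
```

Passing a different property table (for example EIIP or dipole moment from Table 2) as `table` gives the corresponding alternative indicator sequence.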

Table 2. Physico-chemical properties of amino acids.

| Amino acid | Alpha | EIIP | Dipole moment |
|---|---|---|---|
| A | 1.409 | 0.0373 | 5.937 |
| R | 0.240 | 0.0959 | 37.5 |
| N | 0.434 | 0.0036 | 18.89 |
| D | 0.192 | 0.1263 | 29.49 |
| C | 1.069 | 0.0829 | 10.74 |
| Q | 0.333 | 0.0761 | 39.89 |
| E | 0.175 | 0.0058 | 42.52 |
| G | 1.058 | 0.0050 | 0.0 |
| H | 0.558 | 0.0242 | 20.44 |
| L | 1.702 | 0.0000 | 3.782 |
| I | 1.990 | 0.0000 | 3.371 |
| K | 0.181 | 0.0371 | 50.02 |
| M | 1.501 | 0.0823 | 8.589 |
| F | 1.966 | 0.0946 | 5.98 |
| P | 0.519 | 0.0198 | 7.916 |
| S | 0.774 | 0.0829 | 9.836 |
| T | 0.828 | 0.0941 | 9.304 |
| W | 1.314 | 0.0548 | 10.73 |
| Y | 0.979 | 0.0516 | 10.41 |
| V | 1.694 | 0.0057 | 2.692 |

One of the advantages of using the amino acid indicator sequence is that it reduces the computational load to one-third of that required for processing a DNA indicator sequence.

This technique is used to predict whether a given sequence frame of length N belongs to a coding region. A sliding frame is used: the N amino acids inside the frame are rated, and the frame is then shifted downstream by a fixed number of residues. The output for each rated window is assigned to the residues at that position. The three-base periodicity of the sequence, which appears in protein coding regions as a sharp peak at frequency f = 1/3 in the power spectrum, helps in the prediction of exons.

The discrete Fourier transform (DFT) has been used to predict coding regions in the equivalent amino acid sequence of a DNA string. As a consequence of the non-uniform distribution of codons in coding regions, a three-periodicity is present in most coding regions of the genome, which show a notable peak at the frequency component N/3 when their DFT is calculated. The DFT of length N of the input amino acid indicator sequence xAA(n) is defined by

$$X_{AA}(k) = \sum_{n=0}^{N-1} x_{AA}(n)\, e^{-j 2\pi k n / N}, \qquad 0 \le k \le N-1 \qquad (4)$$

where AA denotes the amino acids. The power of the DFT coefficients is given by

$$S(k) = |X_{AA}(k)|^2 \qquad (5)$$

A plot of S(k) against k shows a peak at k = N/3 due to the period-3 property, which indicates the presence of coding regions.
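The sliding-frame evaluation of Equations (4) and (5) can be sketched as follows. The frame length, the step size and the toy indicator values are assumptions chosen for illustration; the paper does not report the values it used.

```python
import cmath

def spectral_power(frame, k):
    """S(k) = |X(k)|^2 for one frame, following Equations (4) and (5)."""
    N = len(frame)
    X = sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
    return abs(X) ** 2

def sliding_period3_power(x, frame_len, step):
    """Rate each frame by its power at k = N/3, then shift the frame by `step`."""
    assert frame_len % 3 == 0, "frame length must be a multiple of 3"
    k = frame_len // 3
    return [spectral_power(x[start:start + frame_len], k)
            for start in range(0, len(x) - frame_len + 1, step)]

# Toy indicator sequence (hypothetical values): a period-3 stretch, then a flat one.
x = [1.5, 0.2, 0.2] * 20 + [0.5] * 60
profile = sliding_period3_power(x, frame_len=30, step=30)
```

Here `profile` holds one value per frame; the frames covering the period-3 stretch score high and the frames covering the flat stretch score essentially zero. Only the single coefficient at k = N/3 is computed per frame, which avoids evaluating the full DFT.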

Given the validity of this result, the antinotch filter has been applied to amino acid sequences to predict coding regions, using a sliding frame along the sequence. In the digital filtering method, the filter output YAA(n) is computed for the indicator sequence xAA(n), where AA denotes the 20 amino acids. The squared magnitude of the filter output is expressed as

$$Y(n) = |Y_{AA}(n)|^2 \qquad (6)$$

A plot of Y(n) has been used to extract the period-3 regions of the sequence effectively. The prediction of protein coding regions can be summarized as the following sequence of steps.

1. Convert the DNA string to the equivalent amino acid sequence using the three-character codon code.
2. Substitute physico-chemical properties of the amino acids to construct the indicator sequence.
3. Apply this sequence to the DFT or a digital filter to detect period-3 regions.
4. Observe the peaks to determine the protein coding regions.
5. Evaluate the assessment parameters to check accuracy.
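Steps 1 to 3 above, followed by Equation (6), can be combined into one short end-to-end sketch. Several details are assumptions of this example rather than statements from the paper: the codon table is the standard genetic code written in compact TCAG order, stop codons are given the value 0.0, the EIIP column of Table 2 is used as the property, and the antinotch filter is a generic second-order resonator with an assumed pole radius of 0.992.

```python
import math

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TO_AA = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# EIIP values copied from Table 2.
EIIP = {
    "A": 0.0373, "R": 0.0959, "N": 0.0036, "D": 0.1263, "C": 0.0829,
    "Q": 0.0761, "E": 0.0058, "G": 0.0050, "H": 0.0242, "L": 0.0000,
    "I": 0.0000, "K": 0.0371, "M": 0.0823, "F": 0.0946, "P": 0.0198,
    "S": 0.0829, "T": 0.0941, "W": 0.0548, "Y": 0.0516, "V": 0.0057,
}

def translate(dna):
    """Step 1: DNA string -> amino acid string (frame 0, '*' for stop codons)."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def antinotch(x, radius=0.992, w0=2 * math.pi / 3):
    """Step 3: second-order antinotch filter with unit gain at w0 (assumed design)."""
    b0 = (1 - radius ** 2) / 2
    a1 = -(1 + radius ** 2) * math.cos(w0)
    a2 = radius ** 2
    y = []
    for n in range(len(x)):
        x2 = x[n - 2] if n >= 2 else 0.0
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(b0 * (x[n] - x2) - a1 * y1 - a2 * y2)
    return y

def predict_profile(dna, table=EIIP):
    """Steps 1-3 and Equation (6): return Y(n), one value per amino acid."""
    protein = translate(dna)
    x = [table.get(aa, 0.0) for aa in protein]  # step 2; stop codons -> 0.0 (assumption)
    return [v * v for v in antinotch(x)]

profile = predict_profile("ATGGGTCCAGCTCCAGTTTTCCCAAATTCGCGGAAGCCGGCGACACT")
```

The returned list has one entry per residue, so it is one-third as long as the DNA input. Steps 4 and 5 (locating peaks and computing the accuracy measures) would operate on this profile and are left out of the sketch.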

3. RESULTS AND DISCUSSION

This paper proposes the use of the amino acid indicator sequence for predicting protein coding regions in gene sequences. Digital filtering techniques, such as the antinotch filter, were used to detect the protein coding segments with the existing indicator sequences as well as with the proposed single-indicator sequences based on physico-chemical properties, for several organisms. Three data sets, Burset and Guigo [25], HMR195 [26] and KEGG [27], were used to validate the proposed method. The proposed method performed well in a good number of cases.

The accuracy measures used to evaluate the different methods in this paper are the exon-intron discrimination factor D [23], sensitivity (SN), specificity (SP), miss rate (MR), wrong rate (WR) [3,15] and approximate correlation (AC) [28]. The discriminating factor is defined as

$$D = \frac{\text{Lowest of exon peaks}}{\text{Highest peak in noncoding regions}} \qquad (7)$$

The miss rate and wrong rate are defined as

$$M_R = \frac{ME}{AE} \qquad (8)$$

$$W_R = \frac{WE}{PE} \qquad (9)$$

where ME = missing exons, AE = actual exons, WE = wrong exons, PE = predicted exons.

Table 3. Summary of performance evaluation of amino acid indicator sequence.

| Dataset | D | SN | SP | WR | MR | AC |
|---|---|---|---|---|---|---|
| Burset and Guigo | 3.8 | 1 | 0.85 | 0 | 0.33 | 0.93 |
| HMR195 | 3.5 | 1 | 0.82 | 0 | 0.25 | 0.91 |
| KEGG | 2.2 | 1 | 0.75 | 0 | 0.28 | 0.89 |

We define TP (true positives) as the number of coding regions predicted as coding, TN (true negatives) as the number of noncoding regions predicted as noncoding, FP (false positives) as the number of noncoding regions predicted as coding, and FN (false negatives) as the number of coding regions predicted as noncoding. Based on these parameters, sensitivity and specificity are defined as

$$S_N = \frac{T_P}{T_P + F_N} \qquad (10)$$

$$S_P = \frac{T_P}{T_P + F_P} \qquad (11)$$

These are widely used measures of accuracy for gene prediction programs. Another measure that captures both specificity and sensitivity is the approximate correlation (AC), defined by

$$AC = \left[\frac{1}{4}\left(\frac{T_P}{T_P + F_N} + \frac{T_P}{T_P + F_P} + \frac{T_N}{T_N + F_P} + \frac{T_N}{T_N + F_N}\right) - 0.5\right] \times 2 \qquad (12)$$
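The accuracy measures defined in this section can be collected into a few helper functions, sketched below. The function names are assumptions, specificity follows the gene-prediction convention TP/(TP + FP), and all denominators are assumed to be non-zero; the input counts in the usage lines are made-up numbers, not results from the paper.

```python
def discriminating_factor(exon_peaks, noncoding_peaks):
    """D: lowest exon peak divided by highest peak in noncoding regions."""
    return min(exon_peaks) / max(noncoding_peaks)

def miss_rate(missing_exons, actual_exons):
    """MR = ME / AE."""
    return missing_exons / actual_exons

def wrong_rate(wrong_exons, predicted_exons):
    """WR = WE / PE."""
    return wrong_exons / predicted_exons

def sensitivity(tp, fn):
    """SN = TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """SP = TP / (TP + FP)."""
    return tp / (tp + fp)

def approximate_correlation(tp, tn, fp, fn):
    """AC = (ACP - 0.5) * 2, where ACP averages four conditional probabilities."""
    acp = 0.25 * (tp / (tp + fn) + tp / (tp + fp)
                  + tn / (tn + fp) + tn / (tn + fn))
    return (acp - 0.5) * 2

# Hypothetical counts, for illustration only.
d = discriminating_factor(exon_peaks=[4.0, 6.0], noncoding_peaks=[2.0, 1.0])
ac_perfect = approximate_correlation(tp=5, tn=5, fp=0, fn=0)
ac_partial = approximate_correlation(tp=8, tn=8, fp=2, fn=2)
```

A perfect prediction gives AC = 1, and D > 1 means that every exon peak exceeds every noncoding peak, so a single threshold separates them.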

If D is greater than one (D > 1), all exons are identified. High sensitivity and specificity are desirable for higher accuracy, and low miss rate and wrong rate are desirable for a better result. The genes listed were processed with the proposed single-indicator sequences using the filtering method, and the corresponding gene prediction measures were evaluated. Table 4 lists the observations for eight genes from the Burset and Guigo, HMR195 and KEGG datasets. In all the examples cited, the proposed encoding method shows better discrimination than the methods using multiple indicator sequences. The simulation results show a high discriminating factor, sensitivity and specificity with low miss rate and wrong rate for the proposed method.

Table 3 summarizes the average performance of the proposed method on each dataset. The simulation results of the filtering approach on selected genes from the three datasets are shown in Table 4. The single-indicator sequences based on the amino acid sequence show high peaks at protein coding locations.

Table 4. Simulation results on selected genes from Burset and Guigo dataset, HMR195 and KEGG dataset.

| Gene name, Acc. No. | Numerical representation | D | SN | SP | MR | WR | AC |
|---|---|---|---|---|---|---|---|
| HSODF2, X74614, Homo sapiens ODF2 gene | Voss | 2.75 | 1 | 0.66 | 0 | 0.5 | 0.84 |
| | Real numbers | 2.1 | 1 | 0.66 | 0 | 0.5 | 0.84 |
| | Relative frequency | 3 | 1 | 0.66 | 0 | 0.5 | 0.84 |
| | EIIP | 2 | 1 | 0.66 | 0 | 0.5 | 0.84 |
| | Amino acid | 3.5 | 1 | 0.75 | 0 | 0.33 | 0.89 |
| PP32R1, AF00A216, Homo sapiens | Voss | 11 | 1 | 1 | 0 | 0 | 1 |
| | Real numbers | 12 | 1 | 1 | 0 | 0 | 1 |
| | Relative frequency | 14 | 1 | 1 | 0 | 0 | 1 |
| | EIIP | 20.6 | 1 | 1 | 0 | 0 | 1 |
| | Amino acid | 22 | 1 | 1 | 0 | 0 | 1 |
| Humbetgloa, 26462, human betaglobin | Voss | 1.2 | 1 | 0.75 | 0 | 0.25 | 0.9 |
| | Real numbers | 1 | 1 | 0.66 | 0 | 0.5 | 0.83 |
| | Relative frequency | 1.04 | 1 | 0.66 | 0 | 0.5 | 0.83 |
| | EIIP | 1.5 | 1 | 0.75 | 0 | 0.25 | 0.91 |
| | Amino acid | 1.8 | 1 | 0.75 | 0 | 0.25 | 0.91 |
| CLDN3, AF007189, Homo sapiens Claudin 3 | Voss | 1.45 | 1 | 0.66 | 0 | 0.33 | 0.89 |
| | Real numbers | 1 | 1 | 0.66 | 0 | 0.33 | 0.89 |
| | Relative frequency | 1.04 | 1 | 0.5 | 0 | 0.5 | 0.78 |
| | EIIP | 4 | 1 | 0.5 | 0 | 0.5 | 0.78 |
| | Amino acid | 1.1 | 1 | 0.66 | 0 | 0.33 | 0.86 |
| D p19, AF061327, Homo sapiens cyclin-dependent kinase 4 inhibitor | Voss | 2.2 | 1 | 0.66 | 0 | 0.5 | 0.86 |
| | Real numbers | 1.33 | 1 | 0.66 | 0 | 0.5 | 0.86 |
| | Relative frequency | 3 | 1 | 0.66 | 0 | 0.5 | 0.86 |
| | EIIP | 1.33 | 1 | 0.66 | 0 | 0.5 | 0.86 |
| | Amino acid | 2.5 | 1 | 0.66 | 0 | 0.5 | 0.86 |
| GalR2, AF042784, Mus musculus galanin receptor type 2 gene | Voss | 2 | 0.66 | 0.66 | 0.5 | 0.5 | 0.66 |
| | Real numbers | 1.33 | 1 | 0.66 | 0 | 0.5 | 0.86 |
| | Relative frequency | 3.2 | 1 | 0.66 | 0 | 0.5 | 0.86 |
| | EIIP | 5 | 1 | 1 | 0 | 0 | 1 |
| | Amino acid | 5.2 | 1 | 1 | 0 | 0 | 1 |
| NC_002650, Treponema denticola U9b plasmid pTS1 | Voss | 2 | 1 | 0.66 | 0 | 0.5 | 0.86 |
| | Real numbers | 1.3 | 1 | 0.66 | 0 | 0.5 | 0.86 |
| | Relative frequency | 1.8 | 1 | 0.66 | 0 | 0.5 | 0.86 |
| | EIIP | 2 | 1 | 1 | 0 | 0 | 1 |
| | Amino acid | 2.2 | 1 | 1 | 0 | 0 | 1 |
| NC_004767, Helicobacter pylori plasmid pHP51 | Voss | 1.1 | 1 | 0.6 | 0 | 0.5 | 0.86 |
| | Real numbers | 1.3 | 1 | 0.6 | 0 | 0.5 | 0.86 |
| | Relative frequency | 1.3 | 1 | 0.75 | 0 | 0.33 | 0.89 |
| | EIIP | 1.4 | 1 | 0.75 | 0 | 0.33 | 0.89 |
| | Amino acid | 1.8 | 1 | 0.75 | 0 | 0.33 | 0.89 |

The gene sequence F56F11.4a from chromosome III of C. elegans (accession number AF099922), HUMELAFIN (D13156) of Homo sapiens and ODF2 of Homo sapiens are used for detecting protein coding regions. All the exons of these three genes are correctly identified, as shown in Figure 2. In particular, Figure 2(a) shows the exon prediction result for gene F56F11.4a, with five peaks corresponding to the exon locations. The simulation, carried out in MATLAB 7.0, shows that the proposed technique identifies even short exons: the first peak of gene F56F11.4a is clearly visible, whereas it is not pronounced with traditional methods. Similarly, Figure 2(b) shows two peaks for the two exons of gene HUMELAFIN, and Figure 2(c) shows two peaks for the two exons of gene ODF2.

Figure 2. Gene prediction using the amino acid indicator sequence of genes (a) F56F11.4a of C. elegans chromosome III showing five exons; (b) HUMELAFIN of Homo sapiens showing two exons; (c) ODF2 of Homo sapiens showing two exons.

The length of the amino acid sequence is one-third of that of the DNA sequence. Hence the exon locations found on the amino acid sequence need to be mapped back to DNA coordinates.
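The coordinate mapping implied here amounts to multiplying residue positions by three. A minimal sketch follows, using 0-based, end-exclusive coordinates; the function names and the `frame_offset` parameter (the DNA position of the first translated base) are hypothetical and are not described in the paper.

```python
def aa_index_to_dna_span(aa_index, frame_offset=0):
    """Return the (start, end) DNA positions of the codon behind one residue."""
    start = frame_offset + 3 * aa_index
    return start, start + 3

def aa_region_to_dna_region(aa_start, aa_end, frame_offset=0):
    """Map a half-open residue interval [aa_start, aa_end) to DNA coordinates."""
    return frame_offset + 3 * aa_start, frame_offset + 3 * aa_end

# A peak covering residues 2..4 corresponds to DNA positions 6..14.
region = aa_region_to_dna_region(2, 5)
```

Because each residue stands for three bases, predicted exon boundaries obtained this way have a resolution of one codon rather than one nucleotide.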

The proposed indicator sequences, built from the alpha propensity, dipole moment and EIIP of amino acids, produce sharp peaks at exon locations and suppress false exons. False exons are peaks observed at intron locations, which do not take part in protein coding. Thus the proposed method is more sensitive in detecting the true exons that take part in protein coding. Moreover, because each codon is represented by a single amino acid, processing the reduced sequence cuts the computation time to one-third of that needed for the original DNA sequence. Thus the proposed method is not only fast but also efficient.

4. CONCLUSIONS

The proposed predictor for protein coding regions based on the amino acid indicator sequence shows good efficacy. Its efficacy was evaluated by means of accuracy measures, namely exonic peaks, discriminating factor, sensitivity, specificity, approximate correlation, wrong rate and miss rate, which show better performance in coding region detection than the existing methods. Because each codon is represented by a single amino acid, processing the reduced sequence cuts the computation time to one-third of that needed for the original DNA sequence. In addition, the filtering technique with the amino acid indicator sequence detects smaller exon regions by producing high peaks, and minimizes the power in introns, giving stronger suppression of the intron regions. Thus the proposed method is not only fast but also more sensitive.

REFERENCES

[1] Burge, C.B. and Karlin, S. (1998) Finding the genes in genomic DNA. Current Opinion in Structural Biology, 8, 346-354. doi:10.1016/S0959-440X(98)80069-9

[2] Gusfield, D. (1997) Algorithms on strings, trees, and sequences: Computer science and computational biology. Cambridge University Press, Cambridge. doi:10.1017/CBO9780511574931

[3] Wang, Z., Chen, Y.Z. and Li, Y.X. (2004) A brief review of computational gene prediction methods. Genomics Proteomics Bioinformatics, 2, 216-221.

[4] Fickett, J.W. (1982) Recognition of protein coding regions in DNA sequences. Nucleic Acids Research, 10, 5303-5318. doi:10.1093/nar/10.17.5303

[5] Silverman, B.D. and Linsker, R. (1986) A measure of DNA periodicity. Journal of Theoretical Biology, 118, 295-300. doi:10.1016/S0022-5193(86)80060-1

[6] Tiwari, S., Ramachandran, S. and Bhattacharya, A. (1997) Prediction of probable gene by Fourier analysis of genomic sequences. CABIOS, 13, 263-270.

[7] Anastassiou, D. (2000) Frequency-domain analysis of biomolecular sequences. Bioinformatics, 16, 1073-1081. doi:10.1093/bioinformatics/16.12.1073

[8] Anastassiou, D. (2001) Genomic Signal Processing. IEEE Signal Processing Magazine, 8-20. doi:10.1109/79.939833

[9] Vaidyanathan, P.P. and Yoon, B.J. (2002) Digital filters for gene prediction applications. Proceedings of the 36th Asilomar Conference on Signals, Systems and Computers, 3-6 November 2002, 306-310.

[10] Fuentes, A., Ginori, J. and Abalo, R. (2008) A new predictor of coding regions in genomic sequences using a combination of different approaches. International Journal of Biological, Biomedical and Medical Sciences.

[11] Jesus, P., Chalco, M. and Carrer, H. (2008) Identification of protein coding regions using the modified Gabor-wavelet transform. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5, 198-207.

[12] Galleani, L. and Garello, R. (2010) The minimum entropy mapping spectrum of a DNA sequence. IEEE Transactions on Information Theory, 56, 771-783. doi:10.1109/TIT.2009.2037041

[13] Tuqan, J. and Rushdi, A. (2008) A DSP approach for finding the codon bias in DNA sequences. IEEE Journal of Selected Topics in Signal Processing, 2, 343-356. doi:10.1109/JSTSP.2008.923851

[14] Akhtar, M., Epps, J. and Ambikairajah, E. (2007) On DNA numerical representations for period-3 based exon prediction. Proceedings of IEEE International Workshop on Genomic Signal Processing and Statistics, Tuusula, 1-4. doi:10.1109/GENSIPS.2007.4365821

[15] Akhtar, M., Epps, J. and Ambikairajah, K. (2008) Signal processing in sequence analysis: Advances in eukaryotic gene prediction. IEEE Journal of Selected Topics in Signal Processing, 2, 310-321. doi:10.1109/JSTSP.2008.923854

[16] Voss, R. (1992) Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Physical Review Letters, 68, 3805-3808. doi:10.1103/PhysRevLett.68.3805

[17] Zhang, R. and Zhang, C.T. (1994) Z curves, an intuitive tool for visualizing and analyzing the DNA sequences. Journal of Biomolecular Structure & Dynamics, 11, 767-782.

[18] Rushdi, A. and Tuqan, J. (2006) Gene identification using the Z-curve representation. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Toulouse, 14-19 May 2006, 1024-1027.

[19] Cristea, P.D. (2002) Genetic signal representation and analysis. Proc. SPIE Conference, International Biomedical Optics Symposium (BIOS’02), 4623, 77-84.

[20] Brodzik, A.K. and Peters (2005) Symbol-balanced quaternionic periodicity transform for latent pattern detection in DNA sequences. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 5, 373-376.

[21] Rosen, G.L. (2006) Signal processing for biologically-inspired gradient source localization and DNA sequence analysis. Ph.D. Thesis, Georgia Institute of Technology, Atlanta.

[22] Nair, T.M., Tambe, S.S. and Kulkarni, B.D. (1994) Application of artificial neural networks for prokaryotic transcription terminator prediction. FEBS Letters, 346, 273-277. doi:10.1016/0014-5793(94)00489-7

[23] Nair, A.S. and Sreenathan, S.P. (2006) A coding measure scheme employing electron-ion interaction pseudopotential (EIIP). Bioinformation, 1, 197-202.

[24] Nair, A.S. and Sreenathan, S.P. (2006) An improved digital filtering technique using frequency indicators for locating exons. Journal of the Computer Society of India, 36.

[25] Burset, M. and Guigó, R. (1996) Evaluation of gene structure prediction programs. Genomics, 34, 353-367. doi:10.1006/geno.1996.0298

[26] Rogic, S., Mackworth, A. and Ouellette, F. (2001) Evaluation of genefinding programs on mammalian sequences. Genome Research, 11, 817-832. doi:10.1101/gr.147901

[27] Kanehisa, M. and Goto, S. (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28, 27-30. doi:10.1093/nar/28.1.27

[28] Biju, I. and Gajendra P.S.R. (2004) EGPred: Prediction of eukaryotic genes using ab initio methods after combining with sequence similarity approaches. Genome Research, 14, 1756-1766. doi:10.1101/gr.2524704