apoptosis mechanism and functions of various
ABSTRACT apoptosis proteins, it will be helpful to obtain infor-
mation about their subcellular location. This is
Apoptosis proteins have a central role in thebecause the subcellular location of apoptosis proteins
development and homeostasis of an organism.is closely related to their function [5,6]. It has been
These proteins are very important for under-known that there are 732 archetypical proteins with
standing the mechanism of programmed cellapoptosis domains [7], and only 98 of these pro-
death, and their function is related to theirteins are known to be the apoptosis protein (for more
types. The apoptosis proteins are categorizeddetails, one can visit: http://www.apoptosis-db.org).
into the following four types: (1) Cytoplasmic Scientists usually deal with a number of protein
protein; (2) Plasma membrane-bound protein; sequences already known belonging to apoptosis pro-
(3) Mitochondrial inner and outer proteins; (4) teins. However, it is both time-consuming and costly
Other proteins. A novel method, the Hilbert-to determine which specific subcellular location a
Huang transform, is applied for predicting the given apoptosis protein belongs to. Confronted with
type of a given apoptosis protein with supportsuch a situation, can we develop a fast and effective
vector machine. High success rates wereway to predict the subcellular location for a given
obtained by the re-substitute test (98/98=100%)apoptosis protein based on its amino acid sequence?
and jackknife test (91/98 = 92.9%).Recently, Guo-ping Zhou [7] attempted to identify
the subcellular location of apoptosis proteins accord-
ing to their sequences by means of the covariant
discriminant function, which was established on the
basis of the Mahalanobis distance and Chou's
invariance theorem [7,8,9].The results were quite
1. INTRODUCTIONpromising, indicating that the subcellular location of
Apoptosis, or programmed cell death, is a fundamen-apoptosis proteins are predictable to a considerably
tal process controlling normal tissue homeostasis by accurate extent if a good vector representation of pro-
regulating a balance between cell proliferation and tein can be established. It is expected that, with a con-
death [1]. This process entails the autolytic degrada-tinuous improvement of vector representation meth-
tion of cellular components, and is characterized byods by incorporating amino acid properties, and by
blebbing of cell membranes, shrinkage of cell vol-using more powerful mathematics methods, some the-
umes, and condensation of nuclei [2], and is currentlyory predicting method might eventually become a use-
an area of intense investigation. Cell death andful tool in this area because the function of an
renewal are responsible for maintaining the proper apoptosis protein is closely related to its subcellular
turnover of cells, which ensures a constant controlled location. The present study was initiated in an
flux of fresh cells. Programmed cell death and cellattempt to address this problem.
proliferation are tightly coupled. When apoptosis Chou and Elrod made an extensive research in pre-
malfunctions, a variety of formidable diseases can dicting subcellular location mainly based on the
ensue: blocking apoptosis is associated with cancer amino acid composition. Subsequently, in order to
and autoimmune disease, whereas unwanted take into account the sequence-order effects and
apoptosis can possibly lead to ischemic damage orimproved the prediction quality, Chou has further
neurodegenerative disease [3]. Apoptosis is consid-incorporated the quasi-sequence order effect [5] and
ered to have a key role in these several devastatingintroduced the concept of “pseudo-amino-acid com-
diseases and, in principle, provides many targets forposition[9]. For example, Chou [10] classified mem-
therapeutic intervention [4]. To understand the brane proteins into five different types and proposed
Keywords:Hilbert Huang transform; Sup-
port vector machine; Subcellular location
predict
Hilbert Huang transform for predicting proteins
subcellular location
Hilbert Huang transform for predicting proteins
subcellular location
Feng Shi, Qiu-Jian Chen & Na-na Li
School of Science, Huazhong Agricultural University, Wuhan, Hubei, China. Correspondence should be addressed to Feng Shi
(shifeng@mail.hzau.edu.cn).
J. Biomedical Science and Engineering, 2008, 1, 59-63Scientific
Research
Publishing
JBiSE
Published Online May 2008 in SciRes.http://www.srpublishing.org/journal/jbise
SciRes Copyright ©2008
a covariant discriminant algorithm to predict theand the lower envelop (linked by local minima) are
types of membrane proteins. Recently, Cai et al. [11] zero at every point.
applied neural network to this problem. To improve TheEMDprocessisasfollows.According to
the prediction quality, Chou [5] proposed a newHilbert-Huang transform(HHT)[14], once the
method in which the covariant discriminate algo-extrema of a time seriesx(t) are identified, all the
rithmwasaugmentedtoincorporatethequasi-local maxima and minima are connected by two spe-
sequence-order effect. This method uses the amino cial lines as the upper and lower envelopes respec-
acidcompositionandthesequence-order-couplingtively. Their mean is designated as m, and the differ-
1
numbers (reflecting the sequence order effect) in ence between x(t) andm is x(t)-m =h . If h is not an
1111
order to improve the prediction quality. Feng [12] pro-IMF, h is treated as the data and undergoes the pro-
1
posed a new representation of unified attribute vector,cedure above, thenh-m=h . Repeat this sifting
111 11
that each protein can be represented by a vector,
procedure k times until h is an IMF, that ish-
which is 20-D vector in Hilbert space with unified1k1(k-1)
length. Hence, all of proteins have their representa-m=h, thus the first IMF component is obtained,
1k1k
tive points on the surface of the 20-D globe. The rep-i.e. . Then separate IMF from the original time series
1
resentative points of the proteins in the same familyby x(t)-IMF=r. Treatr as the new data and subject
111
or with the higher sequence identity are closer on theit to the same sifting process above. Repeat this pro-
surface. The overall predictive accuracy could be cedure on all the subsequent r , i.e. r-IMF =r,r-
improved from 3% to 5% for different databases [12] j1222
with this simply modification of the usage of the IMF =r ,,r-IMF=r .
33 n-1nn
amino acid composition. Recently, a series of new So the result is:
powerful approaches have been developed by Chou
and his co-workers [13]. Encouraged by the great suc-
cesses of the previous invertigators in the area, here
we would like to use a different strategy, the support
vector machines, to approach this very important but
also very difficult problem in the hope that our 2.2. Hilbert transform
approach can play a complementary role to the exist-Having obtained the intrinsic mode function compo-
ing methods.nents IMF (denoted as c), one will have no difficulty
ii
in applying the Hilbert transform to each IMF compo-
2. HILBERT HUANG TRANSFORMnent,
The HHT consists of two parts: empirical mode
decomposition (EMD) and Hilbert spectral analysis
(HSA). This method is potentially viable for nonlin-
ear and nonstationary data analysis, especially for
time-frequency-energy representations. It has beenin which the PV indicates the principal value of the
tested and validated exhaustively, but only empiri-singular integral. With the hilbert transform, the ana-
cally. In all the cases studied, the HHT gave results lytic signal is defined as
much sharper than those from any of the traditional
analysis methods in time-frequency-energy represen-
tations.Additionally, the HHT revealed true physicalHere, a(t)is the instantaneous amplitude, and(t)
meanings in many of the data examined. Powerful asii
it is, the method is entirely empirical. In order to is the phase function,
make the method more robust and rigorous, many out-
standing mathematical problems related to the HHT
method need to be resolved. In this section, a brief
introduction to the methodology of the HHT will be
given. Readers interested in the complete details
should consult [14].
and the instantaneous frequency is simply
2.1. The empirical mode decomposition
method (the sifting process)
In this method any time series, including non-linear
and non-stationary series, can be decomposed into a With the Hilbert Spectrum defined, we can also
finite number of intrinsic mode functions (IMFs) definethemarginal spectrum h(w) as
through empirical mode decomposition (EMD) pro-
cess.An IMF is a function which must follow two con-
ditions: (1) the difference between the numbers of
extrema and zero-crossings is of 1 ; and (2) the The marginal spectrum offers a measure of the
mean of the upper envelop (linked by local maxima)total amplitude (or energy) contribution from each
SciRes JBiSE Copyright ©2008
F. Shi et al./J. Biomedical Science and Engineering 1 (2008) 59-63
60
nonstationary processes: it is based on an adaptive
basis; the frequency is derived by differentiation
rather than convolution; therefore, it is not limited by
the uncertainty principle; it is applicable to nonlinear
and nonstationary data and presents the results in
time-frequency-energy space for feature extraction.
Support Vector Machine (SVM) is one type of
learning machines based on statistical learning the-
ory. A complete description to the theory of SVMs for
pattern recognition is in Vapnik's book.[15]. SVMs
have been used in a range of bioinformatics problems
including protein fold recognition [16]; proteinprotein
interactions prediction [17]; prediction of protein
subcellular location [17, 18], protein secondary
structure prediction,T-cell epitopes prediction, Clas-
sification of protein quaternary structure [19].
In this paper, we apply Vapnik's support vector
machine for predicting the types of apoptosis proteins.
We have used the OSU_SVM, a Matlab SVM toolbox
(http://www.ece.osu.edu/~maj/osu_svm), which is an
frequency value. This spectrum represents the accu-implementation of SVM for the problem of pattern rec-
mulated amplitude over the entire data span in aognition.
probabilistic sense.
The combination of the empirical mode decompo-
sition and the Hilbert spectral analysis is also known 3. TRAINING AND PREDICTION
as theHilbert-Huang transform (HHT) for short.According to their subcellular location [12],
Empirically, all tests indicate that HHT is a superiorapoptosis proteins are classified into the following
tool for time-frequency analysis of nonlinear and four types: (1) type I: Cytoplasmic protein; (2) type II:
nonstationary data. It is based on an adaptive basis,Plasma membrane-bound protein; (3) type : Mito-
and the frequency is defined through the Hilbertchondrial inner and outer proteins; (4) type : Other
transform. Consequently, there is no need for the spu-proteins (see).
rious harmonics to represent nonlinear waveformIn this research, we first translate every aminoacid
deformations as in any of the priori basis methods, sequences into a numerical sequencef by hydrophobicity
and there is no uncertainty principle limitation onindex, then, decompose it into a finite number of
time or frequency resolution from the convolution intrinsic mode functions (IMFs) through empirical
pairs based also on a priori basis.mode decomposition (EMD) process, we just select
A comparative summary of Fourier, wavelet andthe 2nd to 4th components (IMF2, IMF3, IMF4),
HHT analyses is given in the:because first IMF just reflects the rand composition
This table shows that the HHT is indeed a powerfuland the last is just the trendences composition of the
method for analyzing data from nonlinear and numerical sequence f. Then applying the Hilbert
Table 2
Table1
SciRes JBiSE Copyright ©2008
61
F. Shi et al./J. Biomedical Science and Engineering 1 (2008) 59-63
Type I
NP_033941, NP_033940, NP_033939,
NP_031637, NP_031570, NP_031563,
NP_031490, NP_033447, , NP_036246,
NP_001218, NP_004041, NP_065209,
NP_001151, NP_071610, NP_071567,
NP_066961, NP_037054, NP_036894,
NP_005649, NP_004392, NP_004315,
NP_001187, NP_001159, NP_001157,
NP_001156, P55212, P42574, P39429,
P55867, P22366, P55866, P55214,
P55269, P29466, P55865, P29452,
Q02357, O54786, Q60989, Q62210,
Q60431, O70201, XP_013050,
Type
proteins
Type II
NP_037223, NP_037275,
NP_032013, NP_032612,
NP_037315, NP_005916,
NP_005579, NP_000034,
NP_001056, NP_003781,
NP_002498, NP_036742,
NP_031553, NP_031549,
P50555, P25118, P18519,
P51867, O19131, Q63199,
O77736, , O02703, Q13014,
Q63690, Q07820, Q91828,
Q91827, Q07812, P28825,
NP_001179
Type III
P10417, P53563,
Q07816, P49950,
Q07817, O95831,
Q9OX1, Q9JM53,
Q9VQ79, O77737,
Q00709,
XP_008738,
NP_033873,
Type IV
Q63369,
Q90660,
Q00653,
Q04861,
P19838,
NP_032715,
P98150,
Q15121,
Q62048,
NP_033872,
NP_004040,
NP_005736
a.Derived from SWISS-PROT data bank.
b.Of the 12 other apoptosis proteins, five are located in nucleus, two in endoplasmic reticulum, one in microtubule, and one in lysosome [7].
Table 2.List of the acession numbers for the 98 apoptosis proteins classified into four categories according to their
subcellular locations. (Type I: 43 Cytoplasmic proteins; Type II: 30 Plasma membrane-bound proteins; Type III: Mitochondrial
inner and outer proteins ; Type IV: 12 Other proteins).
Basis
Frequency
Presentation
Nonlinear
Nonstationary
Feature
Extraction
Theoretical
base
Wavelet
a priori
convolution:
regional
uncertainty
energy-time-
frequency
no
yes
discrete: no
continuous:
yes
theory
complete
Fourier
A priori
Convolution:
global
Uncertainty
energy
-frequency
no
no
no
theory
complete
Hilbert
adaptive
differentiation
local,
certainty
energy-time-
frequency
yes
yes
yes
empirical
Table 1. Comparative summary of Fourier, Wavelet and HHT
analyses.
transform to each IMF component, we get the instan-When the re-substitution test was performed for the
taneous amplitudea(t),thengettheenergy valuecurrent study, the type of each apoptosis protein in a
idata set was in turn identified using the rule parame-
e= , (t=2, 3, 4). Next, get its energy ratio
iters derived from the same data set, the so-called
training data set. As shown in, the overall suc-
.Last every protein was represented as a
cess rate thus obtained for the 98 apoptosis proteins
point or a vector in a 23-D space. The first 20 compo-in was 100%, indicating an excellent self-
nents of its vector were supposed to be the occur-consistency.
rence frequencies of the 20 amino acids in the protein However, during the process of the re-substitution
concerned, the last three components were its energytest, the rule parameters derived from the training
ratio times a weight, there, we set the weight is 0.2. data set include the information of the query protein
The computations were carried out on a PC. Alsolater plugged back in the test. This will certainly
for the SVM, the width of the Gaussian RBFs is underestimate the error and enhance the success rate
selected as that which minimized an estimate of the because the same proteins are used to derive the rule
VC-dimension.After being trained, the hyper-plane parameters and to test themselves. Nevertheless, the
output by the SVM was obtained. The SVM method is re-substitution test is absolutely necessary because it
applied to two-class problems. In this paper, for the reflects the self-consistency of a prediction method,
four-class problems, we have used a simple andespecially for its algorithm part. A prediction algo-
effective method:one-against-others method [16]rithm certainly cannot be deemed as a good one if its
to transfer it into two-class problems. We first test the self-consistency is poor. In other words, the re-
selfconsistency and leave-one-out cross-validationsubstitution test is necessary but not sufficient for
(jackknife test) of the method, followed by testingevaluating a prediction method. As a complement, a
the method by prediction of an independent dataset. cross-validation test for an independent testing data
As a result, the rates of self-consistency, cross-set is needed because it can reflect the effectiveness
validation of prediction were quite high.of a prediction method in practical application. This
In addition to the prediction algorithm, we also is important especially for checking the validity of a
need to construct a training data set to complete thetraining data set-whether it contains sufficient infor-
establishment of a statistical prediction method. Tomation to reflect all the important features concerned
realize this, based on the SWISS-PROT data bank, 98 so as to field a high success rate in application.
apoptosis proteins (the date were taken from Zhou [7])
were classified into the following four subcellular
locations: (1) cytoplasmic, (2) plasma membrane-4.2. Jackknife test
bound, (3) mitochondrial, and (4) other ().As is well known, the independent data set test, sub-
sampling test, and jackknife test are the three meth-
4RESULTS AND DISCUSSIONods often used for cross-validation in statistical pre-
By means of the SVM algorithm described in the lastdiction. Among these three, however, the jackknife
section, a statistical prediction was performed for the test is deemed as the most effective and objective one
98 apoptosis proteins listed in. The predic-for a comprehensive discussion about this). During
tion was conducted by two different approaches, the jackknifing, each protein in the data set is in turn sin-
re-substitution test and the jackknife test. The resultsgled out as a tested protein and all the rule parameters
aregiven in.are calculated based on the remaining proteins. In
other words, the subcellular location of each
apoptosis protein is identified by the rule parameters
4.1. Re-substitution testderived using all the other apoptosis proteins except
The so-called re-substitution test is an examination the one that is being identified. During the process of
for the self-consistency of a prediction method[7].
yi
Las ever
ti Table 3
Table 1
Table 1
Table 2
Table 3
Test method
Success Rate
Re-substitute
Jack-knife
covariant
SVM
HHT
covariant
SVM
HHT
Type
43/43=100%
42/43=97.70%
43/43=100%
42/43=97.7%
39/43=91.4%
41/43=95.3%
Type
30/30=100%
30/30=100%
30/30=100%
22/30=73.3%
28/30=93.3%
29/30=96.7%
Type
9/13=60.2%
13/13=100%
13/13=100%
4/13=30.8%
12/13=92.5%
12/13=96.7%
Type
7/12=58.3%
12/12=100%
12/12=100%
3/12=25.0%
9/12=75.0%
9/12=75.7
Overall
89/98=90.8%
97/98=99.0%
98/98=100%
71/98=72.5%
88/98=89.8%
91/98=92.9%
SciRes JBiSE Copyright ©2008
62F. Shi et al./J. Biomedical Science and Engineering 1 (2008) 59-63
Table 3.Tested results for the 98 apoptosis prtoeins in Table 2 by both Re-substitution test and Jackknife test.All use Gauss
RBF kernel function, while the value C =15, and the gama= 80.
[8]Chou, K C. A. novel approach to predicting protein structural
jackknifing, both the training data set and testingclasses in a (20-1)-D amino acid composition space.
data set are actually open, and a protein will in turn Proteins:Structure, Function and Genetics 1995, 21: 319-344.
move from one to the other.As expected, the success [9]Chou, K. C. Prediction of protein cellular attributes using
prediction rates by jackknife test were decreased in pseudo-amino-acid-composition. Proteins :Structure, Function,
and Genetics 2001, 43:246-255 (Erratum:ibid., 2001, vol. 44,
comparison with those by the re-substitution test. 60).
Such a decrement is particularly more remarkable for[10]Chou, J. J., Li, H., Salvesen G.S., Yuan, J. & G. Wagner. Solu-
small subsets. This is because the cluster-tolerant tion structure of BID, an intracellular amplifier of apoptotic
capacity for small subsets is usually low. And hence signaling. Cell 1999, 96:615-624.
[11]Cai,Y. D., Liu, X. J. & Chou, K. C. Artificial neural network
the information loss resulting from jackknifing will model for predicting membrane protein types. J. Biomol.
have a greater impact on the small subsets than theStruct. Dyn. 2001, 18:607-610.
large ones. Nevertheless, as shown in, the[12]Feng, Z. P. Prediction of the subcellular location of
overall jackknife rate for the data set of the 98prokaryotic proteins based on a new representation of the
amino acid composition. Biopolymers 2001, 58:491-499.
apoptosis proteins could still reach 93%. It is [13]Cai, Y. D. & Chou, C. Nearest neighbour algorithm for predict-
expected that the success rate for identifying theing protein subcellular location by combing functional
subcellular location of apoptosis proteins can be fur-domain composition and pseudo-amino acid composition.
ther enhanced by improving the training data of smallBiochem Biophys Res Comm. 2003, 305:407-411.
[14]Huang, N. E., Shen, Z., Long, S. R., Wu, M. L., H.H. Shih,
subsets by adding into them more new proteins thatZheng, Q., N.C. Yen, C.C. Tung & Liu, H. H. The empirical
have been found belonging to the subcellular locationmode decomposition and Hilbert spectrum for nonlinear and
defined by these subsets.nonstationary time series analysis. Proc. Roy. Soc. London A
1998, 454:903-995.
[15]Vapnik V. Statistical Learning Theory. WileyInterscience
5 CONCLUSIONS1998.
The above results, together with those obtained by[16]Ding, C. H. & I. Dubchak. Multiclass protein fold recognition
using support vector machines and neural networks.
the covariant discriminant prediction algorithm [7],Bioinformatics 2001, 17:349-358.
have indicated that the types of apoptosis proteins are [17]Cai, Y. D., Liu, X. J., Xu, X. B.& Chou, K. C. Support vector
predictable with a considerable accuracy. It is antici-machines for prediction of protein subcellular location by
pated that the HHT, and the SVM, if effectively com-incorporating quasisequenceorder effect. J. Cell. Biochem.
2002, 84:343-348.
plemented with each other, will become a powerful[18]Hua, S. J. & Sun, Z. R. Support vector machine approach for
tool for predicting the types of apoptosis proteins.protein subcellular localization prediction. Bioinformatics
The current study has further demonstrated that the2001, 17:721-728.
datasets originally constructed by Zhou[7] will be [19]Hua, S. J. & Sun, Z. R. A novel method of protein secondary
structure prediction with high segment overlap measure: sup-
very useful for the area of apoptosis study. It isport vector machine approach. J. Mol. Biol. 2001, 308:397-
expected that the prediction quality can be further 407.
improved if the current HHT can be properly com-
bined with pseudoamino acid composition[9] and
function domine composition and with other amino
acid properties.
ACKNOWLEDGMENTS
The authors thank Dr. Guo-Ping Zhou for providing the amino acid
sequences of the apoptosis proteins and some helpful discussions.
The work was partly supported by Huazhong Agricultural Univer-
sity, P. R. China
REFERENCE
[1]Zhou, P., Chou, J. J., Olea RS, Yuan, J. & G. Wagner. Solution
structure of Apaf-1 CARD and its interaction with caspase-9
CARD: a structural basis for specific adaptor/caspase interac-
tion. Proc Natl Acad Sci USA 1999, 96:11265-11270.
[2]Kerr J.F., Wyllie A. H. & A. R. Currie. Apoptosis: a basic bio-
logical phenomenon with wide-ranging implications in tissue
kinetics. Br J Cancer 1972, 26:239-257.
[3]Schulz J. B., Weller M. & M. A. Moskowitz. Caspases as treat-
ment targets in stroke and neurodegenerative diseases. Ann
Neurol 1999, 45:421-429.
[4]Barinaga M. Stroke-damaged neurons may commit cellular sui-
cide. Science 1998, 281:1302-1303.
[5]Chou, K. C. A new branch of proteomics: prediction of protein
cellular attributes. Gene Cloning and Expression Technologies
2002, 4:57-70,
[6]Huang, J. & Shi, F. Support vector machines for prodicting
apoptosis proteins types. Acta bioinformatics 2005, 53:39-47.
[7]Zhou, G. P. & Doctor. K. Subcelluar location of Apoptosis pro-
teins. Proteins:Structure, Function, and Genetic 2003, 50:44-48.
Table 2
SciRes JBiSE Copyright ©2008
63
F. Shi et al./J. Biomedical Science and Engineering 1 (2008) 59-63