Hilbert Huang Transform for Predicting Proteins Subcellular Location

doi:10.4236/jbise.2008.11009


ABSTRACT

Apoptosis proteins have a central role in the development and homeostasis of an organism. These proteins are very important for understanding the mechanism of programmed cell death, and their function is related to their types. The apoptosis proteins are categorized into the following four types: (1) cytoplasmic proteins; (2) plasma membrane-bound proteins; (3) mitochondrial inner and outer proteins; (4) other proteins. A novel method, the Hilbert-Huang transform, is applied together with a support vector machine to predict the type of a given apoptosis protein. High success rates were obtained by the re-substitution test (98/98 = 100%) and the jackknife test (91/98 = 92.9%).

1. INTRODUCTION

Apoptosis, or programmed cell death, is a fundamental process controlling normal tissue homeostasis by regulating a balance between cell proliferation and death [1]. This process entails the autolytic degradation of cellular components, is characterized by blebbing of cell membranes, shrinkage of cell volumes, and condensation of nuclei [2], and is currently an area of intense investigation. Cell death and renewal are responsible for maintaining the proper turnover of cells, which ensures a constant controlled flux of fresh cells. Programmed cell death and cell proliferation are tightly coupled. When apoptosis malfunctions, a variety of formidable diseases can ensue: blocking apoptosis is associated with cancer and autoimmune disease, whereas unwanted apoptosis can possibly lead to ischemic damage or neurodegenerative disease [3]. Apoptosis is considered to have a key role in these devastating diseases and, in principle, provides many targets for therapeutic intervention [4]. To understand the apoptosis mechanism and the functions of various apoptosis proteins, it is helpful to obtain information about their subcellular location, because the subcellular location of apoptosis proteins is closely related to their function [5,6]. It is known that there are 732 archetypical proteins with “apoptosis” domains [7], and only 98 of these proteins are known to be apoptosis proteins (for more details, see http://www.apoptosis-db.org).

Scientists usually deal with a number of protein sequences already known to belong to apoptosis proteins. However, it is both time-consuming and costly to determine which specific subcellular location a given apoptosis protein belongs to. Confronted with such a situation, can we develop a fast and effective way to predict the subcellular location of a given apoptosis protein based on its amino acid sequence? Recently, Guo-ping Zhou [7] attempted to identify the subcellular location of apoptosis proteins according to their sequences by means of the covariant discriminant function, which was established on the basis of the Mahalanobis distance and Chou's invariance theorem [7,8,9]. The results were quite promising, indicating that the subcellular location of apoptosis proteins is predictable to a considerably accurate extent if a good vector representation of the protein can be established. It is expected that, with continuous improvement of vector representation methods by incorporating amino acid properties, and by using more powerful mathematical methods, some theoretical prediction method might eventually become a useful tool in this area, because the function of an apoptosis protein is closely related to its subcellular location. The present study was initiated in an attempt to address this problem.

Chou and Elrod carried out extensive research on predicting subcellular location, mainly based on the amino acid composition. Subsequently, in order to take into account the sequence-order effects and improve the prediction quality, Chou further incorporated the quasi-sequence-order effect [5] and introduced the concept of “pseudo-amino-acid composition” [9]. For example, Chou [10] classified membrane proteins into five different types and proposed

Keywords: Hilbert Huang transform; Support vector machine; Subcellular location prediction


Feng Shi, Qiu-Jian Chen & Na-na Li

School of Science, Huazhong Agricultural University, Wuhan, Hubei, China. Correspondence should be addressed to Feng Shi

(shifeng@mail.hzau.edu.cn).

J. Biomedical Science and Engineering (JBiSE), 2008, 1, 59-63. Scientific Research Publishing.

Published Online May 2008 in SciRes. http://www.srpublishing.org/journal/jbise

a covariant discriminant algorithm to predict the types of membrane proteins. Recently, Cai et al. [11] applied a neural network to this problem. To improve the prediction quality, Chou [5] proposed a new method in which the covariant discriminant algorithm was augmented to incorporate the quasi-sequence-order effect. This method uses the amino acid composition and the sequence-order-coupling numbers (reflecting the sequence-order effect) in order to improve the prediction quality. Feng [12] proposed a new representation, the unified attribute vector, in which each protein is represented by a 20-D vector in Hilbert space with unified length. Hence, all proteins have their representative points on the surface of the 20-D globe. The representative points of proteins in the same family, or with higher sequence identity, are closer on the surface. With this simple modification of the usage of the amino acid composition, the overall predictive accuracy could be improved by 3% to 5% for different databases [12]. Recently, a series of new powerful approaches have been developed by Chou and his co-workers [13]. Encouraged by the great successes of the previous investigators in the area, here we would like to use a different strategy, the support vector machine, to approach this very important but also very difficult problem, in the hope that our approach can play a complementary role to the existing methods.

2. HILBERT HUANG TRANSFORM

The HHT consists of two parts: empirical mode decomposition (EMD) and Hilbert spectral analysis (HSA). This method is potentially viable for nonlinear and nonstationary data analysis, especially for time-frequency-energy representations. It has been tested and validated exhaustively, but only empirically. In all the cases studied, the HHT gave results much sharper than those from any of the traditional analysis methods in time-frequency-energy representations. Additionally, the HHT revealed true physical meanings in many of the data examined. Powerful as it is, the method is entirely empirical. In order to make the method more robust and rigorous, many outstanding mathematical problems related to the HHT method need to be resolved. In this section, a brief introduction to the methodology of the HHT is given. Readers interested in the complete details should consult [14].

2.1. The empirical mode decomposition method (the sifting process)

In this method any time series, including non-linear and non-stationary series, can be decomposed into a finite number of intrinsic mode functions (IMFs) through the empirical mode decomposition (EMD) process. An IMF is a function which must satisfy two conditions: (1) the difference between the number of extrema and the number of zero-crossings is at most 1; and (2) the mean of the upper envelope (linked by local maxima) and the lower envelope (linked by local minima) is zero at every point.

The EMD process is as follows. According to the Hilbert-Huang transform (HHT) [14], once the extrema of a time series $x(t)$ are identified, all the local maxima and all the local minima are connected by two special lines as the upper and lower envelopes, respectively. Their mean is designated as $m_1$, and the difference between $x(t)$ and $m_1$ is

$$x(t) - m_1 = h_1.$$

If $h_1$ is not an IMF, $h_1$ is treated as the data and undergoes the procedure above, giving

$$h_1 - m_{11} = h_{11}.$$

Repeat this sifting procedure $k$ times until $h_{1k}$ is an IMF, that is,

$$h_{1(k-1)} - m_{1k} = h_{1k};$$

thus the first IMF component is obtained, i.e. $\mathrm{IMF}_1 = h_{1k}$. Then separate $\mathrm{IMF}_1$ from the original time series by

$$x(t) - \mathrm{IMF}_1 = r_1.$$

Treat $r_1$ as the new data and subject it to the same sifting process above. Repeat this procedure on all the subsequent $r_j$, i.e.

$$r_1 - \mathrm{IMF}_2 = r_2,\quad r_2 - \mathrm{IMF}_3 = r_3,\quad \ldots,\quad r_{n-1} - \mathrm{IMF}_n = r_n.$$

So the result is:

$$x(t) = \sum_{j=1}^{n} \mathrm{IMF}_j + r_n.$$

2.2. Hilbert transform

Having obtained the intrinsic mode function components $\mathrm{IMF}_i$ (denoted as $c_i$), one will have no difficulty in applying the Hilbert transform to each IMF component,

$$H[c_i(t)] = \frac{1}{\pi}\,\mathrm{PV}\int_{-\infty}^{\infty} \frac{c_i(\tau)}{t-\tau}\,d\tau,$$

in which PV indicates the principal value of the singular integral. With the Hilbert transform, the analytic signal is defined as

$$z_i(t) = c_i(t) + j\,H[c_i(t)] = a_i(t)\,e^{j\theta_i(t)}.$$

Here, $a_i(t)$ is the instantaneous amplitude and $\theta_i(t)$ is the phase function,

$$a_i(t) = \sqrt{c_i^2(t) + H^2[c_i(t)]},\qquad \theta_i(t) = \arctan\frac{H[c_i(t)]}{c_i(t)},$$

and the instantaneous frequency is simply

$$\omega_i(t) = \frac{d\theta_i(t)}{dt}.$$

With the Hilbert spectrum $H(\omega, t)$ defined, we can also define the marginal spectrum $h(\omega)$ as

$$h(\omega) = \int_0^T H(\omega, t)\,dt.$$

The marginal spectrum offers a measure of the total amplitude (or energy) contribution from each

F. Shi et al./J. Biomedical Science and Engineering 1 (2008) 59-63

frequency value. This spectrum represents the accumulated amplitude over the entire data span in a probabilistic sense.

The combination of the empirical mode decomposition and the Hilbert spectral analysis is also known as the “Hilbert-Huang transform” (HHT) for short. Empirically, all tests indicate that HHT is a superior tool for time-frequency analysis of nonlinear and nonstationary data. It is based on an adaptive basis, and the frequency is defined through the Hilbert transform. Consequently, there is no need for the spurious harmonics to represent nonlinear waveform deformations as in any of the a priori basis methods, and there is no uncertainty principle limitation on time or frequency resolution from the convolution pairs based also on a priori bases.

A comparative summary of Fourier, wavelet and HHT analyses is given in Table 1. This table shows that the HHT is indeed a powerful method for analyzing data from nonlinear and nonstationary processes: it is based on an adaptive basis; the frequency is derived by differentiation rather than convolution, and therefore it is not limited by the uncertainty principle; it is applicable to nonlinear and nonstationary data; and it presents the results in time-frequency-energy space for feature extraction.

The support vector machine (SVM) is one type of learning machine based on statistical learning theory. A complete description of the theory of SVMs for pattern recognition is given in Vapnik's book [15]. SVMs have been used in a range of bioinformatics problems, including protein fold recognition [16], protein-protein interaction prediction [17], prediction of protein subcellular location [17, 18], protein secondary structure prediction, T-cell epitope prediction, and classification of protein quaternary structure [19].

In this paper, we apply Vapnik's support vector machine to predict the types of apoptosis proteins. We have used OSU_SVM, a Matlab SVM toolbox (http://www.ece.osu.edu/~maj/osu_svm), which is an implementation of SVM for the problem of pattern recognition.

3. TRAINING AND PREDICTION

According to their subcellular location [12], apoptosis proteins are classified into the following four types: (1) type I: cytoplasmic proteins; (2) type II: plasma membrane-bound proteins; (3) type III: mitochondrial inner and outer proteins; (4) type IV: other proteins (see Table 2).

In this research, we first translate every amino acid sequence into a numerical sequence $f$ by the hydrophobicity index, then decompose it into a finite number of intrinsic mode functions (IMFs) through the empirical mode decomposition (EMD) process. We select only the 2nd to 4th components (IMF2, IMF3, IMF4), because the first IMF just reflects the random component and the last is just the trend component of the numerical sequence $f$. Then applying the Hilbert


Table 2. List of the accession numbers for the 98 apoptosis proteins classified into four categories according to their subcellular locations (Type I: 43 cytoplasmic proteins; Type II: 30 plasma membrane-bound proteins; Type III: 13 mitochondrial inner and outer proteins; Type IV: 12 other proteins).

Type I: NP_033941, NP_033940, NP_033939, NP_031637, NP_031570, NP_031563, NP_031490, NP_033447, NP_036246, NP_001218, NP_004041, NP_065209, NP_001151, NP_071610, NP_071567, NP_066961, NP_037054, NP_036894, NP_005649, NP_004392, NP_004315, NP_001187, NP_001159, NP_001157, NP_001156, P55212, P42574, P39429, P55867, P22366, P55866, P55214, P55269, P29466, P55865, P29452, Q02357, O54786, Q60989, Q62210, Q60431, O70201, XP_013050

Type II: NP_037223, NP_037275, NP_032013, NP_032612, NP_037315, NP_005916, NP_005579, NP_000034, NP_001056, NP_003781, NP_002498, NP_036742, NP_031553, NP_031549, P50555, P25118, P18519, P51867, O19131, Q63199, O77736, O02703, Q13014, Q63690, Q07820, Q91828, Q91827, Q07812, P28825, NP_001179

Type III: P10417, P53563, Q07816, P49950, Q07817, O95831, Q9OX1, Q9JM53, Q9VQ79, O77737, Q00709, XP_008738, NP_033873

Type IV: Q63369, Q90660, Q00653, Q04861, P19838, NP_032715, P98150, Q15121, Q62048, NP_033872, NP_004040, NP_005736

a. Derived from SWISS-PROT data bank.

b. Of the 12 other apoptosis proteins, five are located in nucleus, two in endoplasmic reticulum, one in microtubule, and one in lysosome [7].
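The sifting procedure of Section 2.1 can be illustrated with a minimal Python sketch. It is not the implementation used in this work: it assumes piecewise-linear envelopes pinned to the end points of the series instead of smooth envelopes, and a fixed number of sifting passes (`n_sift`) instead of an explicit IMF test. The function names `local_extrema`, `envelope`, `sift` and `emd` are illustrative only.

```python
import math


def local_extrema(x):
    """Return the index lists of local maxima and local minima of x."""
    maxima, minima = [], []
    for i in range(1, len(x) - 1):
        if x[i - 1] < x[i] >= x[i + 1]:
            maxima.append(i)
        elif x[i - 1] > x[i] <= x[i + 1]:
            minima.append(i)
    return maxima, minima


def envelope(x, knots):
    """Piecewise-linear envelope through x at the knots (assumption: not a spline)."""
    pts = sorted(set([0] + knots + [len(x) - 1]))
    env = [0.0] * len(x)
    for a, b in zip(pts, pts[1:]):
        for i in range(a, b + 1):
            w = (i - a) / (b - a)
            env[i] = (1 - w) * x[a] + w * x[b]
    return env


def sift(x, n_sift=10):
    """One IMF candidate: repeat h <- h - m, with m the mean of both envelopes."""
    h = list(x)
    for _ in range(n_sift):  # assumption: fixed pass count, no IMF stopping test
        maxima, minima = local_extrema(h)
        if not maxima or not minima:
            break
        upper = envelope(h, maxima)
        lower = envelope(h, minima)
        h = [hi - (u + l) / 2 for hi, u, l in zip(h, upper, lower)]
    return h


def emd(x, max_imfs=6):
    """Decompose x so that x = sum(IMFs) + residue."""
    imfs, r = [], list(x)
    for _ in range(max_imfs):
        maxima, minima = local_extrema(r)
        if len(maxima) + len(minima) < 3:  # too few extrema left to sift
            break
        imf = sift(r)
        imfs.append(imf)
        r = [ri - ci for ri, ci in zip(r, imf)]  # r_j = r_(j-1) - IMF_j
    return imfs, r


# Toy series: two superimposed oscillations.
x = [math.sin(0.3 * t) + 0.5 * math.sin(2.1 * t) for t in range(200)]
imfs, residue = emd(x)
reconstructed = [sum(c[i] for c in imfs) + residue[i] for i in range(len(x))]
```

Because each residue is obtained by subtracting the component just extracted, the components and the final residue add back to the input series, which gives a simple numerical check of the decomposition identity.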

Table 1. Comparative summary of Fourier, wavelet and HHT analyses.

| | Fourier | Wavelet | Hilbert |
|---|---|---|---|
| Basis | a priori | a priori | adaptive |
| Frequency | convolution: global, uncertainty | convolution: regional, uncertainty | differentiation: local, certainty |
| Presentation | energy-frequency | energy-time-frequency | energy-time-frequency |
| Nonlinear | no | no | yes |
| Nonstationary | no | yes | yes |
| Feature extraction | no | discrete: no; continuous: yes | yes |
| Theoretical base | theory complete | theory complete | empirical |
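To make the step from an IMF to a feature vector concrete, the following Python sketch computes the analytic signal of Section 2.2 with a discrete Fourier transform and assembles a feature vector in the way Section 3 describes. Several details are assumptions, because the text does not spell them out: the energy of a component is taken as the sum of $a_i(t)^2$ over the series, the energy ratio as each energy divided by their total, and the ordering of the 20 amino acids in `AMINO_ACIDS` is arbitrary. All function names are hypothetical.

```python
import cmath
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def analytic_signal(c):
    """Return z = c + j*H[c] via a plain O(n^2) DFT (fine for short series)."""
    n = len(c)
    spectrum = [sum(c[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    # Keep DC (and Nyquist for even n), double positive frequencies, drop negatives.
    gain = [0.0] * n
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        for k in range(1, n // 2):
            gain[k] = 2.0
    else:
        for k in range(1, (n + 1) // 2):
            gain[k] = 2.0
    return [sum(gain[k] * spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def energy_ratios(imfs):
    """Assumed energy e_i = sum_t a_i(t)^2, normalised so the ratios sum to 1."""
    energies = [sum(abs(z) ** 2 for z in analytic_signal(c)) for c in imfs]
    total = sum(energies)
    return [e / total for e in energies]


def feature_vector(sequence, imfs, weight=0.2):
    """20 amino acid frequencies followed by the weighted energy ratios."""
    composition = [sequence.count(a) / len(sequence) for a in AMINO_ACIDS]
    return composition + [weight * p for p in energy_ratios(imfs)]


# Toy check: a pure cosine has instantaneous amplitude a(t) = 1 everywhere.
n = 64
c = [math.cos(2 * math.pi * 4 * t / n) for t in range(n)]
amplitude = [abs(z) for z in analytic_signal(c)]

# Three stand-in components play the role of IMF2, IMF3 and IMF4.
toy_imfs = [c, [2 * v for v in c], c]
vec = feature_vector("ACCA", toy_imfs)
```

With three selected components the vector has 20 + 3 = 23 entries, matching the 23-D space used in this work; the weight of 0.2 is the value quoted in Section 3.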

transform to each IMF component, we get the instantaneous amplitude $a_i(t)$ and then the energy value $e_i$ ($i = 2, 3, 4$). Next, we compute the energy ratio of each component. Finally, every protein is represented as a point, or a vector, in a 23-D space. The first 20 components of the vector are the occurrence frequencies of the 20 amino acids in the protein concerned; the last three components are its energy ratios multiplied by a weight, which we set to 0.2.

The computations were carried out on a PC. For the SVM, the width of the Gaussian RBFs is selected as that which minimizes an estimate of the VC-dimension. After training, the hyperplane output by the SVM is obtained. The SVM method applies to two-class problems. In this paper, for the four-class problem, we have used a simple and effective method, the “one-against-others” method [16], to reduce it to two-class problems. We first test the self-consistency and the leave-one-out cross-validation (jackknife test) of the method, followed by testing the method by prediction of an independent dataset. As a result, the rates of self-consistency and cross-validation of prediction were quite high.

In addition to the prediction algorithm, we also need to construct a training data set to complete the establishment of a statistical prediction method. To this end, based on the SWISS-PROT data bank, 98 apoptosis proteins (the data were taken from Zhou [7]) were classified into the following four subcellular locations: (1) cytoplasmic, (2) plasma membrane-bound, (3) mitochondrial, and (4) other (Table 2).

4. RESULTS AND DISCUSSION

By means of the SVM algorithm described in the last section, a statistical prediction was performed for the 98 apoptosis proteins listed in Table 2. The prediction was conducted by two different approaches, the re-substitution test and the jackknife test. The results are given in Table 3.

4.1. Re-substitution test

The so-called re-substitution test is an examination of the self-consistency of a prediction method [7]. When the re-substitution test was performed for the current study, the type of each apoptosis protein in a data set was in turn identified using the rule parameters derived from the same data set, the so-called training data set. As shown in Table 3, the overall success rate thus obtained for the 98 apoptosis proteins in Table 2 was 100%, indicating an excellent self-consistency.

However, during the process of the re-substitution test, the rule parameters derived from the training data set include the information of the query protein later plugged back in the test. This will certainly underestimate the error and enhance the success rate, because the same proteins are used both to derive the rule parameters and to test them. Nevertheless, the re-substitution test is absolutely necessary because it reflects the self-consistency of a prediction method, especially for its algorithm part. A prediction algorithm certainly cannot be deemed a good one if its self-consistency is poor. In other words, the re-substitution test is necessary but not sufficient for evaluating a prediction method. As a complement, a cross-validation test for an independent testing data set is needed because it can reflect the effectiveness of a prediction method in practical application. This is important especially for checking the validity of a training data set: whether it contains sufficient information to reflect all the important features concerned so as to yield a high success rate in application.

4.2. Jackknife test

As is well known, the independent data set test, sub-sampling test, and jackknife test are the three methods often used for cross-validation in statistical prediction. Among these three, however, the jackknife test is deemed the most effective and objective one (see [...] for a comprehensive discussion about this). During jackknifing, each protein in the data set is in turn singled out as a tested protein and all the rule parameters are calculated based on the remaining proteins. In other words, the subcellular location of each apoptosis protein is identified by the rule parameters derived using all the other apoptosis proteins except the one that is being identified. During the process of


Table 3. Tested results for the 98 apoptosis proteins in Table 2 by both the re-substitution test and the jackknife test. All use the Gaussian RBF kernel function, with C = 15 and gamma = 80.

| Type | Re-substitution: covariant | Re-substitution: SVM | Re-substitution: HHT | Jackknife: covariant | Jackknife: SVM | Jackknife: HHT |
|---|---|---|---|---|---|---|
| Type I | 43/43 = 100% | 42/43 = 97.7% | 43/43 = 100% | 42/43 = 97.7% | 39/43 = 90.7% | 41/43 = 95.3% |
| Type II | 30/30 = 100% | 30/30 = 100% | 30/30 = 100% | 22/30 = 73.3% | 28/30 = 93.3% | 29/30 = 96.7% |
| Type III | 9/13 = 69.2% | 13/13 = 100% | 13/13 = 100% | 4/13 = 30.8% | 12/13 = 92.3% | 12/13 = 92.3% |
| Type IV | 7/12 = 58.3% | 12/12 = 100% | 12/12 = 100% | 3/12 = 25.0% | 9/12 = 75.0% | 9/12 = 75.0% |
| Overall | 89/98 = 90.8% | 97/98 = 99.0% | 98/98 = 100% | 71/98 = 72.4% | 88/98 = 89.8% | 91/98 = 92.9% |
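The two tests reported in Table 3 can be sketched generically in Python as follows. The classifier here is a 1-nearest-neighbour stand-in chosen to keep the example free of dependencies; it is not the SVM used in this work, and `train_1nn`, `predict_1nn` and the toy data are hypothetical. The last lines recompute the overall HHT jackknife rate from the per-type counts listed in Table 3.

```python
def resubstitution(samples, labels, train, predict):
    """Train on the full data set, then predict every sample of that same set."""
    model = train(samples, labels)
    hits = sum(predict(model, x) == y for x, y in zip(samples, labels))
    return hits, len(samples)


def jackknife(samples, labels, train, predict):
    """Leave-one-out: hold out each sample in turn, train on all the others."""
    hits = 0
    for i in range(len(samples)):
        rest_x = samples[:i] + samples[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        model = train(rest_x, rest_y)
        hits += predict(model, samples[i]) == labels[i]
    return hits, len(samples)


# Stand-in classifier (NOT the SVM of this work): 1-nearest-neighbour.
def train_1nn(xs, ys):
    return list(zip(xs, ys))


def predict_1nn(model, x):
    def squared_distance(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], x))
    return min(model, key=squared_distance)[1]


# Toy data: the class "IV" has a single member, i.e. a very small subset.
samples = [(0.0,), (0.1,), (1.0,), (1.1,), (5.0,)]
labels = ["I", "I", "II", "II", "IV"]
resub = resubstitution(samples, labels, train_1nn, predict_1nn)
jack = jackknife(samples, labels, train_1nn, predict_1nn)

# HHT jackknife counts per type as listed in Table 3: (correct, total).
hht_jackknife = {"I": (41, 43), "II": (29, 30), "III": (12, 13), "IV": (9, 12)}
correct = sum(c for c, _ in hht_jackknife.values())
total = sum(n for _, n in hht_jackknife.values())
overall_percent = round(100 * correct / total, 1)
```

In the toy data the single member of the smallest class is always misclassified once it is held out, which mirrors the observation that the jackknife penalizes small subsets most.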

jackknifing, both the training data set and the testing data set are actually open, and a protein will in turn move from one to the other. As expected, the success rates obtained by the jackknife test were lower than those obtained by the re-substitution test. Such a decrement is particularly remarkable for small subsets. This is because the cluster-tolerant capacity for small subsets is usually low, and hence the information loss resulting from jackknifing has a greater impact on the small subsets than on the large ones. Nevertheless, as shown in Table 3, the overall jackknife rate for the data set of the 98 apoptosis proteins could still reach 93%. It is expected that the success rate for identifying the subcellular location of apoptosis proteins can be further enhanced by improving the training data of small subsets, by adding to them new proteins that have been found to belong to the subcellular locations defined by these subsets.

5. CONCLUSIONS

The above results, together with those obtained by the covariant discriminant prediction algorithm [7], indicate that the types of apoptosis proteins are predictable with considerable accuracy. It is anticipated that the HHT and the SVM, if effectively complemented with each other, will become a powerful tool for predicting the types of apoptosis proteins. The current study has further demonstrated that the datasets originally constructed by Zhou [7] will be very useful for the area of apoptosis study. It is expected that the prediction quality can be further improved if the current HHT can be properly combined with pseudo-amino-acid composition [9], functional domain composition, and other amino acid properties.

ACKNOWLEDGMENTS

The authors thank Dr. Guo-Ping Zhou for providing the amino acid sequences of the apoptosis proteins and for some helpful discussions. The work was partly supported by Huazhong Agricultural University, P. R. China.

REFERENCES

[1] Zhou, P., Chou, J. J., Olea, R. S., Yuan, J. & Wagner, G. Solution structure of Apaf-1 CARD and its interaction with caspase-9 CARD: a structural basis for specific adaptor/caspase interaction. Proc Natl Acad Sci USA 1999, 96:11265-11270.

[2] Kerr, J. F., Wyllie, A. H. & Currie, A. R. Apoptosis: a basic biological phenomenon with wide-ranging implications in tissue kinetics. Br J Cancer 1972, 26:239-257.

[3] Schulz, J. B., Weller, M. & Moskowitz, M. A. Caspases as treatment targets in stroke and neurodegenerative diseases. Ann Neurol 1999, 45:421-429.

[4] Barinaga, M. Stroke-damaged neurons may commit cellular suicide. Science 1998, 281:1302-1303.

[5] Chou, K. C. A new branch of proteomics: prediction of protein cellular attributes. Gene Cloning and Expression Technologies 2002, 4:57-70.

[6] Huang, J. & Shi, F. Support vector machines for predicting apoptosis proteins types. Acta bioinformatics 2005, 53:39-47.

[7] Zhou, G. P. & Doctor, K. Subcellular location of apoptosis proteins. Proteins: Structure, Function, and Genetics 2003, 50:44-48.

[8] Chou, K. C. A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins: Structure, Function, and Genetics 1995, 21:319-344.

[9] Chou, K. C. Prediction of protein cellular attributes using pseudo-amino-acid composition. Proteins: Structure, Function, and Genetics 2001, 43:246-255 (Erratum: ibid., 2001, vol. 44, 60).

[10] Chou, J. J., Li, H., Salvesen, G. S., Yuan, J. & Wagner, G. Solution structure of BID, an intracellular amplifier of apoptotic signaling. Cell 1999, 96:615-624.

[11] Cai, Y. D., Liu, X. J. & Chou, K. C. Artificial neural network model for predicting membrane protein types. J. Biomol. Struct. Dyn. 2001, 18:607-610.

[12] Feng, Z. P. Prediction of the subcellular location of prokaryotic proteins based on a new representation of the amino acid composition. Biopolymers 2001, 58:491-499.

[13] Cai, Y. D. & Chou, K. C. Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition. Biochem Biophys Res Comm. 2003, 305:407-411.

[14] Huang, N. E., Shen, Z., Long, S. R., Wu, M. L., Shih, H. H., Zheng, Q., Yen, N. C., Tung, C. C. & Liu, H. H. The empirical mode decomposition and Hilbert spectrum for nonlinear and nonstationary time series analysis. Proc. Roy. Soc. London A 1998, 454:903-995.

[15] Vapnik, V. Statistical Learning Theory. Wiley-Interscience, 1998.

[16] Ding, C. H. & Dubchak, I. Multiclass protein fold recognition using support vector machines and neural networks. Bioinformatics 2001, 17:349-358.

[17] Cai, Y. D., Liu, X. J., Xu, X. B. & Chou, K. C. Support vector machines for prediction of protein subcellular location by incorporating quasi-sequence-order effect. J. Cell. Biochem. 2002, 84:343-348.

[18] Hua, S. J. & Sun, Z. R. Support vector machine approach for protein subcellular localization prediction. Bioinformatics 2001, 17:721-728.

[19] Hua, S. J. & Sun, Z. R. A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. J. Mol. Biol. 2001, 308:397-407.
