Identification of Deleterious Single Amino Acid Polymorphism Using Sequence Information Based on Feature Selection and Parameter Optimization

doi:10.4236/eng.2013.510B097

Paper Menu >>

Journal Menu >>

Engineering, 2013, 5, 472-476

http://dx.doi.org/10.4236/eng.2013.510B097 Published Online October 2013 (http://www.scirp.org/journal/eng)

Identification of Deleterious Single Amino Acid

Polymorphism Using Sequence Information Based on

Feature Selection and Parameter Optimization

Xiao Chen, Qinke Peng, Jia Lv

Systems Engineering Institute of Electronic and Information Engineering School, Xi’an Jiaotong University, Xi’an, China

Email: xchen_xjtu@163.com, qkepeng@xjtu.edu.cn, lvjia2008@yahoo.cn

Received 2013

Abstract

Most of the human genetic variations are single nucleotide polymorphisms (SNPs), and among them, non-synonymous

SNPs, also known as SAPs, attract extensive interest. SAPs can b e neural or disease associated. Many studies have been

done to distinguish deleterious SAPs from neutral ones. Since many previous studies were based on both structural and

sequence features of the SAP, these methods are not applicable when protein structures are not available. In the current

paper, we developed a method based on UMDA and SVM using protein sequence information to predict SAP’s disease

association. We extracted a set of features that are independent of protein structure for each SAP. Then a SVM-based

machine-learning classifier that used grid search to tune parameters was applied to predict the possible disease asso cia-

tion of SAPs. The SVM method reaches good prediction accuracy. Since the input data of SVM contain irrelevant and

noisy features and parameters of SVM also affect the prediction performance, we introduced UMDA-based wrapper

approach to search for the ‘best’ solution. The UMDA-based method greatly improved prediction performance. Com-

pared with current method, our method achieved better performance.

Keywords: Single Amino Acid Polymorphisms; Support Vector Machine; Univariate Marginal Distribution Algorithm

1. Introduction

With the completion of the human genome project, more

and more single nucleotide polymorphisms (SNPs) are

collected. It’s estimated that around 90% of human ge-

netic variations are SNPs [1], among them, the single

amino acid polymorphisms (SAPs), also known as non-

synonymous SNPs or nsSNPs, cause amino acid substi-

tutions in the protein product, and they have the potential

to affect protein structure and function. Some of the

SAPs won’t make any change in phenotype, so we con-

sider them neutral, while others account for many human

genetic diseases, so we consider them deleterious [2]. It’s

important to discriminate deleterious nsSNPs from neu-

tral ones for the study of human disease.

So far, many methods have been developed to predict

possible disease association of SAPs. For example, em-

pirical rules [3,4], probabilistic models [5], and machine

learning techniques [6-12] are used to classify SAPs.

These studies use various features to distinguish delete-

rious nsSNPs from neutral ones. Some of the methods

use features derived from protein sequences [4,6-8] while

others use both sequential and structural features [9-12].

An up-to-date study predicted deleterious SAPs by add-

ing network features based on sequential features and

structural features [13]. However, a limitation of the me-

thods using structural features is that they are applicable

only when protein structures are known, while the major-

ity of proteins don’t have available structur al information.

Therefore, it’s significant to predict deleterious SAPs

with high accuracy only using sequential information.

Some previous methods can predict the disease associa-

tion of SAPs using only sequence information. Maybe

because some sequence information hasn’t been explored

or the features haven’t been optimized by the previous

studies, their prediction accuracy is not high enough to

omit further research.

In this study, we extracted a set of 130 features, and

these features are independent of protein structure. Each

SAP was encoded by 130 features, and then a SVM-

based machine-learning classifier is applied to predict the

possible disease association of SAPs. It’s not clear which

features are relevant with the prediction, so feature selec-

tion is often used to improve the prediction accuracy. On

the other hand, the parameters of the classifier have an

*This work was jointly supp orted by National Natural Science Found a-

tion of China (Grant No. 60774086 and No. 61173111) and Ph.

Programs

Foundation of Ministry of Education of China (Grant No.

20090201110027).

X. CHEN ET AL.

473

important effect on the classification performance. Thus,

we applied a UMDA-based method which can select

features and optimize parameters simultaneously. Using

the selected features and parameters, SVM classifier

achieves better prediction performance.

2. Methods

2.1. Feature Construction

2.1.1. Conservation Attributes

The degree to which a residue is conserved is very im-

portant for the classification of SAPs. We measured the

residue conservation by using PSI-BLAST [14]. PSI-

BLAST can output Position Specific Scoring Matrix

(PSSM), which is an L*20 matrix and L in this study is

the length of input sequence. First, we derived three

attributes from PSSM, that is, the observed frequencies

of original residue and substitution residue in SAP, and

their difference. Then we also obtained conservation

score of the SAP’s neighbors, from the output of PSI-

BLAST directly. Saunders’ study showed that the num-

ber of homologous sequence in alignment can reflect the

reliability of PSSM, so we added this value as an attri-

bute.

2.1.2. Motif Attributes

Motifs are some conservative segments in the protein

sequence, and they contain lots of biological information.

In this study, we used MEME [15] to extract motifs from

database. Finally, we got 27 motifs from deleterious SAP

dataset and 20 motifs from neural SAP dataset by setting

different parameters. Then for each SAP’s neighbor se-

quence, if one motif appeared in it, we set the attribute

to1. Otherwise, set the attribute to 0. In this way, we ob-

tained 47 attributes.

2.1.3. Physicochemical Properties of Amino Acid

If the difference between the physicochemical properties

of the original amino acid and the variation amino acid is

big, the SAP is likely to cause changes of protein’s func-

tion. If the difference is small, the SAP may be compati-

ble by the body and won’t be deleterious. We considered

four kinds of properties of amino acid that were molecu-

lar weight, pI value, hydrophobicity value, and conserva-

tive value. For each SAP, property values of the original

amino acid and the variation amino acid as well as their

differences of corresponding property value between the

two amino acids were calculated.

2.1.4. Amino Acid Submissi on R ates

Matthew’s report pointed that amino acid submission

rates in the deleterious SAP dataset and in the neural

SAP dataset are different. For the deleterious SAP data-

set, submission rates between physicochemically similar

amino acids are lower, and that between highly physico-

chemically different amino acids are higher. However,

amino acid submission rates for the neural SAP dataset

come to the contrary. Thus, for each SAP, the difference

between the logarithms of submission rates in the delete-

rious set and the neural set was calculated as an attribute:

=log log

ij ij

SubDif pp−

(1)

where

and

are the submission rates of amino

acid

and

in the deleterious SAP dataset and the

neural SAP dataset respectively.

2.1.5. Position Attributes

We supposed that the position where SAP is in the pro-

tein sequence has an effect on the disease association of

the SAP. Based on the above guess, we divided the pro-

tein sequence into five same areas according to the posi-

tion. These fiv e areas were n amed by

( )

1, 2,5

Areai =

Here a 5-dimension vector,

12 5

, ,,pos pospos

, was

used as features. If the SAP is located in the

Area

pos

was set to 1. Otherwise, set

pos

to -1:

S AP

Area

pos Area

∈



=∉

−



(2)

2.1.6. Stability of Protein

The physicochemical properties stability of a protein

sequence may be associated with the function change

[16]. Thus, we calculated the average property values of

SAP’s neighbor sites, including four amino acids in the

downstream and four in the upstream. The difference

between the property value of original amino acid and

the average value, as well as the difference between the

property value of variation amino acid and the average

value were also calculated.

( )8

prop i i

Avgprop prop

−

= =−

= +

∑∑

(3 )

prop prop

Odiff AvgO= −

(4)

prop prop

Vdiff AvgV= −

where

{

}

propmolecular weight, pI v alue ,

hydrophobicity value, c onserv ative value

∈

prop

represents the property value of original amino

acid and

prop

is that of variation amino acid.

prop

the property of the

ith

site around the SAP. prop

Avg

the average property value of SAP’s neighbor sites.

2.1.7. Sequence Features Used in Previous Study

Hu et al. [8] calculated 686 features which were derived

from sequence information for each SAP, and selected 10

features at last. Among the 10 selected features, is HLA

X. CHEN ET AL.

474

indicates whether the protein in which the SAP located

belongs to the HLA family, METAL and MOD_RES

shows whether the SAP is close to functional sites,

nor_diff_freq is the normalized frequency difference be-

tween original residue and substitution residue, and the

other features are defined based on entries from AAindex

that is a public database of amino acid properties.

2.2. SVM Classifier and Performance Measure

The classifier we used in our study is Support Vector

Machine (SVM), which separates transformed data with

a hyper plane in a high-dimensional space. SVM has

been widely used in classification in bioinformatics. We

adopted the LIBSVM that used the radial basis function

(RBF) as the kernel function[17].

We used grid-search method to search the best values

of the penalty coefficient C and the kernel parameter

C was set to

76 8

2 ,2 ,,2

−−



, and so does

. We tried all

the combinations of C and

, then selected the pair that

got the bes t 5-fold cross-validation accuracy.

Let “Disease” be the positive class and “Polymor-

phism” be the negative class. The overall accuracy is de-

fined as below.

TP TN

ACC TP TNFPFN

=+++

(6)

where TP is the number of true positive; TN is the num-

ber of true negative; FP is the number of false positive

and FN is the number of false negative. We also calcu-

lated the Matthew’s correlation coefficient (MCC),

which is more realistic than ACC on an unbalanced da-

taset [18].

( )()()( )

MCC

TP TNFPFN

TP FN TP FPTNFN TNFP

×−×

+++ +

(7)

The higher the ACC and MCC are, the better the clas-

sifier’s performance is.

2.3. UMDA Method

Estimation Distribution Algorithms (EDAs) [19] are ev o-

lutionary searching strateg ies without crossing and muta-

tion operators. In EDAs, the new population is sampled

from a probability distribution which is estimated from

the fittest individuals. Univariate marginal distribution

algorithm (UMDA) is a type of EDAs, and UMDA as-

sumes that the variables in the problem were independent.

In our study, UMDA was used to search for the “best”

solution, including a feature subset and a set of SVM

parameters. In the UMDA-based method, the binary

coding chromosome representation is shown as F igure

1. In Figure 1, 1

denotes the feature mask, the

v



Figure 1. The binary coding chromosome.

bit with value “1” indicates the feature is selected and “0”

indicates not.

denote the value of pa-

rameter C and

, respectively.

, c

n and

are

the number of bits indicating the features, parameters C

and

. c

n and

are chosen according to the calcu-

lation precision required. In our study,

130

and

15n

The procedure of UMDA-based method is given as

follows:

S1: Generate M individuals randomly.

S2: Evaluate the fitn ess value (ACC) of each ind ividu-

al.

S3: Sort the individuals according to their fitness val-

ues from high to low.

S4: Select

(

NM≤

) individuals with higher fit-

ness value from population.

S5: Estimate the probability distribution of the

tth

ite-

ration

()

by the selected individuals.

( )

( |)

nSel

NSel

nji it

pxp xpop

Xx pop

−

∏

∑

∏

(8)

where

is the number of variables, and

( )

Sel

ji il

Xx popothers

−



== 



(9)

S6: Retain

fittest individuals and sample

MN−

in-

dividual s f r om

()

to form t he next generati on.

S7: If the termination criterion is satisfied, stop. Oth-

erwise, go to S2. The maximum number of iterations is

the termination criterion.

3. Experiment and Result

In this study, we acquired SAP data from Swiss-Prot

knowledgebase for the prediction of SAPs [20]. SAPs in

the Swiss-Prot are classified into three categories, “Dis-

ease”, “Polymorphism”, and “Unclassified”. SAP with

disease association is annotated with “Disease” (same as

“deleterious”); SAP with no reported disease association

is annotated with “Polymorphism” (same as “neutral”);

and SAP whose disease association is unclear is anno-

tated with “Unclassified”. We deleted “Unclassified”

SAPs and focused on “Disease” and “Polymorphism”

ones. After that, the dataset consisted of 19510 disease

SAPs and 33701 polymorphism SAPs. We extracted

5000 SAPs randomly according to the proportion of dis-

ease and polymorphism ones for the prediction of SAPs.

In other words, the final dataset contained 1833 disease

X. CHEN ET AL.

475

SAPs and 3167 neutral SAPs. Hu’s study [8] and Ye’s

study [11] used the data selected from Swiss-Prot, the

structural information of which was available. By con-

trast, we use the data selected randomly from the data-

base were more objective and more reliable.

We compared our method with Hu’s method that pre-

dicted SAP’s disease association using sequence-derived

information. Hu et al. used a greedy approach to select

features useful for the classification of SAPs and 10 fea-

tures were selected. Using the 10 features, a decision tree

method can achieve a high accuracy.

First, we applied Hu’s method on our dataset using the

10 features and decision tree method. In the recent study,

we added 120 features to the previous 10 features, so the

new feature set was obtained. Next, we used a decision

tree which was used in Hu’s method to identify delete-

rious SAPs with the new feature set, and this method was

called NF_DecisionTree. Then the decision tree was re-

placed with SVM that used grid search to tune parame-

ters, and this method was called NF_GridSVM. At last,

we applied the UMDA based method to optimize the

prediction performance, and this method was called

NF_UMDA. Table 1 compared the results of above me-

thods.

On the same dataset, Hu’s method achieved 68.06%

accuracy and 0.2783MCC, while NF_DecisionTree

achieve 73.9% accuracy and 0.4495 MCC. So we can see

that using the same classifier, the added features in the

recent study contributed to the improvement in prediction

performance.

As was shown in Table 1, NF_GridSVM achieved

much higher accuracy and MCC than NF_DecisionTree.

The reason may be that, compared with decision tree,

RBF kernel SVM can achieve good generalization and

thus is suitable for the data which may contain irrelevant

features.

Moreover, it’s observed that NF_UMDA can achieve

considerably better prediction performance than NF_

GridSVM. This can be explained by the fact that the

UMDA-based method is able to select features that are

more relevant and classifier parameters that are more ap-

propriate, and then obtain a better classification.

Hu et al. also compared their method with SIFT [4]

that was a popular sequence-based method to predict

whether a SAP is deleterious or neutral. SIFT achieved

Table 1. Predic ti on performance of various methods.

Method TP FN FP TN ACC MCC

Hu’s method 790 1043 554 2613 68.06% 0.2783

NF_DecisionTree 982 851 664 2503 69.70% 0.3344

NF_GridSVM 1263 2432 735 570 73.90% 0.4495

NF_UMDA 1323 2475 692 510 75.96% 0.4944

only 0.33 MCC using all SAPs from the Swiss-Prot da-

tabase, while our method achieved an increase of 0.164

in MCC over SIFT.

4. Conclusion

We explored sequence features deeply and introduced

UMDA into the current study to search the solution that

includes a feature subset and a set of SVM parameters

maximizing the prediction performance on our dataset.

The experimental result showed that the UMDA-based

search strategy considerably improved the prediction per-

formance. Our method used only information derived

from protein sequence, so the method can be applied to

predict all the SAPs. The performance of the proposed

method is higher than Hu’s method and SIFT. Further

research will introduce more features and try other evo-

lutionary algorithms to improve the prediction accuracy

of SAP’s disease association.

References

[1] F. S. Collins, L. D. Brooks and A. Chakravarti, “A DNA

Polymorphism Discovery Resource for Research on Hu-

man Gene tic Variati on,” Genome Research, Vol. 8, 1998,

pp. 1229-1231.

[2] P. C. Ng and S. Henikoff, “Accounting for Human Poly-

morphisms Predicted to Affect Protein Function,” Ge-

nome Research, Vol. 12, 2002, pp. 436-446.

http://dx.doi.org/10.1101/gr.212802

[3] S. Herrgard, S. A. Cammer, B. T. Hoffman, S. Knutson,

M. Gallina, J. A. Speir, J. S. Fetrow and S. M. Baxter,

“Prediction of Deleterious Functional Effects of Amino

Acid Mutations Using a Library of Structure-Based Func-

tion Descriptors,” Proteins-Structure Function and Ge-

netics, Vol. 53, 2003, pp. 806-816.

http://dx.doi.org/10.1002/prot.10458

[4] P. C. Ng and S. Henikoff, “Predicting Deleterious Amino

Acid Substitutions,” Genome Research, Vol. 11, 2001, pp.

863-874. http://dx.doi.org/10.1101/gr.176601

[5] D. Chasman and R. M. Adams, “Predicting the Function-

al Consequences of Non-Synonymous Single Nucleotide

Polymorphisms: Structure-Based Assessment of Amino

Acid Variation,” Journal of Molecular Biology, Vol. 307,

2001, pp. 683-706.

http://dx.doi.org/10.1006/jmbi.2001.4510

[6] Y. Bromberg and B. Rost, “SNAP: Predict Effect of Non-

Synonymous Polymorphisms on Function,” Nucleic Ac-

ids Research, Vol. 35, 2007, pp. 3823-3835.

http://dx.doi.org/10.1093/nar/gkm238

[7] E. Capriotti, R. Calabrese and R. Casadio, “Predicting the

Insurgence of Human Genetic Diseases Associated to

Single Point Protein Mutations with Support Vector Ma-

chines and Evolutionary Information,” Bioinformatics,

Vol. 22, 2006, pp. 2729-2734.

http://dx.doi.org/10.1093/bioinformatics/btl423

[8] J. Hu and C. Yan, “Identification of Deleterious Non-

X. CHEN ET AL.

476

Synonymous Single Nucleotide Polymorphisms Using

Sequence-Derived Information,” BMC Bioinformatics,

Vol. 9, 2008. http://dx.doi.org/10.1186/1471-2105-9-297

[9] L. Bao and Y. Cui, “Prediction of the Phenotypic Effects

of Non-Synonymous Single Nucleotide Polymorphisms

Using Structural and Evolutionary Information,” Bioin-

formatics, Vol. 21, 2005, pp. 2185-2190.

http://dx.doi.org/10.1093/bioinformatics/bti365

[10] V. G. Krishnan and D. R. Westhead, “A Comparative

Study of Machine-Learning Methods to Predict the Ef-

fects of Single Nucleotide Polymorphisms on Protein

Function,” Bioinformatics, Vol. 19, 2003, pp. 2199-2209.

http://dx.doi.org/10.1093/bioinformatics/btg297

[11] Z.-Q. Ye, S.-Q. Zhao, G. Gao, X.-Q. Liu, R. E. Langlois,

H. Lu and L. Wei, “Finding New Structural and Sequence

Attributes to Predict Possi ble Disease Association of Sin-

gle Amino Acid Lpolymorphism (SAP),” Bioinformatics,

Vol. 23, 2007, pp. 1444-1450.

http://dx.doi.org/10.1093/bioinformatics/btm119

[12] P. Yue, Z. L. Li and J. Moult, “Loss of Protein Structure

Stability as a Major Causative Factor in Monogenic Dis-

ease,” Journal of Molecular Biology, Vol. 353, 2005, pp.

459-473.

http://dx.doi.org/10.1016/j.jmb.2005.08.020

[13] T. Huang, P. Wang, Z.-Q. Ye, H. Xu, Z. He, K.-Y. Feng,

L. Hu, W. Cui, K. Wang, X. Dong, L. Xie, X. Kong, Y.-D.

Cai and Y. Li, “Prediction of Deleterious Non-Syno-

nymous SNPs Based on Protein Interaction Network and

Hybrid Properties,” PloS One, Vol. 5, 2010.

[14] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z.

Zhang, W. Miller and D. J. Lipman, “Gapped BLAST

and PSI-BLAST: A New Generation of Protein Database

Search Programs,” Nucleic Acids Research, Vol. 25, 1997,

pp. 3389-3402.

[15] T. L. Bailey, C. Elkan, S. D. D. o. C. S. University of

California, and Engineering, Fitting a Mixture Model by

Expectation Maximization to Discover Motifs in Bipoly-

mers, Citeseer, 1994.

[16] J. Cheng, A. Randall and P. Baldi, “Prediction of Protein

Stability Changes for Single-Site Mutations Using Sup-

port Vector Machines,” Proteins: Structure, Func t ion,

and Bioinformatics, Vol. 62, 2006, pp. 1125-1132.

http://dx.doi.org/10.1002/prot.20810

[17] C.-C. Chang and C.-J. Lin, “LIBSVM: A Library for Sup-

port Vector Machines,” ACM Transactions on Intelligent

Systems and Technology, Vol. 2, 2011, pp. 1-27.

http://dx.doi.org/10.1145/1961189.1961199

[18] P. Baldi, S. Brunak, Y. Chauvin, C. A. F. Andersen and H.

Nielsen, “Assessing the Accuracy of Prediction Algo-

rithms for Classification: An Overview,” Bioinformatics,

Vol. 16, 2000, pp. 412-424.

[19] P. Larranaga and J. A. Lozano, “Estimation of Distribu-

tion Algorithms: A New Tool for Evolutionary Computa-

tion,” Vol. 2, Springer, The Netherlands, 2002.

[20] Y. L. Yip, H. Scheib, A. V. Diemand, A. Gattiker, L. M.

Famiglietti, E. Gasteiger and A. Bairoch, “The Swiss-Prot

Variant Page and the ModSNP Database: A Resource for

Sequence and Structure Information on Human Protein

Variants,” Human Mutation, Vol. 23, 2004, pp. 464-470.