X. CHEN ET AL.
Copyright © 2013 SciRes. ENG
SAPs and 3167 neutral SAPs. Hu’s study [8] and Ye’s
study [11] used the data selected from Swiss-Prot, the
structural information of which was available. By con-
trast, we use the data selected randomly from the data-
base were more objective and more reliable.
We compared our method with Hu’s method that pre-
dicted SAP’s disease association using sequence-derived
information. Hu et al. used a greedy approach to select
features useful for the classification of SAPs and 10 fea-
tures were selected. Using the 10 features, a decision tree
method can achieve a high accuracy.
First, we applied Hu’s method on our dataset using the
10 features and decision tree method. In the recent study,
we added 120 features to the previous 10 features, so the
new feature set was obtained. Next, we used a decision
tree which was used in Hu’s method to identify delete-
rious SAPs with the new feature set, and this method was
called NF_DecisionTree. Then the decision tree was re-
placed with SVM that used grid search to tune parame-
ters, and this method was called NF_GridSVM. At last,
we applied the UMDA based method to optimize the
prediction performance, and this method was called
NF_UMDA. Table 1 compared the results of above me-
thods.
On the same dataset, Hu’s method achieved 68.06%
accuracy and 0.2783MCC, while NF_DecisionTree
achieve 73.9% accuracy and 0.4495 MCC. So we can see
that using the same classifier, the added features in the
recent study contributed to the improvement in prediction
performance.
As was shown in Table 1, NF_GridSVM achieved
much higher accuracy and MCC than NF_DecisionTree.
The reason may be that, compared with decision tree,
RBF kernel SVM can achieve good generalization and
thus is suitable for the data which may contain irrelevant
features.
Moreover, it’s observed that NF_UMDA can achieve
considerably better prediction performance than NF_
GridSVM. This can be explained by the fact that the
UMDA-based method is able to select features that are
more relevant and classifier parameters that are more ap-
propriate, and then obtain a better classification.
Hu et al. also compared their method with SIFT [4]
that was a popular sequence-based method to predict
whether a SAP is deleterious or neutral. SIFT achieved
Table 1. Predic ti on performance of various methods.
Method TP FN FP TN ACC MCC
Hu’s method 790 1043 554 2613 68.06% 0.2783
NF_DecisionTree 982 851 664 2503 69.70% 0.3344
NF_GridSVM 1263 2432 735 570 73.90% 0.4495
NF_UMDA 1323 2475 692 510 75.96% 0.4944
only 0.33 MCC using all SAPs from the Swiss-Prot da-
tabase, while our method achieved an increase of 0.164
in MCC over SIFT.
4. Conclusion
We explored sequence features deeply and introduced
UMDA into the current study to search the solution that
includes a feature subset and a set of SVM parameters
maximizing the prediction performance on our dataset.
The experimental result showed that the UMDA-based
search strategy considerably improved the prediction per-
formance. Our method used only information derived
from protein sequence, so the method can be applied to
predict all the SAPs. The performance of the proposed
method is higher than Hu’s method and SIFT. Further
research will introduce more features and try other evo-
lutionary algorithms to improve the prediction accuracy
of SAP’s disease association.
References
[1] F. S. Collins, L. D. Brooks and A. Chakravarti, “A DNA
Polymorphism Discovery Resource for Research on Hu-
man Gene tic Variati on,” Genome Research, Vol. 8, 1998,
pp. 1229-1231.
[2] P. C. Ng and S. Henikoff, “Accounting for Human Poly-
morphisms Predicted to Affect Protein Function,” Ge-
nome Research, Vol. 12, 2002, pp. 436-446.
http://dx.doi.org/10.1101/gr.212802
[3] S. Herrgard, S. A. Cammer, B. T. Hoffman, S. Knutson,
M. Gallina, J. A. Speir, J. S. Fetrow and S. M. Baxter,
“Prediction of Deleterious Functional Effects of Amino
Acid Mutations Using a Library of Structure-Based Func-
tion Descriptors,” Proteins-Structure Function and Ge-
netics, Vol. 53, 2003, pp. 806-816.
http://dx.doi.org/10.1002/prot.10458
[4] P. C. Ng and S. Henikoff, “Predicting Deleterious Amino
Acid Substitutions,” Genome Research, Vol. 11, 2001, pp.
863-874. http://dx.doi.org/10.1101/gr.176601
[5] D. Chasman and R. M. Adams, “Predicting the Function-
al Consequences of Non-Synonymous Single Nucleotide
Polymorphisms: Structure-Based Assessment of Amino
Acid Variation,” Journal of Molecular Biology, Vol. 307,
2001, pp. 683-706.
http://dx.doi.org/10.1006/jmbi.2001.4510
[6] Y. Bromberg and B. Rost, “SNAP: Predict Effect of Non-
Synonymous Polymorphisms on Function,” Nucleic Ac-
ids Research, Vol. 35, 2007, pp. 3823-3835.
http://dx.doi.org/10.1093/nar/gkm238
[7] E. Capriotti, R. Calabrese and R. Casadio, “Predicting the
Insurgence of Human Genetic Diseases Associated to
Single Point Protein Mutations with Support Vector Ma-
chines and Evolutionary Information,” Bioinformatics,
Vol. 22, 2006, pp. 2729-2734.
http://dx.doi.org/10.1093/bioinformatics/btl423
[8] J. Hu and C. Yan, “Identification of Deleterious Non-