Engineering, 2013, 5, 68-72 Published Online October 2013 (
Copyright © 2013 SciRes. ENG
Semantic Similarity over Gene Ontology for Multi-Label
Protein Subcellular Localization
Shibiao Wan1, Man-Wai Mak1, Sun-Yuan Kung2
1Department of Electronic and Information Engineering, The Hong Kong Polytechnic University,
Hung Hom, Hong Kong, China
2Department of Electrical Engineering, Princeton University, New Jersey, USA
Received November 2012
As one of the essential topics in proteomics and molecular biology, protein subcellular localization has been extensively
studied in previous decades. However, most existing methods are limited to the prediction of single-location proteins. In
many studies, multi-location proteins are either not considered or assumed not to exist. This paper proposes a novel
multi-label subcellular-localization predictor based on the semantic similarity between Gene Ontology (GO) terms.
Given a protein, the accession numbers of its homologs are obtained via BLAST search. Then, the homologous
accession numbers are used as keys to search against the Gene Ontology annotation database to obtain a set of
GO terms. The semantic similarity between GO terms is used to formulate semantic similarity vectors for classification.
A support vector machine (SVM) classifier with a new decision scheme is proposed to classify the multi-label GO
semantic similarity vectors. Experimental results show that the proposed multi-label predictor significantly outperforms
state-of-the-art predictors such as iLoc-Plant and Plant-mPLoc.
Keywords: Protein Subcellular Localization; Semantic Similarity; GO Terms; Multi-Label Classification
1. Introduction
In recent years, protein subcellular localization has gain-
ed tremendous attention due to its important roles in elu-
cidating protein functions, identifying drug targets, and
so on [1]. Computational methods are required to replace
time-consuming and laborious wet-lab methods for pre-
dicting the subcellular locations of proteins.
Conventional methods for subcellular-localization
prediction can be roughly divided into sequence-based
methods [2-6] and annotation-based methods [7-13]. It has
been demonstrated that methods based on Gene Ontology
are superior [10]. However, most of the existing methods
are limited to the prediction of single-location proteins.
These methods generally exclude the multi-label proteins
or are based on the assumption that multi-location pro-
teins do not exist. In fact, there exist multi-location pro-
teins that can simultaneously reside at, or move between,
two or more different subcellular locations. Recently,
several multi-label predictors have been proposed, in-
cluding Plant-mPLoc [14], Virus-mPLoc [15], iLoc-Plant
[16] and iLoc-Virus [17]. These predictors use the GO
information and have demonstrated superiority over other
methods. But these predictors only make use of the oc-
currences of the GO terms and do not exploit the seman-
tic relationships between GO terms.
Since the relationship between GO terms reflects the
association between different gene products, protein se-
quences annotated with GO terms can be compared on
the basis of semantic similarity measures. In fact,
semantic similarity over Gene Ontology has been
extensively studied and has been applied to many biological
problems, including protein function prediction [18],
subnuclear localization prediction [19], protein-protein
interaction inference [20] and microarray clustering [21].
The performance of these predictors depends on whether
the similarity measure is relevant to the biological prob-
lems. Over the years, a number of semantic similarity
measures have been proposed, some of which have been
used in natural language processing. For example, Resnik
[22] proposed the information content of terms in natural
language as a similarity measure. Later, Lord et al. [23]
introduced this idea into measuring the semantic
similarity of GO terms. Lin [24] proposed a method based
on information theory and structural information. More
recently, Pesquita et al. [25] reviewed the semantic
similarity measures applied to biomedical ontologies.
This paper proposes a novel predictor based on GO
semantic similarity for multi-label protein subcellular
localization prediction. The proposed predictor differs
from other predictors in that 1) it formulates the feature
vectors from the semantic similarity over Gene Ontology,
which contains richer information than the GO terms
alone; 2) it adopts a new strategy to incorporate richer
and more useful homologous information from more
distant homologs rather than using the top homologs only;
and 3) it adopts a new decision scheme for an SVM
classifier so that it can effectively deal with datasets containing
both single-label and multi-label proteins. Results on a
recent benchmark dataset demonstrate that these three
properties enable the proposed predictor to accurately
predict multi-location proteins and outperform three
state-of-the-art predictors.
2. Method
2.1. Retrieval of GO Terms
The proposed predictor can use either the accession
numbers (ACs) or the amino acid (AA) sequences of query
proteins as input. Specifically, for proteins with known
ACs, their respective GO terms are retrieved from the
Gene Ontology Annotation (GOA) database using the
ACs as the search keys. For proteins without ACs,
their AA sequences are presented to BLAST [26] to find
their homologs, whose ACs are then used as keys to
search against the GOA database.
While the GOA database allows us to associate the AC
of a protein with a set of GO terms, for some novel
proteins, neither their ACs nor the ACs of their top
homologs have any entries in the GOA database; in other
words, no GO terms can be retrieved by using their ACs
or the ACs of their top homologs. In such cases, the ACs
of the homologous proteins, as returned by the BLAST
search, are used successively to search against the
GOA database until a match is found. With the rapid
progress of the GOA database, it is reasonable to assume
that the homologs of the query proteins have at least one
GO term [12]. Thus, it is not necessary to use back-up
methods to handle the situation where no GO terms can
be found. The procedures are outlined in Figure 1.
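The successive search over homologs can be sketched as follows. This is a minimal illustration only: the GOA database is replaced by a hypothetical in-memory dictionary (`toy_goa`), the accession numbers are invented, and the list `blast_hits` stands in for the ranked output of a BLAST search.

```python
from typing import Dict, List, Optional, Set, Tuple


def retrieve_go_terms(
    homolog_acs: List[str],
    goa: Dict[str, Set[str]],
) -> Tuple[Optional[str], Set[str]]:
    """Try each homolog AC in ranked order and return the first one
    that has at least one GO term, together with its GO terms."""
    for ac in homolog_acs:
        terms = goa.get(ac, set())
        if terms:  # stop at the first AC with a non-empty annotation
            return ac, terms
    return None, set()  # no homolog is annotated


# Hypothetical stand-in for the GOA database (AC -> GO terms).
toy_goa = {
    "P00002": set(),                        # listed but not annotated
    "P00003": {"GO:0005634", "GO:0005737"},
}
# Hypothetical ranked BLAST hits, best homolog first.
blast_hits = ["P00001", "P00002", "P00003"]

matched_ac, go_terms = retrieve_go_terms(blast_hits, toy_goa)
print(matched_ac, sorted(go_terms))
```

In this toy run the two top hits yield no GO terms, so the third homolog supplies the annotation, which mirrors the fallback behaviour described in the text.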
Figure 1. Procedures of retrieving GO terms.
2.2. Semantic Similarity Measure
To obtain the GO semantic similarity between two
proteins, we first introduce the semantic similarity
between two GO terms. The semantic similarity
between two categories is based on their information
content. As suggested by Resnik [22], the similarity measure
of two categories relies on their most specific common
ancestor in the GO hierarchy². The semantic similarity
between two GO terms x and y is defined as [22]:

$$\mathrm{sim}(x, y) = \max_{c \in A(x, y)} \left[ -\log p(c) \right] \qquad (1)$$

where A(x, y) is the set of ancestor GO terms of both x
and y, and p(c) is the number of gene products annotated
to the GO term c divided by the number of all the gene
products annotated to the GO taxonomy.
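As a rough illustration of the Resnik measure above, the sketch below evaluates it on a tiny hand-made "is-a" hierarchy. The term names, the parent links in `PARENTS` and the probabilities in `P` are all hypothetical, and the code assumes that a term counts as its own ancestor, which the text does not specify.

```python
import math
from typing import Dict, Set

# Hypothetical toy "is-a" hierarchy: term -> set of direct parents.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "a": {"root"},
    "b": {"a"},
    "c": {"a"},
    "d": {"root"},
}
# Hypothetical annotation probabilities p(c) for each term.
P: Dict[str, float] = {"root": 1.0, "a": 0.5, "b": 0.2, "c": 0.1, "d": 0.3}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the term itself plus every term reachable via parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def resnik_sim(x: str, y: str) -> float:
    """Information content of the most informative common ancestor."""
    common = ancestors(x, PARENTS) & ancestors(y, PARENTS)
    return max(-math.log(P[c]) for c in common)


print(resnik_sim("b", "c"))  # shared ancestor "a": -log(0.5)
print(resnik_sim("b", "d"))  # only the root is shared, so the value is zero
```

Terms that share only the root receive zero similarity, because the root annotates every gene product and therefore carries no information.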
To further incorporate structural information from the
GO hierarchy, we used Lin's measure [24] to normalize
the above measure. Given two GO terms x and y,
the similarity is calculated as:

$$\mathrm{sim}(x, y) = \frac{2 \max_{c \in A(x, y)} \left[ -\log p(c) \right]}{-\log p(x) - \log p(y)} \qquad (2)$$
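The effect of Lin's normalization can be checked numerically with a short sketch. The probabilities in `p` and the common-ancestor sets are hypothetical inputs supplied by hand; returning 0.0 when both terms have zero information content is a convention assumed here, not something stated in the text.

```python
import math
from typing import Dict, Iterable


def lin_sim(x: str, y: str, common_ancestors: Iterable[str],
            p: Dict[str, float]) -> float:
    """Resnik information content of the best common ancestor,
    normalized by the information content of x and y."""
    shared_ic = max(-math.log(p[c]) for c in common_ancestors)
    denom = -math.log(p[x]) - math.log(p[y])
    # Assumed convention: similarity is 0 if neither term is informative.
    return 2.0 * shared_ic / denom if denom > 0 else 0.0


# Hypothetical probabilities; "a" is the parent of both "b" and "c".
p = {"root": 1.0, "a": 0.5, "b": 0.2, "c": 0.1}

print(lin_sim("b", "c", {"a", "root"}, p))       # about 0.354
print(lin_sim("b", "b", {"b", "a", "root"}, p))  # identical terms give 1.0
```

Unlike the unnormalized measure, the result is bounded by 1, so GO terms at different depths of the hierarchy become comparable.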
Based on the semantic similarity between two GO
terms, we adopted a continuous measure proposed in [21]
to calculate the similarity of two proteins that are
functionally annotated by sets of GO terms. Given two
proteins Pi and Pj, which are annotated by two sets of
GO terms Gi and Gj retrieved as described in Section 2.1³, we first
computed S(Gi, Gj) as follows:

$$S(G_i, G_j) = \sum_{x \in G_i} \max_{y \in G_j} \mathrm{sim}(x, y) \qquad (3)$$

where sim(x, y) is defined in Equation (2).
²The relationships between GO terms in the GO hierarchy, such as “is-a”
ancestor-child or “part-of” ancestor-child, can be obtained from the
SQL database through the link:
termdb/go_daily-termdb-tables.tar.gz. Note that only the “is-a”
relationship is considered here for semantic similarity analysis [22].
³Strictly speaking, Gi should be Gi,ki, where ki is the ki-th homolog used
to retrieve the GO terms in Section 2.1 for the i-th protein. To simplify
notation, we write it as Gi.
Then, S(Gj, Gi) is computed in the same way by
swapping Gi and Gj. Finally, the overall similarity between the
two proteins is given by:

$$SS(G_i, G_j) = \frac{S(G_i, G_j) + S(G_j, G_i)}{S(G_i, G_i) + S(G_j, G_j)} \qquad (4)$$
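The two-step computation of S and SS above can be sketched as below. The term-level similarity `toy_sim` is a hypothetical lookup table used only to keep the example self-contained; in the paper's setting it would be the normalized GO-term similarity. Both GO-term sets are assumed to be non-empty.

```python
from typing import Callable, Dict, FrozenSet, List

TermSim = Callable[[str, str], float]

# Hypothetical symmetric term-level similarities (identical terms score 1.0).
_TABLE: Dict[FrozenSet[str], float] = {
    frozenset({"t1", "t2"}): 0.5,
    frozenset({"t1", "t3"}): 0.2,
    frozenset({"t2", "t3"}): 0.4,
}


def toy_sim(x: str, y: str) -> float:
    return 1.0 if x == y else _TABLE.get(frozenset({x, y}), 0.0)


def directed_score(g_i: List[str], g_j: List[str], sim: TermSim) -> float:
    """S(G_i, G_j): for each term in G_i, take its best match in G_j."""
    return sum(max(sim(x, y) for y in g_j) for x in g_i)


def protein_similarity(g_i: List[str], g_j: List[str], sim: TermSim) -> float:
    """SS(G_i, G_j): symmetric score normalized by the self-similarities."""
    num = directed_score(g_i, g_j, sim) + directed_score(g_j, g_i, sim)
    den = directed_score(g_i, g_i, sim) + directed_score(g_j, g_j, sim)
    return num / den if den > 0 else 0.0


g_a, g_b = ["t1", "t2"], ["t2", "t3"]
print(directed_score(g_a, g_b, toy_sim))      # 0.5 + 1.0 = 1.5
print(directed_score(g_b, g_a, toy_sim))      # 1.0 + 0.4 = 1.4
print(protein_similarity(g_a, g_b, toy_sim))  # (1.5 + 1.4) / (2 + 2)
```

The directed score is not symmetric in general, which is why both directions are summed before normalization.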
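Thus, for a testing protein Qt, a GO semantic similarity
vector qt can be formulated by performing pairwise
comparisons with every training protein $\{P_i\}_{i=1}^{N}$, where
N is the number of training proteins. Then, qt can be
represented as:

$$\mathbf{q}_t = \left[ SS(\mathcal{Q}_t, G_1), \ldots, SS(\mathcal{Q}_t, G_i), \ldots, SS(\mathcal{Q}_t, G_N) \right]^{\mathsf{T}} \qquad (5)$$

where $\mathcal{Q}_t$ is the set of GO terms for the test protein Qt.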
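The construction of qt amounts to one protein-level comparison per training protein, as the following sketch shows. To keep it short and self-contained, the protein-level similarity is swapped for a plain Jaccard overlap of GO-term sets; this is a stand-in chosen for illustration and is not the measure used in the paper. All GO identifiers are invented.

```python
from typing import Callable, List, Sequence, Set

SetSim = Callable[[Set[str], Set[str]], float]


def similarity_vector(query_terms: Set[str],
                      training_term_sets: Sequence[Set[str]],
                      ss: SetSim) -> List[float]:
    """One entry per training protein: similarity of the query's GO terms
    to that training protein's GO terms."""
    return [ss(query_terms, g_i) for g_i in training_term_sets]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Stand-in similarity (NOT the paper's measure): |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical GO-term sets of N = 3 training proteins.
training = [{"GO:1", "GO:2"}, {"GO:2", "GO:3"}, {"GO:4"}]
q_t = similarity_vector({"GO:1", "GO:2"}, training, jaccard)
print(q_t)
```

The dimension of the vector equals the number of training proteins N, so every query is described relative to the training set rather than by raw GO-term occurrences.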
2.3. Multi-Label Multi-Class SVM Classification
To predict the subcellular locations of both single-label
and multi-label proteins, a multi-label support vector
machine (SVM) classifier is proposed in this paper.
Specifically, denote the GO semantic similarity vector of the
t-th query protein as qt. Then, given the t-th query protein
Qt, the score of the m-th SVM is

$$s_m(Q_t) = \sum_{r \in S_m} \alpha_{m,r} \, y_{m,r} \, K(\mathbf{p}_r, \mathbf{q}_t) + b_m \qquad (6)$$

where Sm is the set of support vector indexes
corresponding to the m-th SVM, pr are the support vectors,
αm,r are the Lagrange multipliers, ym,r ∈ {−1, +1} are the
class labels, bm is the bias, and K(·,·) is a kernel function;
here, the linear kernel is used.
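For a single SVM with a linear kernel, the score is a weighted sum of dot products plus a bias. The sketch below evaluates it in plain Python; the support vectors, multipliers, labels and bias are made-up numbers, not values from a trained model.

```python
from typing import Sequence


def linear_kernel(p: Sequence[float], q: Sequence[float]) -> float:
    """K(p, q) = <p, q> for the linear kernel."""
    return sum(p_k * q_k for p_k, q_k in zip(p, q))


def svm_score(q_t: Sequence[float],
              support_vectors: Sequence[Sequence[float]],
              alphas: Sequence[float],
              labels: Sequence[int],
              bias: float) -> float:
    """Sum over support vectors of alpha * y * K(p_r, q_t), plus the bias."""
    return sum(
        a * y * linear_kernel(p_r, q_t)
        for a, y, p_r in zip(alphas, labels, support_vectors)
    ) + bias


# Hypothetical parameters of one (the m-th) binary SVM.
score = svm_score(
    q_t=[2.0, 4.0],
    support_vectors=[[1.0, 0.0], [0.0, 1.0]],
    alphas=[0.5, 0.25],
    labels=[+1, -1],
    bias=0.1,
)
print(round(score, 3))  # 0.5*2 - 0.25*4 + 0.1 = 0.1
```

One such score is computed per subcellular location, and the set of scores is what the decision scheme operates on.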
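Unlike the single-label problem where each protein
has one predicted label only, a multi-label protein could
have more than one predicted label. Thus, the predicted
subcellular location(s) of the t-th query protein are given by:

$$M(Q_t) = \begin{cases} \{\, m : s_m(Q_t) > 0 \,\}, & \text{if } \exists\, m \text{ such that } s_m(Q_t) > 0, \\ \arg\max_{m} s_m(Q_t), & \text{otherwise.} \end{cases} \qquad (7)$$

That is, every location whose SVM score is positive is predicted; if no
score is positive, the single location with the highest score is chosen.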
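The decision scheme can be written in a few lines. The sketch below assumes that labels are numbered from 1, as in Table 1, and the score lists are invented for illustration.

```python
from typing import List, Sequence


def predict_labels(scores: Sequence[float]) -> List[int]:
    """Return all labels with a positive SVM score; if there are none,
    fall back to the single label with the highest score."""
    positive = [m for m, s in enumerate(scores, start=1) if s > 0]
    if positive:
        return positive
    best = max(range(len(scores)), key=lambda m: scores[m])
    return [best + 1]  # convert the 0-based index to a 1-based label


print(predict_labels([0.3, -1.2, 0.8]))    # two positive scores -> [1, 3]
print(predict_labels([-0.5, -0.1, -2.0]))  # no positive score -> [2]
```

The fallback guarantees that every query protein receives at least one location, while the positive-score rule lets a protein receive several.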
3. Results
3.1. Dataset and Performance Metrics
In this paper, the plant dataset used in Plant-mPLoc [14],
iLoc-Plant [16] and mGOASVM [27] was used to
evaluate the performance of the proposed predictor. The
plant dataset was created from Swiss-Prot 55.3. It
contains 978 plant proteins distributed in 12 locations. Of the
978 plant proteins, 904 belong to one subcellular location,
71 to two locations, 3 to three locations and none to four
or more locations. In other words, about 8% of the plant
proteins in this dataset are located in multiple locations. The
sequence identity of this dataset was cut off at 25%.
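The dataset figures can be cross-checked with simple arithmetic. The sketch below uses only the counts quoted in the paragraph above; the variable names are illustrative. Counting each protein once per location gives 1055, which matches the denominator of the overall locative accuracy in Table 1.

```python
# Number of proteins having exactly k locations, as quoted in the text.
proteins_by_location_count = {1: 904, 2: 71, 3: 3}

# Each protein counted once.
n_proteins = sum(proteins_by_location_count.values())

# Each protein counted once per location it belongs to.
n_protein_location_pairs = sum(
    k * n for k, n in proteins_by_location_count.items()
)

# Share of proteins with two or more locations.
multi_fraction = (
    proteins_by_location_count[2] + proteins_by_location_count[3]
) / n_proteins

print(n_proteins)                # 978
print(n_protein_location_pairs)  # 904 + 142 + 9 = 1055
print(f"{multi_fraction:.1%}")   # 7.6%, i.e. roughly 8%
```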
To facilitate comparison, the locative accuracy [28]
and the actual accuracy were used to assess the
prediction performance. Specifically, denote L(pi) and M(pi) as
the true label set and the predicted label set for the i-th
protein pi (i = 1, …, Nact), respectively. Then, the overall
locative accuracy is:

$$\Lambda_{loc} = \frac{1}{N_{loc}} \sum_{i=1}^{N_{act}} \left| M(p_i) \cap L(p_i) \right| \qquad (8)$$

where |·| means counting the number of elements in the
set therein, ∩ represents the intersection of sets, Nact
represents the total number of actual proteins and Nloc
represents the total number of locative proteins. The
overall actual accuracy is:

$$\Lambda_{act} = \frac{1}{N_{act}} \sum_{i=1}^{N_{act}} \Delta\left[ M(p_i), L(p_i) \right] \qquad (9)$$

where

$$\Delta\left[ M(p_i), L(p_i) \right] = \begin{cases} 1, & \text{if } M(p_i) = L(p_i), \\ 0, & \text{otherwise.} \end{cases}$$

Note that the actual accuracy is more objective and
stricter than the locative accuracy [27].
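The difference between the two metrics is easiest to see on a small example. In the sketch below the true and predicted label sets are invented; the second protein is under-predicted and the third is over-predicted.

```python
from typing import Sequence, Set


def locative_accuracy(true_sets: Sequence[Set[int]],
                      pred_sets: Sequence[Set[int]]) -> float:
    """Correctly predicted locations divided by the total number of
    true protein-location pairs (N_loc)."""
    n_loc = sum(len(t) for t in true_sets)
    hits = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    return hits / n_loc


def actual_accuracy(true_sets: Sequence[Set[int]],
                    pred_sets: Sequence[Set[int]]) -> float:
    """Fraction of proteins whose predicted label set equals the true
    label set exactly (denominator N_act)."""
    exact = sum(1 for t, p in zip(true_sets, pred_sets) if t == p)
    return exact / len(true_sets)


# Hypothetical true and predicted label sets for three proteins.
true_labels = [{1}, {2, 3}, {4}]
pred_labels = [{1}, {2}, {4, 5}]

print(locative_accuracy(true_labels, pred_labels))  # 3 of 4 pairs: 0.75
print(actual_accuracy(true_labels, pred_labels))    # 1 of 3 proteins
```

The over-predicted third protein still counts as a hit for the locative accuracy but not for the actual accuracy, which illustrates why the latter is the stricter measure.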
3.2. Comparing with State-of-the-Art Predictors
Table 1 compares the performance of the proposed
predictor against three state-of-the-art multi-label predictors
on the plant dataset. Plant-mPLoc [14], iLoc-Plant [16]
and mGOASVM [27] use the accession numbers of
homologs returned from BLAST [26] as search keys to
retrieve GO terms from the GOA database. For a fair
comparison with these predictors, the performance of our
proposed predictor shown in Table 1 was obtained by
using the accession numbers of homologous proteins as
the search keys. Unlike Plant-mPLoc and iLoc-Plant,
however, our predictor uses the ACs of the homologous
proteins, as returned by the BLAST search, successively
to search against the GOA database until a match is
found (see Figure 1 for details).
As shown in Table 1, our proposed predictor performs
significantly better than Plant-mPLoc and iLoc-Plant.
Both the overall locative accuracy and the overall actual
accuracy of our proposed predictor are more than 20% (absolute)
higher than those of iLoc-Plant (97.9% vs 71.7% and 89.6% vs
68.1%, respectively). Our proposed predictor also
performs better than mGOASVM in terms of both the
overall actual accuracy (89.6% vs 87.4%) and the overall
locative accuracy (97.9% vs 96.2%). As for the individual
locative accuracies, those of our proposed predictor are
higher than those of Plant-mPLoc and iLoc-Plant for all
12 locations, and are at least as high as those of mGOASVM
for all locations except Golgi apparatus and nucleus.
In terms of GO information extraction, Plant-mPLoc,
iLoc-Plant and mGOASVM only exploit the occurrences
of GO terms, whereas the proposed predictor exploits
the semantic relationships between GO terms, based on
which the semantic similarity between proteins (from the
GO annotation perspective) can be obtained. The
superior performance of the proposed predictor suggests
that the semantic similarity over Gene Ontology is
conducive to the prediction of multi-label protein
subcellular localization.

Table 1. Comparing the proposed predictor with state-of-the-art multi-label predictors based on leave-one-out cross
validation (LOOCV). Entries for labels 1 to 12 are LOOCV locative accuracies. “-” means the corresponding references do not provide the overall actual accuracy.

| Label | Subcellular Location | Plant-mPLoc [14] | iLoc-Plant [16] | mGOASVM [27] | Proposed Predictor |
|---|---|---|---|---|---|
| 1 | Cell membrane | 24/56 = 42.9% | 39/56 = 69.6% | 53/56 = 94.6% | 55/56 = 98.2% |
| 2 | Cell wall | 8/32 = 25.0% | 19/32 = 59.4% | 27/32 = 84.4% | 28/32 = 87.5% |
| 3 | Chloroplast | 248/286 = 86.7% | 252/286 = 88.1% | 272/286 = 95.1% | 285/286 = 99.7% |
| 4 | Cytoplasm | 72/182 = 39.6% | 114/182 = 62.6% | 174/182 = 95.6% | 175/182 = 96.2% |
| 5 | Endoplasmic reticulum | 17/42 = 40.5% | 21/42 = 50.0% | 38/42 = 90.5% | 40/42 = 95.2% |
| 6 | Extracellular | 3/22 = 13.6% | 2/22 = 9.1% | 22/22 = 100.0% | 22/22 = 100.0% |
| 7 | Golgi apparatus | 6/21 = 28.6% | 16/21 = 76.2% | 19/21 = 90.5% | 18/21 = 85.7% |
| 8 | Mitochondrion | 114/150 = 76.0% | 112/150 = 74.7% | 150/150 = 100.0% | 150/150 = 100.0% |
| 9 | Nucleus | 136/152 = 89.5% | 140/152 = 92.1% | 151/152 = 99.3% | 150/152 = 98.7% |
| 10 | Peroxisome | 14/21 = 66.7% | 6/21 = 28.6% | 21/21 = 100.0% | 21/21 = 100.0% |
| 11 | Plastid | 4/39 = 10.3% | 7/39 = 17.9% | 39/39 = 100.0% | 39/39 = 100.0% |
| 12 | Vacuole | 26/52 = 50.0% | 28/52 = 53.8% | 49/52 = 94.2% | 50/52 = 96.2% |
| | Overall Locative Accuracy | 672/1055 = 63.7% | 756/1055 = 71.7% | 1015/1055 = 96.2% | 1033/1055 = 97.9% |
| | Overall Actual Accuracy | - | 666/978 = 68.1% | 855/978 = 87.4% | 876/978 = 89.6% |
4. Conclusions and Future Works
This paper proposes a new multi-label predictor based on
Gene Ontology semantic similarity to predict the
subcellular locations of multi-label proteins. By using the
accession numbers of the homologs of the query proteins as
keys to search against the GO annotation
database, the GO terms of each query protein are
retrieved. The semantic similarity between GO terms is
then exploited to formulate a GO semantic similarity
vector for every query protein. The feature vectors are
subsequently classified by support vector machine (SVM)
classifiers equipped with a decision strategy that can
produce multiple class labels for a query protein.
Experimental results demonstrate that the proposed
predictor can effectively predict the subcellular locations
of multi-label proteins. It was also found that the
exploitation of the semantic similarity over Gene Ontology
is conducive to multi-label protein subcellular localization
prediction. There are many different methods [20,22,23]
for measuring the GO semantic similarity. The semantic
similarity measure used in this paper may not be the best
for protein subcellular localization. Therefore, as future
work, it is of interest to develop a similarity measure that
is more relevant to subcellular localization.
5. Acknowledgements
This work was supported in part by the HK RGC Grant
No. PolyU5264/09E and HKPolyU Grant No. G-YJ86.
We also thank Lixin Cheng for his early effort on the perl
References
[1] K. C. Chou and Y. D. Cai, “Predicting Protein Localization
in Budding Yeast,” Bioinformatics, Vol. 21, 2005, pp.
[2] H. Nakashima and K. Nishikawa, “Discrimination of
Intracellular and Extracellular Proteins Using Amino Ac-
id Composition and Residue-Pair Frequencies,” Journal
of Molecular Biology, Vol. 238, 1994, pp. 54-61.
[3] K. C. Chou, “Prediction of Protein Cellular Attributes Us-
ing Pseudo Amino Acid Composition,” Proteins: Struc-
ture, Function, and Genetics, Vol. 43, 2001, pp. 246-255.
[4] O. Emanuelsson, H. Nielsen, S. Brunak and G. von Hei-
jne, “Predicting Subcellular Localization of Proteins Bas-
ed on Their N-Terminal Amino Acid Sequence,” Journal
of Molecular Biology, Vol. 300, No. 4, 2000, pp. 1005-
[5] H. Nielsen, J. Engelbrecht, S. Brunak and G. von Heijne,
“A Neural Network Method for Identification of Proka-
ryotic and Eukaryotic Signal Peptides and Prediction of
Their Cleavage Sites,” International Journal of Neural
Systems, Vol. 8, 1997, pp. 581-599.
[6] M. W. Mak, J. Guo and S. Y. Kung, “PairProSVM: Protein
Subcellular Localization Based on Local Pairwise
Profile Alignment and SVM,” IEEE/ACM Transactions
on Computational Biology and Bioinformatics, Vol. 5, No.
3, 2008, pp. 416-422.
[7] S. Wan, M. W. Mak and S. Y. Kung, “Protein Subcellular
Localization Prediction Based on Profile Alignment and
Gene Ontology,” 2011 IEEE International Workshop on
Machine Learning for Signal Processing (MLSP’11),
September 2011, pp. 1-6.
[8] K. C. Chou and Y. D. Cai, “Prediction of Protein Subcel-
lular Locations by GO-FunD-PseAA Predictor,” Bioche-
mical and Biophysical Research Communications, Vol.
320, 2004, pp. 1236-1239.
[9] S. Wan, M. W. Mak and S. Y. Kung, “GOASVM: A Sub-
cellular Location Predictor by Incorporating Term-Fre-
quency Gene Ontology into the General Form of Chou’s
Pseudo-Amino Acid Composition,” Journal of Theoreti-
cal Biology, Vol. 323, 2013, pp. 40-48.
[10] K. C. Chou and H. B. Shen, “Predicting Eukaryotic Pro-
tein Subcellular Location by Fusing Optimized Evidence-
Theoretic K-Nearest Neighbor Classifiers,” Journal of
Proteome Research, Vol. 5, 2006, pp. 1888-1897.
[11] S. Wan, M. W. Mak and S. Y. Kung, “Adaptive Thresholding
for Multi-Label SVM Classification with Application
to Protein Subcellular Localization Prediction,”
2013 IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP’13), 2013, pp. 3547-
[12] S. Mei, “Multi-Label Multi-Kernel Transfer Learning for
Human Protein Subcellular Localization,” PLoS ONE,
Vol. 7, No. 6, 2012, Article ID: e37716.
[13] S. Wan, M. W. Mak and S. Y. Kung, “GOASVM: Protein
Subcellular Localization Prediction Based on Gene On-
tology Annotation and SVM,” 2012 IEEE International
Conference on Acoustics, Speech, and Signal Processing
(ICASSP’12), 2012, pp. 2229-2232.
[14] K. C. Chou and H. B. Shen, “Plant-mPLoc: A Top-Down
Strategy to Augment the Power for Predicting Plant Pro-
tein Subcellular Localization,” PLoS ONE, Vol. 5, 2010,
Article ID: e11335.
[15] H. B. Shen and K. C. Chou, “Virus-mPLoc: A Fusion
Classifier for Viral Protein Subcellular Location Predic-
tion by Incorporating Multiple Sites,” Journal of Biomo-
lecular Structure & Dynamics, Vol. 26, 2010, pp. 175-
[16] Z. C. Wu, X. Xiao and K. C. Chou, “iLoc-Plant: A Mul-
ti-Label Classifier for Predicting the Subcellular Locali-
zation of Plant Proteins with Both Single and Multiple
Sites,” Molecular BioSystems, Vol. 7, 2011, pp. 3287-
[17] X. Xiao, Z. C. Wu and K. C. Chou, “iLoc-Virus: A Mul-
ti-Label Learning Classifier for Identifying the Subcellu-
lar Localization of Virus Proteins with Both Single and
Multiple Sites,” Journal of Theoretical Biology, Vol. 284,
2011, pp. 42-51.
[18] M. Zhu, L. Gao, Z. Guo, Y. Li, D. Wang, J. Wang and C.
Wang, “Globally Predicting Protein Functions Based on
Co-Expressed Protein-Protein Interaction Networks and
Ontology Taxonomy Similarities,” Gene, Vol. 391, No.
1-2, 2007, pp. 113-119.
[19] Z. Lei and Y. Dai, “Assessing Protein Similarity with
Gene Ontology and Its Use in Subnuclear Localization
Prediction,” BMC Bioinformatics, Vol. 7, 2006, p. 491.
[20] X. Wu, L. Zhu, J. Guo, D. Y. Zhang and K. Lin, “Prediction
of Yeast Protein-Protein Interaction Network: Insights
from the Gene Ontology and Annotations,” Nucleic
Acids Research, Vol. 34, No. 7, 2006, pp. 2137-2150.
[21] D. Yang, Y. Li, H. Xiao, Q. Liu, M. Zhang, J. Zhu, W.
Ma, C. Yao, J. Wang, D. Wang, Z. Guo and B. Yang,
“Gaining Confidence in Biological Interpretation of the
Microarray Data: The Functional Consistence of the
Significant GO Categories,” Bioinformatics, Vol. 24, No. 2,
2008, pp. 265-271.
[22] P. Resnik, “Semantic Similarity in a Taxonomy: An In-
formation-Based Measure and Its Application to Prob-
lems of Ambiguity in Natural Language,” Journal of Ar-
tificial Intelligence Research, Vol. 11, 1999, pp. 95-130.
[23] P. W. Lord, R. D. Stevens, A. Brass and C. A. Goble,
“Investigating Semantic Similarity Measures across the
Gene Ontology: The Relationship between Sequence and
Annotation,” Bioinformatics, Vol. 19, No. 10, 2003, pp.
[24] D. Lin, “An Information-Theoretic Definition of Similar-
ity,” Proceedings of the 15th International Conference on
Machine Learning, 1998, pp. 296-304.
[25] C. Pesquita, D. Faria, A. O. Falcao, P. Lord and F. M.
Couto, “Semantic Similarity in Biomedical Ontologies,”
PLoS Computational Biology, Vol. 5, No. 7, 2009, Article
ID: e1000443.
[26] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z.
Zhang, W. Miller and D. J. Lipman, “Gapped BLAST
and PSI-BLAST: A New Generation of Protein Database
Search Programs,” Nucleic Acids Research, Vol. 25, 1997,
pp. 3389-3402.
[27] S. Wan, M. W. Mak and S. Y. Kung, “mGOASVM: Multi-Label
Protein Subcellular Localization Based on Gene
Ontology and Support Vector Machines,” BMC
Bioinformatics, Vol. 13, 2012, p. 290.
[28] K. C. Chou and H. B. Shen, “Recent Progress in Protein
Subcellular Location Prediction,” Analytical Biochemistry,
Vol. 370, No. 1, 2007, pp. 1-16.