Protein-Protein Interaction Extraction Based on Convex Combination Kernel Function

doi:10.4236/jcc.2013.15002

Paper Menu >>

Journal Menu >>

Journal of Computer and Communications, 2013, 1, 9-13

http://dx.doi.org/10.4236/jcc.2013.15002 Published Online October 2013 (http://www.scirp.org/journal/jcc)

Protein-Protein Interac ti on Extraction Based on Convex

Combination Kernel Function

Peng Chen, Jianyi Guo, Zhengtao Yu, Sichao Wei, Feng Zhou, Xin Yan

1The School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China; 2Key

Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming, China.

Email: gjade86@hotmail.com

Received August 2013

ABSTRACT

Owing to the effect of classified models was different in Protein-Protein Interaction (PPI) extraction, which was made

by different single kernel functions, and only using single kernel function hardly trained the optimal clas sif ied model to

extract PPI, this paper presents a strategy to find the optimal kernel function from a kernel function set. The strategy is

that in the kernel function set which consists of different single kernel functions, endlessly finding the last two kernel

functions on the performance in PPI extraction, using their optimal kernel function to replace them, until there is only

one kernel function and it’s the final opti mal kernel f unction . Finall y, ex tracting PPI using th e classif ied model made by

this kernel function. This paper conducted the PPI extraction experiment on AIMed corpus, the experimental result

shows that the optimal convex combination kernel function this paper presents can effectively improve the extraction

performance than single kernel function, and it gets the best precision wh ich reach es 65.0 among the similar PPI extra c-

tion systems.

Keywords: Protein-Protein Interaction; Support Vector Machine; Convex Combination Kernel Function

1. Introduction

That the presented protein entities exist interaction rela-

tion of the article in biomedical field is called Protein-

Protein Interaction. PPI is vital for knowing the single

proteins and the organization of the entire biological

process. Recently, PPI becomes an important and hot

task in biomedical field. There is a surge of research in-

terest in PPI.

Early studies on PPI ex traction mostly employ the ma-

chine learning method, it includes the convolution kernel

based method [1-3] and feature based method [4-7]. The

main difference of these two methods is the way they get

their high dimension matrixes. The convolution kernel-ba-

sed method expresses the sentence as string [2], tree [1],

graph [3] or other structured ways, and determine the

high dimension matrix by counting the number of same

substructure in two sentences. The feature based method

needs select features firstly, Miyao, Y. et al. [5] employ-

ed three syntactic analysis methods and made syntactic

information as the feature, Liu B et al. [6] used depen-

dent information as features, Bui Q-C et al. [7 ] used key-

words as the feature. When features have been determin-

ed, the method expresses the sentence as feature vector,

then gets the high dimension matrix mapping by com-

mon kernel functions. For the high computation comple-

xity, the convolution kernel based method or the compo-

site method which includes the convolution kernel-based

method is not adaptive in the practical PPI extraction

system. The feature based method which can get the high

dimension matrix at a high rate of speed becomes the

mainstream. What calls for special atten tion is that in the

feature-based method, the high dimension matrix is the

only information in getting the classified model by train-

ing, so the selection of the kernel function is crucial in

PPI extraction. Niu Y et al. [4] and Bui Q -C et a l. [7] re-

spectively used a linear kernel function and radial basis

function (RBF Kernel) as the kernel function. The effect

of classified models was different in PPI extraction, which

was made by different single different kernel functions,

and only using single kernel function hardly trained the

optimal classified model to extract PPI.

For the above problem, this paper presents a strategy

that finds the optimal kernel function from a single ker-

nel function set consists of some single kernel functions.

On the basis of corpus pretreatment, selection features

and getting high dimension matrixes by mapping from

different single kernel functions, then achieving their

classified models by training. The next step is that en-

dlessly finding the last two kernel functions on the per-

formance in PPI extraction, using their optimal kernel

Protein-Protein Interaction Extraction Based on Convex Combination Kernel Function

function to replace them, until there is only one kernel

function and it’s the final op timal kernel function. Finally,

extracting PPI using the classified model made by this

kernel function. This paper conducted the PPI extraction

experiment on AIMed corpus, the experimental result

shows that the optimal convex combination kernel func-

tion this paper presents can effectively improve the ex-

traction performance than single kernel function.

2. The PPI Extraction Method Based on

Convex Combination Kernel Function

2.1. Preprocessing

Protein entity recognition is the basis of the PPI extrac-

tion work. This paper uses AIMed corpus which has been

labeled protein entities, and therefore don’t consider is-

sues related to protein entity recognition. We have to

pretreat AIMed corpus before select features.

• In order to avoid the interference of PPI extraction by

the protein text information, we use PROT1, PROT2

respectively to represent the first protein and the se-

cond protein in the sentence, the rest of the protein

are expressed as PROT.

• Using OAOD (One Answer per Occurrence in a Do-

cument) principle processes the relationship of pro-

tein, that is, each occurrence of protein relations from

the corpus are considered to be unique.

2.2. Feature Selection

• Location Information

Location information expresses proteins position in a

sentence, using the location of “P” of protein PROT1 or

PROT2.

• Local context information

Local context information includes left word of the

first protein entity and number of other protein on the left

side of the entity, and right word of the second protein

entity and number of other protein on the right side of the

entity, as well as words between the first protein entity

and the second protein entity and number of other protein

entity. Because multiple words provide similar informa-

tion, the number of words will be divided into four levels:

no word, one word, two words, greater than or equal to

three words. In addition protein context of speech may

also contain information on the interactions between the

protein, we extract three words of speech before and after

each prot e i n in the protein pair.

• Keywords information

In the extraction of protein interaction relations, some

special keyword provides important information. In this

paper, use a list of key words which provided by [10] to

structure keyword dictionary. We take the number of

keywords as features, and the same, divided into no

keyword, a keyword, greater than or equal to two key-

word as feature.

• Interactions sentence information

Interactions sentence information indicates whether

there are other protein pairs in the sentence with protein

pair.

• Phrase syntax information

Phrase syntactic tree reflects the grammatical structure

of sentences and can express semantic information over

long distances. For example“PROT1, PROT and PROT2

levels were statistically greater in patients.”Phrase syn-

tactic tree is shown in Fig ure 1. The phrase syntactic

features selected specifically:

Syntactic tree feature 1: the common root class of two

proteins

Syntactic tree feature 2: the number of nodes forms the

first protein to the root protein

Syntactic tree feature 3: the number of nodes forms the

second pr ot e in to the root prot e in

• Dependent information

Dependent information can be revealed of the relation

of long-distance dependencies in the sentence, and can

avoid noise arising from the unstructured characteristics.

In [6] it proves that the dependent information as charac-

teristics can effectively improve the effect of PPI extrac-

tion. For example “Collectively, PROT1 is yet another

functional ligand for PROT2.” Dependent information is

shown in Figure 2. Observed dependency information,

whether there is dependence between protein entity and

in [10] it presented the keyword may provide information

for PPI extraction, therefore we add two features on the

basis of Liu et al. [6] raised the dependency information

feature, respectively:

Dependent information feature 1: Whether there is a

direct de pendenc y PR OT1 and key word.

Dependent information feature 2: Whether there is a

direct de pendenc y PR OT2 and key word.

2.3. The Optimal Convex Combination Kernel

Function

Before looking for optimal combination kernel function

Figure 1. An instance of the phrase syntactic tree.

Protein-Protein Interaction Extraction Based on Convex Combination Kernel Function

Figure 2. An instance of the dependent information.

is convex, the first thing you need is to get the PPI per-

formance of each single kernel function. Under the above

feature selection strategy, then we can convert all in-

stances to feature matrixes. In order to obtain different

PPI extraction performance of the single kernel function,

the high dimension matrixes are obtained by different

feature matrixes mapping from single kernel function,

the high dimensional matrix after training can obtain the

different classification models. Finally using the test cor-

pus and we can achieve the PPI extraction of each single

kernel function.

For convex combination kernel function, the determi-

nation of convex combination ratio parameter is a very

important problem. Suppose there are n kinds of kernel

function, such as k1, …, kn. The convex combination

kernel (CCK, Convex Combination Kernel) function

consists by the n kinds of single kernel functions is

shown in type (1).

is called the convex combination

ratio parameter,

∑

CCKa k

∑

(1)

If the convex combination parameter selection may

have m, the computational complexity is O (mn). If the

parameter selection of the range is 0 to 1, step length is

0.1, and there ar e four kinds of kernel functions, that is to

say, m is 11, n is 4, and there are 14641 times you will

need to experiment to determine the optimal parameters

of kernel function, Obviously, O (mn), such a huge com-

putational complexity is unacceptable in practical appli-

cation. Aiming at this problem, this paper presents some

principles constantly looking for two kinds of kernel

function is the optimal convex combination kernel func-

tion instead of these two kinds of kernel function strategy,

for a variety of kernel function is the optimal kernel

function is convex combination. This strategy can be de-

creased from computational complexity O (m * (n − 1)),

under the condition of just to find the optimal convex

combination kernel function only needs 33 experiments.

As for kernel function with the worst performance for the

principle, to obtain the optimal algorithm of convex com-

bination kernel function is shown in Figure 3.

The optimal convex combinati on kernel functi on generation algor it hm

Algorithm input: N kinds of kernel function is a collection of PPI ex-

traction performance F, F = {Fi, i = 1,2….,n.}

Algorithm output: The optimal kernel function is convex combina-

tion(Optimal Convex Combination Kernel, OCCk)

Algorithm steps:

While(number(F)>1)

candidate1←mi n(F )

F=F-min (F )

candidate2←min(F)

F=F-min (F )

F=F+Optima l (cand idate1,candidat e2)

if number(F)=1

then Return Optima l (candidate1,candidate2)

△Number (F) is the numbe r o f F i in F

△min (F) is the minimum value in F

△Optimal (k1, k2) is the Optimal kernel function from k1,k2

Figure 3. The optimal convex combination kernel function

generation algorithm.

3. Experiment and Results

This Paper uses AIMed corpus as the experimental data.

AIMed is the corpus of PPI extraction used most fre-

quently. It contains 225 MEDLINE abstracts, there are

177 abstracts contain PPI instance, and 48 abstracts does

not contain the PPI instance, referring to 4084 proteins

entities. After preprocessing, this paper removes 59 au-

tocorrelation PPI instances, retains 154 nested instances.

Finally, there are 1000 positive instances and 4084 nega-

tive instances. In [11,12], they verified that the SVM

classification model have better effect in PPI extraction,

therefore, this paper also uses the SVM as a machine

learning method. In addition, in order to facilitate com-

parison with other experiments, the paper uses 10 times

the cross validation method. This paper uses precision,

recall and F1 Value to evaluate the r e sult.

3.1. Experimental Method

The paper sets up three groups of experiments:

• Experiment 1 will test different kernel functions for

the same characteristics in terms of PPI extraction

performances are different. Experiment will get all

the PPI extraction performance of the single kernel

function, and sort them, then search for the optimal

preparation convex combination kernel function. This

paper uses three kinds of single kernel function in-

cluding: Radial Basis Function, Polynomial Kernel

Function, Linear kernel. Respectively, such as type

(2), (3), (4), where x and y are two arbitrary dimen-

sion vector.

( )

,exp ||

RBF

Kxyx y= −−

(2)

( )

polynomial

Kxyxy= ⋅

(3)

( )

linear

Kx yxy= ⋅

(4)

• Experiment 2 will compare three optimal convex

Protein-Protein Interaction Extraction Based on Convex Combination Kernel Function

combination principle of kernel function to find the

best, and verify that the proposed model of optimal

classification of convex combination kernel function

training is better than single kernel function training

out of classification model. Principle 1: Find two

worst performances for the optimal kernel function is

convex combination to replace the original kernel

function. Principle 2: look for two performances best

optimal convex combination of nuclear function to

replace the original kernel function. Principle 3: look

for a better performance and a worst performance op-

timal kernel function is convex combination to re-

place the original kernel function.

• Experiment 3 will compare other systems in PPI ex-

traction.

3.2. Experimental Results and Analysis

• Extraction performance of different single kernel

functions

Table 1 shows that between the RBF kernel, poly-

nomial kernel function and linear kernel function, RBF

kernel function gets the best performance in PPI extrac-

tion. The results of this sort will prepare future experi-

ments. At the same time, Table 1 also shows that differ-

ent kernel functions on the same characteristics have d if-

ferent PPI extraction performance.

• Compared with different principles for finding the

optimal kernel

Table 2 shows that, in accordance with the strategy

that finding the optimal convex combination kernel from

two worst kernels and using this optimal kernel instead

of them can get the best PPI extraction performance. But

in accordance with the strategy that finding the optimal

convex combination kernel from two best kernels and

using this optimal kernel instead of them can get the

worst PPI extraction performance. In addition, PPI ex-

traction performance of each principle is higher than

RBF kernel, which gets the best PPI extraction perfor-

mance in all single kernels.

• Compa re d with othe r sy s t e ms

Table 3 shows that, first ly, the proposed method gets

the best precision at 65% in all feature based systems,

and its F1 Value is close to 60. Niu et al. [4] is using a

linear kernel, Bui et al. [7] using the RBF kernel function.

They both use single kernel function. Compared to the

single kernel function, the convex combination kernel

function can increase the PPI extraction precision. Se-

condly, the F1 value on the feature based method has

surpassed it on convolution kernel based method, fur-

thermore, the feature based method is much faster than

the convolution kernel based method, so the feature-bas-

ed method will become the mainstream in the future re-

search in PPI extraction.

Table 1. PPI Extraction performance on different single

kernel functions.

Kernel method P/% R/% F1/%

KRBF 56.8 51.3 53.9

Kpolynomial 53.2 49.6 51.3

Klinear 51.2 45.0 50.6

Table 2. PPI Extraction performance on different prin-

ciples.

Principle method P/% R/% F1/%

Principle 1 (two best performance) 58.2 52.4 55.1

Principle 2 (two worst performance)

65.0 55.3 59.8

Principle 3 (one best one worst) 57.3 53.9 55.5

KRBF 56.8 51.3 53.9

Table 3. Compared with other systems.

System P/% R/% F1/%

Feature-based method

This paper proposed 65.0 55.3 59.8

Miyao et al. [5] - - 59.5

Liu et al. [6] 63.4 44.1 52.0

Niu et al. [4] 43.2 70.2 53.5

Bui et al. [7] 55.3 68.5 61.2

Convolution kernel-based method

Qian et al. [1] 59.1 57.6 58.1

Kim et al. [2] 61.4 53.3 56.6

Airola et al. [3] 52.9 61.8 56.4

Composite kernel method

Sætre et al. [8] 64.3 44.1 52.0

Sætre et al. [9] - - 64.2

4. Conclusion

In this paper, in accordance with the principle which is

searching for the optimal kernel function for two single

kernel functions with the worst performance instead of

the two of kernel functions. We find the optimal concen-

tration of convex combination kernel function in the

kernel function set which is composed of RBF kernel,

polynomial kernel function and linear kernel function

three single-core kernel function. Experiments on AIMed

corpora, with this optimal convex combination of kernel

function trained classifier model can improve perfor-

mance of the PPI extraction compared to single kernel

function classification model, and obtained the highest

accuracy rate of 65.0% in the same system. In the next

step’s work, we will look for more effective features, and

find out other work to solve a single kernel function li-

mitation in the PPI extraction method based on the cha-

Protein-Protein Interaction Extraction Based on Convex Combination Kernel Function

racteristic method.

5. Acknowledgements

This research was supported by the National Natural

Science Foundation of China (NSFC) under Grant Nos.

61175068, No.6126041.

REFERENCES

[1] L. H. Qian and G. D. Zhou, “Tree Kernel-Based Protein-

Protein Interaction Extraction from Biomedical Literature

[J],” Journal of Biomedical Informatics, Vol. 45, No. 3,

2012, pp. 535-543.

http://dx.doi.org/10.1016/j.jbi.2012.02.004

[2] S. Kim, J. Yoon, J. Yang and S. Park, “Walk-Weighted

Subsequence Kernels for Protein-Protein Interaction Ex-

traction [J],” MBC Bioinformatics, Vol. 11, 2010, p. 107.

[3] A. Airola, S. Pyys alo, J. Bjrne, T. Pahikkala, F. Ginter and

T. Salakoski, “All-Paths Graphkernel for Protein-Protein

Interaction Extraction with Evaluation of Crosscorpus

Learning [J],” BMC Bioinformatics, Vol. 9, No. S1, 2008.

[4] Y. Niu, D. Otasek and I. Jurisica, “Evaluation of Linguis-

tic Features Useful in Extraction of Interactions from Pub-

Med; Application to Annotating Known, High-Through-

put and Predicted Interactions in I2D [J],” Bioinformatics,

Vol. 26, No. 1, 2020, pp. 111-119.

[5] Y. Miyao, et al., “Evaluating Contributions of Natural

Language Parsers to Protein-Protein Interaction Extrac-

tion [J],” Bioinformatics, Vol. 25, 2009, pp. 394-400.

http://dx.doi.org/10.1093/bioinformatics/btn631

[6] B. Liu, L. H. Qian, H. L. Wang and G. D. Zhou, “Depen-

dency-Driven Feature-Based Learning for Extracting Pro-

tein-Protein Interactions from Biomedical Text,” Procee-

dings of COLING’, Poster, 2010, pp. 757-765.

[7] Q.-C. Bui, S. Katrenko and P. M. A. Sloot, “A Hybrid

Approach to Extract Protein-Protein Interactions [J],” Bi-

oinformatics, Vol. 27, No. 2, 2011, pp. 259-265.

http://dx.doi.org/10.1093/bioinformatics/btq620

[8] R. Sætre, K. Sagae and J. Tsujii, “Syntactic Features for

Protein-Protein Interaction Extraction,” Proceedings of

LBM’07, Vol. 319, 2007, pp. 6. 1-6.14.

[9] R. Sætre, K. Yoshida, M. Miwa, T. Matsuzaki, Y. Kano

and J. Tsujii, “Extracting Protein Interactions from Text

with the Unified AkaneRE Event Extraction System [J],”

IEEE/ACM Transactions on Computing Biological Bioin-

formatics, Vol. 7, No. 3, 2010, pp. 442-453.

http://dx.doi.org/10.1109/TCBB.2010.46

[10] C. Plake, et al., “Optimizing Syntax Patterns for Disco-

vering Protein-Protein Interactions,” Proceedings of the

ACM Symposium on Applied Computing, 2005, ACM

Press, New York, pp. 195-201.

[11] S. Kim, et al ., “Wal k -Weighted Subsequence Kernels for

Protein-Protein Interaction Extraction [J],” BMC Bioin-

formatics, Vol. 11, 2010, p. 107.

http://dx.doi.org/10.1186/1471-2105-11-107

[12] R. Sæt re, et al., “Extracting Protein-Interactions fro m Text

with the Uniﬁed AkaneRE Event Extraction System [J],”

IEEE/ACM Transactions on Computational Biology and

Bioinformatics, Vol. 99, 2010, pp. 442-453.

http://dx.doi.org/10.1109/TCBB.2010.46