### Paper Menu >>

### Journal Menu >>

J. Biomedical Science and Engineering, 2008, 1, 200-203 Published Online November 2008 in SciRes. http://www.srpublishing.org/journal/jbise JBiSE Searching maximum quasi-bicliques from protein-protein interaction network Hong-Biao Liu1, Juan Liu1 & Lian Wang1 1School of Computer, Wuhan University, Wuhan 430079, China. Correspondence should be addressed to Juan Liu (liujuan@whu.edu.cn). Received September 10, 2008; revised September 28, 2008; accepted September 28, 2008 ABSTRACT Searching the maximum bicliques or bipartite subgraphs in a graph is a tough question. We proposed a new and efficient method, Searching Quasi-Bicliques (SQB) algorithm, to detect maximum quasi-bicliques from protein-protein interaction network. As a Divide-and-Conquer method, SQB consists of three steps: first, it divides the protein-protein interaction network into a number of Distance-2-Subgraphs; second, by combining top-down and branch-and-bound methods, SQB seeks quasi-bicliques from every Distance-2-Subgraph; third, all the redundant results are removed. We successfully applied our method on the Saccharomyces cerevisiae dataset and obtained 2754 distinct quasi-bicliques. Keywords: Searching Quasi-Bicliques algo- rithm; Quasi-biclique; Protein-Protein Interac- tion Network; Distance-2-Subgraph; Di- vide-and-Conquer method 1. INTRODUCTION As high-throughput technologies such as Yeast Two-Hybrid [1] and Affinity Purification/ Mass Spec- trometry [2] have made significant progress, human be- ings have collected a great number of protein-protein interaction datasets. It is meaningful to dig out substruc- tures from large-scale protein interaction data. Biclique, one kind of the substructures, is common in pro- tein-protein interaction network. Biclique often contains useful biologically meaningful units. For example, the biclique shown in Figure 1 indicates an “all-versus-all” predicted interaction subnetwork [3], Where most of the edges, each representing a protein-protein interac- tion,were approved by biological experiments. Further- more, six proteins on the left side all contain SH3 domain and four proteins on the right side are all with the SH3-binding motifs. Therefore mining biclique can help biologists unveil the cellular function at the molecular level. However, mining bicliques from graph (or pro- tein-protein interaction network in this study) is a com- putationally intensive work, and has been proven as NP complete [4, 5, 6]. Although many researchers [7, 8, 9, 10, 11] have developed some algorithms to solve the maxi- mum biclique problem, they often focused on some spe- cial characteristics of the graph, so the problem is still intractable. Therefore, in computational biology field, some researchers mined quasi-bicliques instead of exact bicliques. Li [12] used the “frequent pattern” developed by Agrawal [13] to find “all-versus-all subnetwork” (or quasi-biclique). The “existing closed itemset mining al- gorithms” (proposed by Agrawal [13]) only uses the size constraint on transaction sets to decrease search space, which brings a great number of small maximum bicliques and greatly influences the process speed. Here, we propose Searching Quasi-Bicliques (SQB) algorithm to detect maximum quasi-bicliques from pro- tein interaction network. By means of Di- vide-and-Conquer method, SQB partitions the pro- tein-protein interaction network into a mount of Dis tance-2-Subgraphs , each for one vertex, and only con- taining two kinds of nodes: those being connected with the vertex (we call them the direct neighbors), and those being reachable from the vertex by passing just one other node (we call them distance-2-neighbors). Next, through top-down and branch-and-bound methods, SQB tries to find the quasi-bicliques from all the Distance-2- Subgraphs. At last, SQB merges the redundant ones in the quasi-biclique clusters. We applied our algorithm on the Saccharomyces cerevisiae dataset and obtained 2754 distinct quasi-bicliques. Yfr024c Rvs167 Ysc84 Ypr154w Bbc1 Las17 Ypr171w Acf2 Ynl094w Ygr136w SH3 proteinsSH3-binding proteins Figure1. An all-versus-all predicted interaction subnetwork. SciRes Copyright © 2008 H. B. Liu / J. Biomedical Science and Engineering 1 (2008) 200-203 201 SciRes Copyright © 2008 JBiSE The organization of this paper is as follows. Section 2 states the maximum Quasi-biclique problem. Section 3 describes the SQB algorithm for finding the maximum quasi bicliques. Section 4 reports the results of the ap- plication of SQB algorithm on in a real proteomic data. The paper ends with conclusions and the future work. 2. MAXIMUM QUASI-BICLIQUE PROBLEM We use a simple graph like [14] to describe a pro- tein-protein interaction network. A vertex represents a kind of protein and an edge means there is an interaction between two kinds of proteins. Quasi-biclique is a graph G= (V, E), in which V can be divided into two non empty sets {V1, U1} and every vertex in V1 directly links to nearly every vertex in U1. The question of finding maximum quasi-biclique in a graph G= (V, E) can be formalized as following function. ,)),(( )),((max nmEVGf EVGf = subject to VUVVUV VmnUmVn ⊂⊂=∩ ≤+== 1,1,11 |,||,1||,1| φ where |V| denotes cardinality of the vertex set of the input graph, n and m should be greater than 1 and lower than |V|-2. A quasi-biclique is measured by the value of nm which actually is the number of interacting edges between two sets. In the following, we denote a quasi-bicluque as QB (V1, U1). 3. SQB ALGORITHM The main method of SQB is Divide-and-Conquer, which includes three parts. The first one is to seek every ver- tex’s Distance-2-Subgraph from a graph. The second one is to find every vertex’s quasi-biclique from its Dis- tance-2-Subgraph. The third one is to merge solutions: after finding every vertex’s quasi-bicliques, SQB puts all the quasi-bicliques together, removes the similar ones, prunes the smaller ones, and obtains the quasi-bicliques of the whole graph. The three parts of SQB are detailed in the following. 3.1. Finding Distance-2-Subgraph As some graph, especially the biological protein-protein interaction network, is very large, the process on the graph will need a very large memory space so it is not feasible in common applications. But it is obvious that the distance between any two vertexes in a quasi-biclique is not greater than 2. So if we want to find a quasi-biclique which in- cludes a specific vertex, we only need to consider the vertex and its related neighbors. The related neighbors are vertexes which are less than 3 in distance to the specific vertex. The induced subgraph, which consists of the ver- tex and its related neighbors, is denoted as Dis- tance-2-Subgraph. The edge status between any two ver- texes in an induced subgraph is the same as that in the original graph. SQB needs to find every vertex’s Dis- tance-2-Subgraph in order to obtain its maximum quasi-biclique. 3.2. Detecting Maximum Quasi-bicliques After finding every vertex’s Distance-2-Subgraph, SQB begins to find the quasi-biclique. This process, detecting quasi-bicliques, is the essential part of SQB. SQB uses the size (nm) to measure a quasi-biclique and it is crucial to know the specific value of n and m of a maximum quasi-biclique. As n and m are in a limited range, SQB tests the values of n and m from the upper limit to the lower one. If a graph has a quasi-biclique QB(|V|=n, |U|=m), the vertexes in the graph with degree lower than n and m should not be in the QB, so SQB removes these smaller vertexes during the process. Furthermore, if the test value of n and m are greater, SQB can remove more vertexes and increase the speed of the process. Before explaining our program, we introduce how to split a graph. We use a complex data structure CD to store the V1 set, U1 set and the induced subgraph G of V1 set and U1 set. The program splits the graph at the V1 set, and U1 set in turn. The program chooses the vertex in V1 or U1 with largest degree and labels it so that next time, the program avoids splitting at the same vertex again. For example, if the program chooses v15 as the candidate vertex, it then produces four sets V150, V151, U150, and U151. The first set V150 includes v15 and vertexes in V1 which has a distance of 2 to vertex v15. The second set V151 consists of elements in V1 except v15. The third set U150 contains vertexes in U1 which is the direct neighbor of vertex v15. The fourth set U151 is the same as U1. Next, the program produces induced graph G150 which con- tains vertexes V150 and U150, then puts V150, U150, G150 into data structure CD150(V150, U150, G150). In the same way, it gets another data structure CD151(V151, U151, G151). The algorithm of detecting quasi-bicliques is listed in the end of this subsection. The algorithm consists of a FOR loop that begins from 20 to 2. (20 is an experimental value which should be increased with the growing of nodes of the input graph). At first SQB uses the sub-function Search_k_Quasi_Bicliques(G, k) to test whether the graph contains quasi-bicliques in which the |V|>20 and |U|>20. If the sub-function finds it’s true, the FOR loop terminates, otherwise the FOR loop decreases the test value by one and continues to test, until the sub-function finds quasi-bicliques or the test value lower than 2. The sub-function func Search_k_Quasi_Bicliques(G, k) is the key component of SQB. At first, the input graph G’s vertex set is divided into two parts, V1 and U1. Next, V1 and U1, and G are put into complex data structure CD(V1, U1, G). CD is put into a buffer BUFFER. Next, the pro- gram go into a WHILE loop. This loop’s terminate con- dition is that the buffer is empty. During the loop, first, the program removes one element from BUFFER and puts it into CD0(V01, U01, G0), then deletes vertex in V01 with 202 H. B. Liu / J. Biomedical Science and Engineering 1 (2008) 200-203 SciRes Copyright © 2008 JBiSE degree lower than k and deletes vertex in U01 with degree lower than k. Next, the program judges whether CD0(V01, U01, G0) is a quasi-biclique. If the CD0 is a quasi-biclique and the size is greater than the current maximum value, the program outputs CD0 and puts the current maximum value as CD0’s size. If it is not a quasi-biclique and its size greater than current maximum value, the program splits the CD0 into two units (CD01, CD02) and puts them into the BUFFER. Otherwise, the program discards the left smaller ones. Part of SQB Algorithm Input. Graph G(V, E) Output. The maximum quasi-bicliques in G(V, E) func Search_Maximum_Quasi_Bicliques() = for tt from 20 to 2 step-1 do OUT= Search_k_Quasi_Bicliques(G, tt) If OUT is not empty Output OUT and Stop the program end for end func Search_Maximum_Quasi_Bicliques() func Search_k_Quasi_Bicliques(G, k)= Divide vertex set V into two sets, V1 and U1 Put V1, V2, G into Complex Data Structure CD(V1, U1, G) Put CD into BufferBUFFER While(BUFFER is not empty) Remove first element in BUFFER to CD0(V01, U01, G0) Delete vertex in V01 with degree lower than k Delete vertex in U01 with degree lower than k If CD0(V01, U01, G0) is Quasi_Bicliques then output CD0 Else if CD0(V01, U01, G0) is not empty then Split CD0 into CD01, CD02 at one unlabeled vertex put CD01, CD02 into BUFFER end while end func Search_k_Quasi_Bicliques() 3.3. Pruning Redundant Quasi-bicliques Through the above steps, SQB obtains every vertex’s quasi-bicliques. As some vertexes might have similar quasi-bicliques, SQB needs to remove redundant ones and obtains the overall distinct quasi-bicliques of the whole graph. SQB deletes one between any two quasi-bicliques QB1(V1, U1) and QB2(V2, U2) if they meet the follow- ing rules: If V2))(U1U2)((V1||2))(U1 )21((⊆∩⊆⊆ ∩ ⊆UVV is true, then delete QB1. If ))12()12((||))1(U2 )12((VUUVUVV ⊆∩⊆⊆ ∩ ⊆ is true, then delete QB2. The first rule means that if V1 is V2’s subset and U1 is U2’s subset, or if V1 is U2’s subset and U1 is V2’s subset, then QB1 is a part of QB2, QB1 can be deleted. The second rule is opposite to the first one. Otherwise, if two quasi-bicliques match neither of the above two rules, SQB keeps both of them. After the pruning operation, SQB obtains the distinct quasi-bicliques of the whole graph and the biggest one is the optimum one of the whole graph. 4. APPLICATION OF SQB The experiment was done on our web server which con- sisted of two Pentium 2 PCs with 4.8 GHZ CPU and 2G RAM. The Saccharomyces cerevisiae dataset Y78, de- rived from [15], consists of 78,390 protein-protein inter- actions, including 5321 proteins. During our experiments, we removed vertexes with degree 1 because they could not produce a biclique. The input graph of our program is with node number 4546. At first, we produced 4546 dis- tinct Distance-2-Subgraphs according to every vertex’s neighbors and their neighbors. The maximum subgraph is with 3164 nodes and the average value is 746.1. So the questions are very tough. About eighty percent of the vertexes have a process time less than 20 seconds and half of them are processed within 1 second. The average time is about 13.4 seconds. Giving the large input graphs, the performance is very remarkable. During our experiments we predicted 5616 quasi-bi- cliques which include empty or redundant ones. A small number of vertexes have more than one maximum quasi-bicliques, so the number of quasi-bicliques is greater than 4546, the number of Distance-2-Subgraphs. Table 1. The Maximum Quasi-Bicliques (YPL212C). |V|=26 YBL038W YBR251W YBR283C YDL136W YDL191W YEL050C YGL103W YGL123W YGR034W YGR220C YIL018W YIL021W YJL063C YLR075W YLR344W YLR378C YMR260C YNL178W YNL284C YNR037C YOL040C YOL127W YOR063W YPL131W YPL183W-A YPR110C |U|=62 YBL087C YBL091C YBR031W YBR048W YBR146W YCR031C YDL061C YDL083C YDL140C YDL202W YDR012W YDR025W YDR101C YDR116C YDR226W YDR237W YDR418W YDR450W YEL054C YER117W YER170W YFL001W YGL063W YGL068W YGL135W YGL147C YGR085C YHR147C YIL133C YJL177W YJL190C YJL191W YKL009W YKL024C YKL170W YKL180W YLR244C YLR340W YLR367W YLR388W YML010W YML025C YML026C YMR143W YMR158W YMR188C YNL067W YNL069C YNL081C YNL177C YNL185C YNL306W YOR116C YOR150W YOR151C YOR207C YOR341W YPL212C YPL220W YPR010C YPR102C YPR166C H. B. Liu / J. Biomedical Science and Engineering 1 (2008) 200-203 203 SciRes Copyright © 2008 JBiSE 0 200 400 600 800 1000 1200 1400 1600 1800 01000 20003000 4000 5000 Vertex sorted by Size Quasi-biclique Size Figure 2. The size of maximum quasi-biclique of every vertex The size of quasi-biclique is measured by the prod uct of n=|V| and m=|U|. Figure 2 shows the size of every vertex’s maximum quasi-biclique. The average value is 115.8. The largest one is with size 1612=26*62 which includes protein YPL212C and is listed in Table 1. Every vertex in the protein set V touches every one in the set U. During our experiments, we thought the quasi-biclique with size lower than 4*4 is easy to occur by random and should be discarded. At last we obtained 2754 distinct maximum quasi-bicliques which are avail- able on our website http://biod.whu.edu.cn/pub/QuasiBi- clique.txt. 5. CONCLUSION In this study, we developed Searching Quasi-Bicliques (SQB) algorithm to detect maximum quasi-biclique from a large scale protein-protein interaction network. SQB uses the Divide-and-Conquer method to process data. Combination of Top-down method and the branch-and-bound method greatly reduces the search space. We successfully applied it to the analysis of Yeast proteomic data and obtained many quasi-bicliques, which might provide meaningful clues to potential biological users. Although we obtained some quasi-bicliques, they might not be global optimum ones because the results are influenced by the choice of start splitting vertex. In addi- tion, we might integrate more biological features, such as the proteins’ functional category, to analyze the mecha- nism of the quasi-biclique in the cell. ACKNOWLEDGEMENTS This work is partially supported by the National Nature Science Foundation of China under grant 3010. REFERENCES [1] S. Fields and R. Sternglanz. (1994) The two-hybrid system: an assay for protein-protein interactions, Trends Genet, 10(8), 286-292. [2] A.Bauer and B. Kuster. (2003) Affinity purification-mass spec- trometry. Powerful tools for the characterization of protein com- plexes, Eur J Biochem, 270(4), 570-578. [3] A.H.Tong, B.Drees, et al.( 2002) A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules, Science, 295(5553), 321-324. [4] M.R.Garey, D.S. Johnson, et al.( 1976) Some simplified NP-complete graph problems, Theoretical Computer Science, 1(3): 237-267. [5] R.M. Karp. (1972) Reducibility among combinatorial problems, Complexity of Computer Computations, 43, 85-103. [6] S.Even and Y. Shiloach. (1975) NP-completeness of several ar- rangement problems, Dept. Computer Science, Technion, Haifa, Israel, Tech. Rep, 43. [7] J.A. Bondy and S.C. Locke (1986) Largest bipartite subgraphs in triangle-free graphs with maximum degree three, Journal of graph theory, 10(4), 477-504. [8] M. Grotschel and W.R. Pulleyblank (1981) Weakly bipartite graphs and the max-cut problem, Operations Research Letters, 1(23-27), 482. [9] F. Barahona. (1983) On some weakly bipartite graphs, Operations Research Letters, 5, 239242. [10] J.J. Hopfield and D.W. Tank (1985) "Neural" computation of decisions in optimization problems, Biological Cybernetics, 52(3), 141-152. [11] J.J. Hopfield. (1984) Neurons with Graded Response Have Collec- tive Computational Properties like Those of Two-State Neurons, Proceedings of the National Academy of Sciences of the United States of America, 81(10), 3088-3092. [12] H. Li, J. Li, et al.( 2006) Discovering motif pairs at interaction sites from protein sequences on a proteome-wide scale, Bioinfor- matics, 22(8), 989-996. [13] R. Agrawal and R. Srikant (1994) Fast Algorithms for Mining Association Rules in Large Databases, Proceedings of the 20th International Conference on Very Large Data Bases, 487-499. [14] N. Pržulj, (2004) Graph theory approaches to protein interaction data analysis, in Knowledge Discovery in High-Throughput Bio- logical Domains, Interpharm/CRC. [15] C. von Mering, R. Krause, et al. (2002) Comparative assessment of large-scale data sets of protein-protein interactions, Nature, 417(6887), 399-403. |