J. Biomedical Science and Engineering, 2010, 3, 1061-1068 (JBiSE)
doi:10.4236/jbise.2010.311137
Published Online November 2010 in SciRes (http://www.scirp.org/journal/jbise).

Gene finding by integrating gene finders

Yudong Cai¹, Zhisong He²,³, Lele Hu⁴, Bing Li⁴, Yi Zhou⁴, Han Xiao⁴, Zhiwen Wang⁴, Kairui Feng⁷, Lin Lu⁵, Kaiyan Feng⁶, Haipeng Li²

¹Institute of System Biology, Shanghai University, Shanghai, China; ²CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China; ³Department of Bioinformatics, College of Life Sciences, Zhejiang University, HangZhou, China; ⁴Department of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China; ⁵Department of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China; ⁶Division of Imaging Science & Biomedical Engineering, University of Manchester, Manchester, UK; ⁷Simcyp Limited, Blades Enterprise Centre, Sheffield, UK.
Email: cai_yud@yahoo.com.cn; lihaipeng@picb.ac.cn

Received 26 August 2010; revised 21 September 2010; accepted 30 September 2010.

ABSTRACT

Gene finding, the accurate annotation of genomic DNA, has become one of the central topics in biological research. Although various computational methods (gene finders) have been proposed and developed, each has its own limitations. In this paper, we introduce an integrating gene finder, which combines the results of several existing gene finders to improve the accuracy of gene finding. Four integration schemes, based on majority voting, are developed and analysed on two datasets: the basic dataset and the testing dataset. The basic dataset consists of 1500 DNA sequences and the testing dataset consists of 103 DNA sequences.
It is demonstrated that a simple integration (a simple vote for each nucleotide) can significantly improve gene-finding performance, and that removing confusing gene finders, whether they confuse through poor performance or through redundant results, is important for further improving the integration. The best prediction results are obtained using weighted majority voting, aided by the mRMR (Minimum Redundancy Maximum Relevance) method (Peng, 2005) for gene finder selection. The prediction accuracies are 84.16% and 90.06% for the basic dataset and the testing dataset respectively, which are better than those of any individual gene-finding software in our research.

Keywords: Gene Finding; Integration; mRMR

1. INTRODUCTION

Genomes have been sequenced very rapidly in recent years owing to advances in sequencing techniques. However, the interpretation of the sequences, including the accurate annotation of genomic DNA, has proved to be a much more difficult task. Although experiments can be performed to detect genes, the process is lengthy and tedious. The most important and widely used gene-finding methods today are computational ones, whose main problem is a lack of accuracy. These gene finders rely on various mechanisms. Some are evidence-based, i.e., they use known genes or protein sequences to search for unknown genes, e.g. Genie [1] and GeneParser3 [2]. Others use intrinsic genetic signals such as splice sites and start and stop codons, e.g. Genie, GeneMark [3], GeneID [4], and Genscan [5]. As far as the algorithms are concerned, some use artificial neural networks, such as Genie; many use Hidden Markov Models (HMM), such as Genie, Genscan and Fgenesh [6]; some use dynamic programming, such as Fgenesh and Genie; and some are aided by BLAST [7], such as Twinscan [8].
Although the gene finders overlap considerably in algorithm and strategy, they also differ substantially in their algorithms and detailed operations. In this paper, we introduce a voting strategy that integrates the results of the gene finders to strengthen gene-finding and prediction capability.

Y. D. Cai et al. / J. Biomedical Science and Engineering 3 (2010) 1061-1068. Copyright © 2010 SciRes. JBiSE

The Condorcet Jury Theorem states that, under suitable conditions, the majority judgment of a committee is superior to the judgments of its individual members [9]. The same idea applies to prediction and classification algorithms: many investigations have found that integrating decisions from hybrid decision (classification) algorithms can significantly improve prediction, classification and recognition performance. These investigations include character and handwriting recognition [10-14], image analysis and segmentation [11,15,16], automated credit card slip processing [17], speaker identification [18], and other applications [19-21]. All these multi-predictor integrations make use of all the candidate predictors. Selecting all candidate predictors is simple and straightforward. However, similar predictors may reinforce each other and dominate the decisions: if the same algorithm is used twice, the decision will be biased towards that algorithm, and if too many algorithms of the same kind are involved, the decision will be biased towards that kind of algorithm. To avoid this problem, redundant gene finders (those whose redundancy outweighs their complementary capability) should be removed from the integration. The mRMR method [22], originally developed for feature selection and analysis, is applied here to the selection of gene finders. In addition, a better-performing gene finder should be weighted higher, and vice versa.
Therefore, we first select complementary gene finders with the mRMR method, and then integrate the selected gene finders by weighted majority voting. The weight of each gene finder is set to its AC value (defined in Materials and Methods), which represents its gene-finding accuracy.

In this research, we choose eight well-known, classical gene-finding programs from genefinding.org (http://www.genefinding.org/software.html; see below for details). To compare a simple integration system with a further refined one, four schemes are implemented: Simple Majority Voting with Simple Software Selection (SMV_SSS), Weighted Majority Voting with Simple Software Selection (WMV_SSS), Simple Majority Voting with mRMR Software Selection (SMV_MSS) and Weighted Majority Voting with mRMR Software Selection (WMV_MSS). In our research, WMV_MSS performed best: its prediction accuracies reached 84.16% and 90.06% for the basic dataset and the testing dataset. As noted, the weight of each gene finder is its gene-finding accuracy, defined below. Simple software selection takes the programs in order of accuracy without considering redundancy, whereas mRMR software selection considers redundancy between programs as well as accuracy. These schemes are introduced in detail below.

2. MATERIALS AND METHODS

2.1. Data Preparation

The datasets we study consist of two parts: the basic dataset and the testing dataset. The basic dataset is taken from the University of Maryland Center for Bioinformatics and Computational Biology (http://www.cbcb.umd.edu/research/genefinding.shtml#genedata) and contains 1500 human sequences. Of these, 767 are translated on the forward strand and the other 733 on the reverse strand. The DNA sequences in the HMR195 dataset (http://www.cs.ubc.ca/~rogic/evaluation/dataset.html) are taken as the testing dataset.
This dataset contains 195 gene sequences, of which 103 are from human, 82 from mouse, and the remaining 10 from rat. All 195 sequences are translated on the forward strand and were entered into GenBank after August 1997. The basic dataset is used for testing and for obtaining the TAC values (see "Accuracy Test" below) of each gene finder, while the testing dataset is used only for testing the integration. Because the TAC values are fed back into the integration through weighting in some integration schemes, the gene-finding results might be biased when the basic dataset is used. However, since the finding accuracy is rather stable, especially with a large dataset, the bias is negligible. As a check, the testing dataset is used independently, with the gene finder accuracies taken from the basic dataset.

2.2. Software for the Gene-Finding

In this study, we chose 8 gene-finding programs from genefinding.org: Augustus [23], Fgenes [6], Fgenesh [6], Genscan [5], Geneid [4], Genie [1], HMMgene [24] and Twinscan [8]. Please refer to Supplemental Material 1 for a brief description of these 8 programs.

2.3. Data Encoding and Decoding

The predicted protein sequences need to be encoded as numbers so that they can be integrated. Because the protein sequences are translated from the DNA sequences, we code the DNA sequences in such a way that the predicted protein sequence can be uniquely decoded from the coded DNA sequence. In the coding method, the numbers 0, 1, 2 and 3 represent the state of every nucleotide in a sequence. Number 0 means that the nucleotide is in a noncoding region, while numbers 1, 2 and 3 mean that the nucleotide is in a codon: 1 for the first position of the codon, 2 for the second position, and 3 for the last position.
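The coding just described, together with the decoding step of this section, can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function names `encode_states` and `decode_codons`, the 0-based end-exclusive interval convention, and the restriction to the forward strand are all assumptions made here. The decoder drops nucleotides that do not form a complete "1, 2, 3" codon, as in the incomplete-frame fragment discussed in Section 2.7.

```python
from typing import List, Tuple


def encode_states(seq_len: int, cds_exons: List[Tuple[int, int]]) -> List[int]:
    """Label each nucleotide 0 (noncoding) or 1/2/3 (position within a codon).

    cds_exons is a hypothetical input format: 0-based, end-exclusive coding
    intervals on the forward strand, given in transcript order.
    """
    states = [0] * seq_len
    phase = 0  # codon phase carried across exon boundaries
    for start, end in cds_exons:
        for pos in range(start, end):
            states[pos] = phase + 1
            phase = (phase + 1) % 3
    return states


def decode_codons(dna: str, states: List[int]) -> List[str]:
    """Remove state-0 nucleotides, then keep only complete 1-2-3 codons."""
    coding = [(base, s) for base, s in zip(dna, states) if s != 0]
    codons = []
    i = 0
    while i < len(coding):
        window = coding[i:i + 3]
        if [s for _, s in window] == [1, 2, 3]:
            codons.append("".join(base for base, _ in window))
            i += 3
        else:
            i += 1  # nucleotide belongs to an incomplete reading frame; skip it
    return codons


if __name__ == "__main__":
    print(encode_states(10, [(2, 5), (7, 9)]))
    print(decode_codons("GACAGATG", [1, 2, 3, 1, 2, 1, 2, 3]))
```

Translating the resulting codons into amino acids would additionally need a codon table, which is omitted here to keep the sketch short.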
The encoded sequence can easily be decoded and translated back into a protein sequence: first, nucleotides marked with 0 are removed; then every codon (three consecutive nucleotides marked "1, 2, 3") is translated into an amino acid. In summary, every DNA sequence is coded according to the predicted sequence from each program; the eight coded sequences from the eight programs are then integrated using the four integration schemes described below; finally, the integrated coded sequence is decoded to obtain the final translated protein sequence.

2.4. Accuracy Test

In our study, we used ClustalW2 [25] to align the real protein sequence with the predicted protein sequence. Based on this alignment, the AC value, which represents the accuracy of the gene finding, is calculated as follows.

Assume the number of amino acids matched in the alignment is $n$, the length of the real protein coded by the sequence is $l_R$, the length of the predicted protein is $l_P$, and the length of the longest protein that could possibly be translated from a DNA sequence of this length is $l_M$. The approximate correlation [26] is used to represent the gene-finding accuracy in this research, and can be estimated as follows:

$$TP = n, \quad FP = l_P - n, \quad TN = l_M - l_P - l_R + n, \quad FN = l_R - n$$

The average conditional probability (ACP) is defined as:

$$ACP = \frac{1}{4}\left(\frac{TP}{TP+FN} + \frac{TP}{TP+FP} + \frac{TN}{TN+FN} + \frac{TN}{TN+FP}\right)$$

The Approximate Correlation (AC) of the gene finding is calculated as:

$$AC = 2\,(ACP - 0.5)$$

AC = -1, 0 and 1 indicate a totally wrong, a random, and a perfect finding, respectively.
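The accuracy measure defined above can be sketched in Python. The function name `approximate_correlation` is a hypothetical helper introduced here, and the sketch assumes that all four denominators are nonzero. Evaluated on the ENST00000013916 figures quoted in this section ($n = 550$, $l_R = 558$, $l_P = 908$, $l_M = 22214/3$), the formulas as written give ACP ≈ 0.8845 and AC = 2(ACP - 0.5) ≈ 0.7689, so the value 0.8845 quoted for that example appears to coincide with the ACP term. That reading is an observation from this sketch, not something the text states.

```python
from typing import Tuple


def approximate_correlation(n: float, l_r: float, l_p: float, l_m: float) -> Tuple[float, float]:
    """Return (ACP, AC) from alignment counts, following Section 2.4.

    n   : number of matched amino acids in the alignment
    l_r : length of the real protein
    l_p : length of the predicted protein
    l_m : length of the longest protein the DNA could encode (DNA length / 3)
    Assumes every denominator below is nonzero.
    """
    tp = n
    fp = l_p - n
    fn = l_r - n
    tn = l_m - l_p - l_r + n
    acp = 0.25 * (
        tp / (tp + fn)
        + tp / (tp + fp)
        + tn / (tn + fn)
        + tn / (tn + fp)
    )
    ac = 2.0 * (acp - 0.5)
    return acp, ac


if __name__ == "__main__":
    acp, ac = approximate_correlation(n=550, l_r=558, l_p=908, l_m=22214 / 3)
    print(f"ACP = {acp:.4f}, AC = {ac:.4f}")
```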
Assume the number of sequences in the dataset is $N$; the total accuracy (TAC) of a program or integration scheme is then:

$$TAC = \frac{1}{N}\sum_{i=1}^{N} AC_i$$

Take the sequence "ENST00000013916" in the basic dataset, predicted by Genscan, as an example (see Supplemental Material 2 for further detail). The length of this DNA sequence is 22214 bp, so the length of the longest protein that could possibly be translated from it is $l_M = 22214/3 = 7404.67$. From the CDS (Coding Sequence) of the DNA, we obtain the real protein sequence, whose length is $l_R = 558$. The sequence predicted by Genscan is also translated into a protein sequence, whose length is $l_P = 908$. ClustalW2 is used to align these two protein sequences and obtain the number of matching amino acids, which is $n = 550$ (please refer to Supplemental Material 2 for the alignment results). With all these parameters, the approximate correlation is calculated as $AC = 0.8845$.

2.5. mRMR (Minimum Redundancy Maximum Relevance)

The Maximum Relevance, Minimum Redundancy method (mRMR) [22] was originally developed by Peng et al. for microarray data processing. The mRMR method requires the input data to be numeric vectors, each of which is taken as an mRMR feature. mRMR ranks each feature according to both its relevance to the target (how strongly it is related to the prediction accuracy) and its redundancy with the other features. A "good" feature is characterized by maximum relevance to the target variable and minimum redundancy with the other features. Both relevance and redundancy are defined by mutual information (MI), which estimates how much one vector is related to another. MI is defined as follows:

$$I(x, y) = \iint p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}\,dx\,dy \qquad (1)$$

where $x$ and $y$ are two vectors, $p(x, y)$ is the joint probability density, and $p(x)$ and $p(y)$ are the marginal probability densities.

Let $\Omega$ denote the whole vector set. The already-selected vector set with $m$ vectors is denoted by $\Omega_s$, and the to-be-selected vector set with $n$ vectors is denoted by $\Omega_t$.
The relevance $D$ of a feature $f$ in $\Omega_t$ to a target variable $c$ can be computed by Eq.2:

$$D = I(f, c) \qquad (2)$$

The redundancy $R$ of a feature $f$ in $\Omega_t$ with all the features in $\Omega_s$ can be computed by Eq.3:

$$R = \frac{1}{m}\sum_{f_i \in \Omega_s} I(f, f_i) \qquad (3)$$

To maximize relevance and minimize redundancy, the mRMR function is obtained by combining Eq.2 and Eq.3:

$$\max_{f_j \in \Omega_t}\left[I(f_j, c) - \frac{1}{m}\sum_{f_i \in \Omega_s} I(f_j, f_i)\right], \quad j = 1, 2, \ldots, n \qquad (4)$$

Let the initial $\Omega_s = \{f_i\}$, where $f_i$ is the vector from the best-performing gene finder, and let $\Omega_t = \{f_1, f_2, \ldots, f_{i-1}, f_{i+1}, \ldots, f_n\}$ be obtained by excluding only $f_i$. Eq.4 is then applied repeatedly, selecting one vector per round for a total of $n-1$ rounds, which results in a vector list in selection order

$$S = \{f'_0, f'_1, \ldots, f'_h, \ldots, f'_{N-1}\}$$

where $h$ denotes the round in which the feature is selected.

In this research, the mRMR method is used to rank and select the 8 programs. The predicted results are coded with the numbers 0, 1, 2, 3, as described above, and thus become numeric vectors. The real coded protein sequence is regarded as the target vector, and each gene finder contributes one numeric vector. From these vectors the 8 programs can be ranked by the mRMR method. How the ranked programs are used for the integration is described in the following sections.

2.6. Algorithm of Integration Scheme

Four integration schemes are introduced in this section.

2.6.1. SMV_SSS and SMV_MSS

SMV_SSS (Simple Majority Voting with Simple Software Selection), as its name suggests, is the simplest integration scheme in this research. The coded real sequence and the coded predicted sequences are the input vectors of SMV_SSS. There are therefore 9 vectors: 8 come from the 8 programs (called software vectors), and the remaining one is the coded real protein sequence (called the real vector). The AC value (described above) can be calculated for each software vector.
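The greedy ranking of Section 2.5 (Eq.1 to Eq.4) can be illustrated for discrete state vectors with the short Python sketch below. It is not the mRMR program used in this study: it assumes a plug-in (frequency-count) estimate of mutual information in bits and the difference form of Eq.4, and the names `mutual_information` and `mrmr_rank` as well as the toy vectors are hypothetical.

```python
from collections import Counter
from math import log2
from typing import Dict, List, Sequence


def mutual_information(x: Sequence[int], y: Sequence[int]) -> float:
    """Plug-in estimate of I(x, y) in bits for two equal-length discrete vectors."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum(
        (c / n) * log2(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def mrmr_rank(features: Dict[str, Sequence[int]],
              target: Sequence[int],
              first: str) -> List[str]:
    """Greedy mRMR ordering that starts from the feature named `first`.

    Each round picks the remaining feature maximizing
    relevance(f, target) - mean redundancy(f, already selected), as in Eq.4.
    """
    selected = [first]
    remaining = [name for name in features if name != first]
    while remaining:
        def score(name: str) -> float:
            relevance = mutual_information(features[name], target)
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    target = [0, 1, 2, 3, 0, 1, 2, 3]
    toy = {
        "A": [0, 1, 2, 0, 0, 1, 2, 0],  # good predictor that misses state 3
        "B": [0, 1, 2, 0, 0, 1, 2, 0],  # exact duplicate of A (fully redundant)
        "C": [0, 0, 0, 3, 0, 0, 0, 3],  # weaker, but complements A on state 3
    }
    print(mrmr_rank(toy, target, first="A"))
```

In the toy data, the duplicate predictor B is ranked after the weaker but complementary predictor C, which is the behaviour the redundancy term is meant to produce.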
First, a software list is obtained simply by sorting on the AC values,

$$S = \{f'_0, f'_1, \ldots, f'_h, \ldots, f'_{N-1}\}$$

in which a higher position corresponds to a higher AC value. Next, we initialize the software set as $S_1 = \{f'_0\}$ and then add the programs in the list one by one to this initial set, resulting in a series of software sets $S_1, S_2, \ldots, S_N$ ($N = 8$), where $S_i = \{f'_0, f'_1, \ldots, f'_{i-1}\}$, e.g. $S_4 = \{f'_0, f'_1, f'_2, f'_3\}$. Because $S_2 = \{f'_0, f'_1\}$ contains only two programs, the integration result $AC(S_2)$ is set to the AC value of program $f'_0$, i.e., $AC(S_2) = AC(f'_0)$. For the remaining software sets $S_i$ ($i = 3, \ldots, 8$), the state of each nucleotide (coded by 0, 1, 2 or 3) is set to the state that gains the majority of votes from the programs in the set. This can be formulated as follows. Let $p_j$ denote the state of a nucleotide $n$ predicted by program $f'_j$. A counting function $X_s(a)$ is defined as

$$X_s(a) = \begin{cases} 1, & a = s \\ 0, & a \neq s \end{cases}$$

where $a$ and $s$ are state numbers 0, 1, 2 or 3. The total voting count for a state $s$ is defined as

$$C_s = \sum_{j=0}^{i-1} X_s(p_j) \qquad (5)$$

The integrated state for nucleotide $n$ using software set $S_i$ is set to

$$S_i(n) = \arg\max_{s \in \{0, 1, 2, 3\}} C_s$$

which is the state that gains the majority of votes from all the programs in $S_i$. If two or more states gain the same number of votes, the state supported by the best program is chosen. The whole sequence can thus be integrated using $S_i$, and the resulting AC value is denoted $AC(S_i)$. The AC value obtained by SMV_SSS is set to the highest AC value obtained over all the software sets $S_i$ ($i = 2, \ldots, 8$):

$$AC_{SMV\_SSS} = \max_{i = 2, \ldots, N} AC(S_i)$$

The process of SMV_MSS (Simple Majority Voting with mRMR Software Selection) is exactly the same as that of SMV_SSS, except that the software list is provided by the mRMR method (please refer to 2.5). Unlike SMV_SSS, SMV_MSS considers not only the performance of each program but also the redundancy between the programs.
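The per-nucleotide vote can be sketched in Python as follows. The helper names `tally`, `vote` and `integrate` are hypothetical, and the tie-break is one possible reading of "the state supported by the best program": among the tied states, the sketch returns the one predicted by the highest-ranked program. Passing AC values as weights gives the weighted variant used by the WMV schemes. Assuming the eight states of the 28th nucleotide in Section 2.7 are listed in the same order as the AC values quoted there, the sketch reproduces the weighted scores 3.5448 and 2.3147 given in that example.

```python
from typing import Dict, List, Optional, Sequence

STATES = (0, 1, 2, 3)


def tally(states: Sequence[int],
          weights: Optional[Sequence[float]] = None) -> Dict[int, float]:
    """Sum the votes (or weights) received by each state for one nucleotide."""
    if weights is None:
        weights = [1.0] * len(states)  # simple majority voting
    scores = {s: 0.0 for s in STATES}
    for state, weight in zip(states, weights):
        scores[state] += weight
    return scores


def vote(states: Sequence[int],
         weights: Optional[Sequence[float]] = None) -> int:
    """Return the winning state; states[0] comes from the best-ranked program."""
    scores = tally(states, weights)
    top = max(scores.values())
    tied = {s for s, v in scores.items() if v == top}
    # Assumed tie-break: the first (best-ranked) program voting for a tied state wins.
    for state in states:
        if state in tied:
            return state
    raise ValueError("no predictions supplied")


def integrate(predictions: Sequence[Sequence[int]],
              weights: Optional[Sequence[float]] = None) -> List[int]:
    """Integrate coded sequences; predictions[j] is the vector from program j."""
    length = len(predictions[0])
    return [vote([p[k] for p in predictions], weights) for k in range(length)]


if __name__ == "__main__":
    states_28 = [1, 0, 1, 1, 0, 1, 1, 0]
    ac = [0.8000, 0.7437, 0.6487, 0.6224, 0.8008, 0.7579, 0.7158, 0.7702]
    print(tally(states_28))       # simple vote counts
    print(tally(states_28, ac))   # AC-weighted scores
    print(vote(states_28, ac))
```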
Thus the next program added to the software set is chosen on the basis of both performance and redundancy.

2.6.2. WMV_SSS and WMV_MSS

WMV_SSS (Weighted Majority Voting with Simple Software Selection) is similar to SMV_SSS except that each program is weighted by its AC value rather than equally. The total voting count in Eq.5 becomes

$$C_s = \sum_{j=0}^{i-1} X_s(p_j)\,AC(f'_j)$$

that is, each vote is weighted by the AC value. The rest is exactly the same as in SMV_SSS.

WMV_MSS proceeds analogously: the software list is supplied by the mRMR method and each vote is weighted by the AC value. WMV_MSS is the most refined integration scheme because it weights the programs by their AC values after the software set has been optimized by the mRMR method.

2.7. An Example for the Integration Methods

This section demonstrates how to encode and integrate the predictions of the individual predictors for sequence AB010281 in the HMR195 dataset, and finally how to obtain the amino acid sequence using our integration methods. Encoding the prediction results of all 8 programs gives the state matrix of this DNA sequence, in which each vector is the coded protein sequence from one gene finder (see Supplemental Material 3).

Let us first demonstrate how to integrate the results using SMV_SSS. The votes for each nucleotide are first counted for each software set. For instance, the predicted states of the 28th nucleotide are:

1 0 1 1 0 1 1 0

where each state comes from one of the 8 programs. If all 8 programs are selected, state 1 appears 5 times and state 0 appears 3 times. State 1 therefore gets the majority of votes, and the integrated state of the 28th nucleotide is set to 1. In this way, we obtain the whole state sequence using the SMV_SSS scheme when all programs are included. Similarly, the votes can be counted when 3 to 8 programs are involved. Note that the AC values for the 8 programs are 0.8000 (Fgenesh), 0.7437 (Fgenes), 0.6487 (HMMgene), 0.6224 (Genie), 0.8008 (Genscan), 0.7579 (Geneid), 0.7158 (Augustus) and 0.7702 (Twinscan), and that programs are added in order of descending AC value.

For WMV_SSS, each program is weighted by its AC value. Again taking the 28th nucleotide as an example, the score of state 1 is 3.5448 and the score of state 0 is 2.3147 when all 8 programs are included, so the integrated state of this nucleotide is 1. In this way, we obtain the whole state sequence using the WMV_SSS scheme (see Supplemental Material 3).

The mRMR software selection schemes proceed in the same way as the simple software selection schemes, except that the programs are added in a different order (refer to Table 2 for the list).

With the integrated state sequence and the DNA sequence, we can obtain the predicted mRNA sequence. First, all nucleotides with state 0 are removed; then the nucleotides belonging to an incomplete reading frame are removed. For instance, after removing all 0s from the SMV_SSS integrated state sequence with 8 programs, we find a state fragment "…12312123…" corresponding to the DNA fragment "…GACAGATG". The reading frame "AG" is incomplete, so these two nucleotides are removed. In this way, we obtain the final predicted mRNA sequence (see Supplemental Material 3). The mRNA sequence is translated into a protein sequence by translating each codon into an amino acid.

3. RESULTS AND DISCUSSIONS

3.1. Prediction Results of the 8 Gene-Finding Softwares

The basic dataset (1500 sequences) and the testing dataset (103 sequences) are input into the 8 gene-finding programs (described above), which produce the predicted protein sequences. The TAC values (defined above) are calculated to rate the performance of the programs and are shown in Table 1. The TAC values obtained on the basic dataset are used as the weights of the corresponding programs when integrating with schemes WMV_SSS and WMV_MSS.

3.2. Result of mRMR

The feature vectors input into the mRMR method consist of all the predicted results from the 8 programs for the 1500 sequences (each sequence contains thousands of nucleotides) and the coded real protein sequence. Joining the coded sequences together results in nine 1405899-dimensional vectors. Since the complementary capability of each gene finder is the main focus, a dimension is deleted, using the Linux command "uniq", if the corresponding nucleotide has the same state number in all 9 vectors. After the deletion, 94700-dimensional vectors remain as the input for the mRMR method (see Supplemental Material 4 for details).

The mRMR program used in this contribution was downloaded from http://research.janelia.org/peng/proj/mRMR/. As all of the input vectors are integer vectors, we specify the parameter t = 0 in the mRMR program to handle the integer calculation. The mRMR method outputs the ranked list of programs (see Supplemental Material 5 for the parameter settings of the mRMR program). The mRMR list is shown in Table 2 and is used in the integration of the programs as described above.

3.3. Results of SMV_SSS, WMV_SSS, SMV_MSS and WMV_MSS

Seven TAC values are obtained for each integration scheme, since each scheme integrates two to eight gene finders. The best of the seven TAC values is taken as the TAC value of the integration scheme. Table 3 shows all the integration results. WMV_MSS performs best on both the basic dataset and the testing dataset, with TAC values of 84.16% and 90.06% when six programs are integrated. Figure 1 plots all seven TAC values for the four integration schemes.

Table 1. The performance of 8 gene-finding software.

| No. | Software | Training Set | Test Set |
|---|---|---|---|
| 1 | Genscan | 80.08% | 84.50% |
| 2 | Fgenesh | 80.00% | 86.43% |
| 3 | Twinscan | 77.02% | 82.52% |
| 4 | Geneid | 75.79% | 76.45% |
| 5 | Fgenes | 74.37% | 77.12% |
| 6 | Augustus | 71.58% | 85.49% |
| 7 | HMMgene | 64.87% | 68.51% |
| 8 | Genie | 62.24% | 79.40% |

Table 2. The rank of 8 gene-finding software in mRMR.

| Rank | Software |
|---|---|
| 1 | Fgenesh |
| 2 | Fgenes |
| 3 | HMMgene |
| 4 | Genie |
| 5 | Genscan |
| 6 | Geneid |
| 7 | Augustus |
| 8 | Twinscan |

Table 3. The performance of four predictors.

| Predictor | Number of software integrated | Accuracy (Training Set) | Accuracy (Test Set) |
|---|---|---|---|
| The best individual software | | 80.08% | 86.43% |
| SMV_SSS | 2 | 80.08% | 86.43% |
| SMV_SSS | 3 | 81.30% | 87.29% |
| SMV_SSS | 4 | 81.94% | 86.23% |
| SMV_SSS | 5 | 83.55% | 89.90% |
| SMV_SSS | 6 | 83.14% | 88.32% |
| SMV_SSS | 7 | 83.34% | 89.05% |
| SMV_SSS | 8 | 83.24% | 88.85% |
| WMV_SSS | 2 | 80.08% | 86.43% |
| WMV_SSS | 3 | 81.30% | 87.29% |
| WMV_SSS | 4 | 81.35% | 87.29% |
| WMV_SSS | 5 | 83.47% | 89.87% |
| WMV_SSS | 6 | 83.46% | 90.04% |
| WMV_SSS | 7 | 83.30% | 89.07% |
| WMV_SSS | 8 | 83.23% | 89.15% |
| SMV_MSS | 2 | 80.00% | 86.43% |
| SMV_MSS | 3 | 83.24% | 89.84% |
| SMV_MSS | 4 | 82.98% | 89.77% |
| SMV_MSS | 5 | 83.94% | 88.27% |
| SMV_MSS | 6 | 83.90% | 88.85% |
| SMV_MSS | 7 | 84.00% | 89.84% |
| SMV_MSS | 8 | 83.24% | 88.85% |
| WMV_MSS | 2 | 80.00% | 86.43% |
| WMV_MSS | 3 | 83.24% | 89.84% |
| WMV_MSS | 4 | 82.98% | 89.77% |
| WMV_MSS | 5 | 84.06% | 88.63% |
| WMV_MSS | 6 | 84.16% | 90.06% |
| WMV_MSS | 7 | 84.06% | 89.84% |
| WMV_MSS | 8 | 83.23% | 89.15% |

4. DISCUSSIONS

From Table 3, we can see that nearly all integrated gene finders are better than the best single gene finder, the exception being the integration of four programs using SMV_SSS on the testing dataset. Integrated gene finders are therefore significantly better than a single gene finder, which indicates that the gene finders complement each other.

From Figure 1(a), WMV_MSS performs consistently better than the other integration schemes on the basic dataset, and in Figure 1(b) WMV_MSS performs better than the other integration schemes in the majority of cases. Therefore, WMV_MSS is the best integrator among the four. SMV_MSS is the second best integrator for the basic dataset, since in Figure 1(a) it performs consistently better than the remaining two integration schemes. For the testing dataset in Figure 1(b), it is hard to tell whether SMV_MSS or WMV_SSS is better. Nor can we tell from Figure 1 whether WMV_SSS or SMV_SSS is better for the basic dataset. SMV_MSS performs better than SMV_SSS consistently in Figure 1(a), and in the majority of cases in Figure 1(b).
Therefore, mRMR software selection does improve the integration results, since WMV_MSS is better than WMV_SSS and SMV_MSS is better than SMV_SSS. Weighting contributes slightly to the integration, since WMV_MSS is slightly better than SMV_MSS on both the basic and the testing dataset, and WMV_SSS is slightly better than SMV_SSS on the testing dataset. In SMV_MSS and WMV_MSS, the prediction performance becomes worse when the last two programs, Twinscan and Augustus, are added to the integration (see Table 3 and Figure 1). Twinscan is the 3rd best program. However, it can be regarded as a combination of Genscan and BLAST (see Supplemental Material 1 for further detail). Genscan and the other programs must therefore already cover the prediction capacity provided by Twinscan, which makes Twinscan redundant. Augustus is in the 6th position in the software list according to the TAC values on the basic dataset (Table 1), and its algorithms are similar to those of some of the other programs, so Augustus is also removed by the mRMR criteria.

Figure 1. The performance of the 4 integration schemes on (a) the training set and (b) the test set. It plots all seven TAC values for the four integration schemes.

5. CONCLUSIONS

We introduce several schemes for integrating gene finders in this investigation. The results indicate that these gene finders are able to complement each other and that a simple combination of them performs significantly better than the best individual gene finder. The mRMR (minimum redundancy and maximum relevance) method proves to be very important for further improving the prediction performance. Assigning higher weights to better-performing gene finders can improve the integration results slightly, but not as significantly as the mRMR software selection.
A further improvement is expected if more complementary gene finders are included in the fusion, which should be investigated in future research.

6. ACKNOWLEDGEMENTS

This work is supported by the basic research grant of the Chinese Academy of Sciences (KSCX2-YWR-112).

REFERENCES

[1] Kulp, D., Haussler, D., Reese, M.G. and Eeckman, F.H. (1996) A generalized Hidden Markov Model for the recognition of human genes in DNA. Intelligent Systems for Molecular Biology, 4(2), 134-142.
[2] Snyder, E. and Stormo, G. (1995) Identification of protein coding regions in genomic DNA. Journal of Molecular Biology, 248, 1-18.
[3] Borodovsky, M. and McIninch, J.G. (1993) Parallel gene recognition for both DNA strands. Computational Chemistry, 17, 123-133.
[4] Guigo, R., Knudsen, S., Drake, N. and Smith, T.F. (1992) Prediction of gene structure. Journal of Molecular Biology, 226, 141-157.
[5] Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268, 78-94.
[6] Salamov, A. and Solovyev, V. (2000) Ab initio gene finding in drosophila genomic DNA. Genome Research, 10, 516-522.
[7] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. Journal of Molecular Biology, 215, 403-410.
[8] Korf, I., Flicek, P., Duan, D. and Brent, M.R. (2001) Integrating genomic homology into gene structure prediction. Bioinformatics, 17, S140-S148.
[9] Condorcet, N.C. (1785) Essai sur l’application de l’analyse à la probabilité des decisions rendues à la pluralité des voix. Imprimerie Royale, Paris.
[10] Huang, Y.S. and Suen, C.Y. (1995) A method of combining multiple experts for the recognition of unconstrained handwritten numerals. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17, 90-94.
[11] Lam, L., Huang, Y.S. and Suen, C.Y. (1997) Combination of multiple classifier decisions for optical character recognition. In: Bunke, H. and Wang, P.S.P., Eds., Handbook of Character Recognition and Document Image Analysis, World Scientific Publishing Company, New Jersey, 79-101.
[12] Suen, C.Y., Nadal, C., Mai, T.A., Legault, R. and Lam, L. (1990) Recognition of totally unconstrained handwritten numerals based on the concept of multiple experts. Proceedings of IWFHR, Montreal, 131-143.
[13] Stajniak, A., Szostakowski, J. and Skoneczny, S. (1997) Mixed neural-traditional classifier for character recognition. Proceedings of SPIE - The International Society for Optical Engineering, 2949, 102-110.
[14] Rahman, A.F.R., Alam, H. and Fairhurst, M.C. (2002) Multiple classifier combination for character recognition: Revisiting the majority voting system and its variation. Lecture Notes in Computer Science, 2324, 167-178.
[15] Ho, T.K., Hull, J.J. and Srihari, S.N. (1992) Combination of decisions by multiple classifiers. In: Baird, H.S., Bunke, H. and Yamamoto, K., Eds., Structured Document Image Analysis, Secaucus, Springer-Verlag Inc., New York, 188-202.
[16] Rohlfing, T., Russakoff, D.B. and Maurer, C.R. (2004) Performance-based classifier combination in atlas-based image segmentation using expectation-maximization parameter estimation. IEEE Transactions on Medical Imaging, 23, 983-994.
[17] Paik, J., Jung, S. and Lee, Y. (1993) Multiple combined recognition system for automatic processing of credit card slip applications. Proceedings of the 2nd International Conference on Document Analysis and Recognition, IEEE Computer Society Press, California, 520-523.
[18] Altincay, H. and Demirekler, M. (2000) An information theoretic framework for weight estimation in the combination of probabilistic classifiers for speaker identification. Speech Communication, 30, 255-272.
[19] Lam, L. and Suen, C.Y. (1997) Application of majority voting to pattern recognition: An analysis of its behavior and performance. IEEE Transactions on Pattern Analysis, 27, 553-568.
[20] Lam, L. and Suen, C.Y. (1997) A theoretical-analysis of the application of majority voting to pattern-recognition. Jerusalem, Israel, 418-420.
[21] Rahman, A.F.R. and Fairhurst, M.C. (1997) Exploiting second order information to design a novel multiple expert decision combination platform for pattern classification. Electronics Letters, 33, 476-477.
[22] Peng, H.C., Long, F.H. and Ding, C. (2005) Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 1226-1238.
[23] Stanke, M. and Waack, S. (2003) Gene prediction with a hidden-Markov model and a new intron submodel. Bioinformatics, 19(Suppl. 2), ii215-ii225.
[24] Krogh, A. (1997) Two methods for improving performance of a HMM and their application for gene finding. In: Gaasterland, T., Karp, P., Karplus, K., Ouzounis, C., Sander, C. and Valencia, A., Eds., Proceedings of the 5th International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, 179-186.
[25] Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., et al. (2007) ClustalW2 and ClustalX version 2. Bioinformatics, 23, 2947-2948.
[26] Burset, M. and Guigo, R. (1996) Evaluation of gene structure prediction programs. Genomics, 34, 353-367.