X. W. LI, F. ZHU
Copyright © 2013 SciRes. ENG
and parameters, and put forward a solution: ensemble
clustering. They also use the MeSH ontology, which
contains a wealth of biological knowledge, as background
knowledge to improve the clustering. This algorithm,
based on the distance between MeSH terms, is shown to
give better clustering results than other methods [13].
• Yuan Yinli proposed a revised fuzzy algorithm to
address the difficulty of selecting hidden node centers
in RBF networks, and applied it to make the network
more robust to outliers. Its effectiveness was
demonstrated [14].
• Cuifang GAO proposed a new algorithm, CKFCM
(collaborative kernel fuzzy c-means clustering), in
which a collaborative relationship function is
incorporated into kernel fuzzy c-means clustering
(KFCM). By enlarging the differences among the samples
and processing several subsets together under one
objective function, CKFCM achieves better classification
and more effective clustering [15].
• An improved weighted fuzzy kernel clustering
algorithm (WFKCA) is proposed to overcome the original
algorithm's tendency to get stuck in local optima. To
reduce this risk, the idea of the iterative
self-organizing data analysis technique algorithm
(ISODATA) is introduced into WFKCA, and the initial
center vectors are adjusted using the intermediate
results from splitting and/or merging cluster centers.
The improved algorithm clusters more stably because it
uses a matching measure from the feature space, and it
increases the adjustment range of the cluster
centers [16].
• Damodar Reddy Edla and Prasanta K. Jana proposed a
new clustering algorithm based on the Voronoi diagram.
The algorithm uses a real-valued function defined by
the radii of Voronoi circles. This function lets the
algorithm handle the inner points of the clusters
first, followed by the boundary points. The proposed
scheme is applied to various artificial and biological
data sets, and the experimental results are compared
with K-means and a few existing clustering
techniques [17].
• When noise is taken into account, there are several
ways to deal with it. For example, Roman Sloutsky,
Nicolas Jimenez, S. Joshua Swamidass and Kristen M.
Naegle explore several methods of accounting for noise
when analyzing biological data sets through
clustering [18].
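
Several of the methods above build on kernel fuzzy c-means (KFCM). As a rough illustration of that base technique only, the following is a minimal sketch of one common KFCM formulation with a Gaussian kernel and cluster centers kept in the input space. It is not the CKFCM or WFKCA algorithm of [15] or [16]. The function names, the farthest-first initialization, the parameter defaults and the toy points are all assumptions made for this example.

```python
import math


def sq_dist(x, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def gaussian_kernel(x, v, sigma):
    """Gaussian kernel K(x, v) = exp(-||x - v||^2 / sigma^2)."""
    return math.exp(-sq_dist(x, v) / sigma ** 2)


def farthest_first(points, c):
    """Deterministic initialization (an assumption of this sketch):
    start from the first point, then repeatedly add the point that is
    farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < c:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, v) for v in centers))
        )
    return [list(v) for v in centers]


def kfcm(points, c, m=2.0, sigma=1.0, max_iter=100, tol=1e-6):
    """Kernel fuzzy c-means sketch; returns (centers, memberships)."""
    centers = farthest_first(points, c)
    dim = len(points[0])
    u = []
    for _ in range(max_iter):
        # Membership update: u_ik is proportional to
        # (1 - K(x_k, v_i)) ** (-1 / (m - 1)); the small constant
        # avoids division by zero when a point coincides with a center.
        u = []
        for x in points:
            w = [
                (1.0 - gaussian_kernel(x, v, sigma) + 1e-12) ** (-1.0 / (m - 1.0))
                for v in centers
            ]
            total = sum(w)
            u.append([wi / total for wi in w])
        # Center update: mean of the points weighted by u_ik^m * K(x_k, v_i).
        new_centers = []
        for i, v in enumerate(centers):
            g = [
                (u[k][i] ** m) * gaussian_kernel(x, v, sigma)
                for k, x in enumerate(points)
            ]
            total = sum(g)
            new_centers.append(
                [sum(gk * x[d] for gk, x in zip(g, points)) / total
                 for d in range(dim)]
            )
        shift = max(
            abs(a - b)
            for old, new in zip(centers, new_centers)
            for a, b in zip(old, new)
        )
        centers = new_centers
        if shift < tol:
            break
    return centers, u


# Hypothetical toy data: two well-separated groups of 2-D points.
toy_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
              (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
toy_centers, toy_u = kfcm(toy_points, c=2, sigma=2.0)
# Hard labels: the cluster with the largest membership for each point.
toy_labels = [max(range(2), key=lambda i: row[i]) for row in toy_u]
```

Because the Gaussian kernel gives distant points almost no weight in the center update, this sketch is sensitive to where the centers start, which is one reason a spread-out initialization is used here. That sensitivity is the local-optimum problem that the WFKCA work above sets out to reduce.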
4. Introduction to Bioinformatics Databases
• GenBank: A comprehensive sequence database that
contains almost all of the protein and DNA sequences
found so far, together with the related papers. Each
record has a brief description, including the
scientific name, references, a feature table and the
sequence itself.
• GDB: Stores and manages gene data for the Human
Genome Project, including human genome regions, human
genome maps and genetic variation. It provides direct
read and write access.
• PIR and PSD: A comprehensive, annotated,
non-redundant protein sequence database, including
protein sequences that come from dozens of integrated
genes. Almost 99% of the entries have been classified
into protein families, and the annotations support
cross-referencing.
• COG: An attempt at a phylogenetic classification of the
proteins encoded in 21 complete genomes of bacteria,
archaea and eukaryotes, constructed by applying the
criterion of consistency of genome-specific best hits
to the results of an exhaustive comparison of all
protein sequences from these genomes. The database
comprises 2091 COGs that include 56% - 83% of the
gene products from each of the complete bacterial and
archaeal genomes and ~35% of those from the yeast
Saccharomyces cerevisiae genome [19].
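
The "consistency of genome-specific best hits" criterion used for COG can be illustrated with a small sketch. The version below is a simplification and not the actual COG construction procedure: it assumes symmetric best hits, forms triangles of three proteins from three different genomes that are all mutual best hits, and merges triangles that share a side. The function names, protein identifiers and similarity scores are hypothetical.

```python
from itertools import combinations


def best_hits(scores, genome_of):
    """Map (protein, other_genome) to the best-scoring protein in that genome."""
    best = {}
    for (a, b), s in scores.items():
        for p, q in ((a, b), (b, a)):
            g = genome_of[q]
            if genome_of[p] == g:
                continue  # only hits between different genomes count
            key = (p, g)
            if key not in best or s > best[key][1]:
                best[key] = (q, s)
    return {key: hit for key, (hit, _) in best.items()}


def is_mutual(p, q, best, genome_of):
    """True if p and q are each other's best hit in the other's genome."""
    return (best.get((p, genome_of[q])) == q
            and best.get((q, genome_of[p])) == p)


def cog_sketch(scores, genome_of):
    """Group proteins by merging best-hit triangles that share a side."""
    best = best_hits(scores, genome_of)
    triangles = []
    for trio in combinations(sorted(genome_of), 3):
        if len({genome_of[p] for p in trio}) < 3:
            continue  # a triangle needs three different genomes
        if all(is_mutual(p, q, best, genome_of)
               for p, q in combinations(trio, 2)):
            triangles.append(set(trio))

    # Union-find over triangles: join two triangles that share two proteins.
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) >= 2:
            parent[find(i)] = find(j)

    groups = {}
    for i, tri in enumerate(triangles):
        groups.setdefault(find(i), set()).update(tri)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy input: seven proteins from four genomes A-D.
toy_genome_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B",
                 "c1": "C", "c2": "C", "d1": "D"}
toy_scores = {
    ("a1", "b1"): 90, ("a1", "c1"): 85, ("b1", "c1"): 88,
    ("a1", "d1"): 80, ("b1", "d1"): 82,
    ("a2", "b2"): 70, ("a2", "c2"): 75, ("b2", "c2"): 72,
    ("a1", "b2"): 20, ("a2", "b1"): 15,
}
toy_groups = cog_sketch(toy_scores, toy_genome_of)
```

In this toy input the triangles (a1, b1, c1) and (a1, b1, d1) share the side a1-b1, so they merge into one group of four proteins, while (a2, b2, c2) stays a separate group.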
5. Conclusion
The survey above shows that every clustering algorithm
has its own advantages, disadvantages and scope of
application, so each algorithm must be analyzed before
it can be used well. Applying clustering algorithms to
biological data is clearly useful. Moreover, the deeper
the research goes, the higher the demands on the
algorithms become. Each case should therefore be
analyzed to determine its exact requirements, and the
current algorithms should then be improved to give
better results.
6. Acknowledgements
Xiaowan LI (ID Number: 1027401004) is currently an
undergraduate student at the School of Computer Science
and Technology of Soochow University, majoring in
computer science and technology.
REFERENCES
[1] J.-G. Sun, J. Liu and L.-Y. Zhao, “Clustering Algorithms
Research.”
[2] M. X. Duan, 2009-5-1.
[3] L. M. Wang and X. D. Wang, “A Non-Parametric Bay-
esian Clustering for Gene Expression Data,” IEEE
Workshop on Statistical Signal Processing (SSP), Ann
Arbor, 5-8 August 2012, pp. 556-559.
[4] M. Zhang and J. Yu, “Fuzzy Partitional Clustering Algo-
rithms.”
[5] L. Wang, H. Peng, J.-S. Hu and H.-F. Liang, “Fuzzy
Clustering Applied in Genetic Differentiation Analysis,”