HEALTH, 2009, 1, 17-23
Published Online June 2009 in SciRes.
Incorporating heterogeneous biological data sources in
clustering gene expression data
Gang-Guo Li1, Zheng-Zhi Wang1
1Institute of Automation, National University of Defense Technology, Changsha 410073, Hunan, China
Received 13 April 2009; revised 13 May 2009; accepted 14 May 2009.
In this paper, a similarity measure between genes based on protein-protein interactions is proposed. The chip-chip data are converted into the same form as gene expression data, with the Pearson correlation as the similarity measure. On the basis of the similarity measures for protein-protein interaction data and chip-chip data, a combined dissimilarity measure is defined. The combined dissimilarity measure is introduced into the K-means method, which yields an improved K-means method. The improved K-means method and three other clustering methods are evaluated on a real dataset. The performance of these methods is assessed by a prediction accuracy analysis based on known gene annotations. Our results show that the improved K-means method outperforms the other clustering methods. The performance of the improved K-means method is also tested by varying the tuning coefficients of the combined dissimilarity measure. The results show that it is helpful and meaningful to incorporate heterogeneous data sources in clustering gene expression data, and that the coefficients for genome-wide or complete data sources should be given larger values when constructing the combined dissimilarity measure.
Keywords: statistical analysis; similarity/dissimilarity
measure; gene expression data; clustering; data
With the rapid development of microarray technology in the past few years, it has become possible to monitor the expression levels of thousands of genes simultaneously [1]. Data from microarray experiments are represented as a matrix of genes by experimental conditions, where the conditions are usually either a set of tissues or consecutive time points under some environmental change. The amount of microarray data is very large, and the data are highly redundant, since many genes do not work alone but are expressed together and interact with each other. Thus, it is important to analyze the gene expression data.
Various clustering methods have been applied to the analysis of gene expression data. Hierarchical clustering [2] has become one of the most widely used techniques for the analysis of gene expression data, with the advantages of simplicity and easy visualization of the results. K-means clustering [3] is a good alternative to hierarchical clustering when there is some prior knowledge about the number of clusters hidden in the data. Hierarchical clustering and K-means clustering are both global clustering methods, which use the full set of dimensions to measure similarity and construct the same global feature space for all clusters; these methods are therefore likely to fail when they are used on high-dimensional data. Many researchers have attempted to solve this problem in different ways. The Self-Organizing Map (SOM) method was introduced to interpret patterns of gene expression [4]. Principal component analysis was used as a feature transformation technique to preprocess expression data [5]. The Support Vector Machine (SVM) method has also been widely used to identify sets of genes, owing to its good learning performance on high-dimensional data [6,7].
However, with the development of various genomic knowledge resources (e.g., characterized protein-protein interactions, functional annotations based on Gene Ontology (GO), and transcription factor binding sites), there is a need to integrate gene expression data with these resources in order to find patterns with more biological meaning.
In recent years, some researchers have worked on combining gene expression data with prior knowledge. Reference [8] proposed the construction of a distance function which combines information originating from microarray assays and biological networks. The derived distance function was then used to perform joint clustering of genes and vertices of the network. Reference [9] introduced a machine learning approach to information fusion which allows integration of heterogeneous genomic data. Their procedure may be seen as a generalization of the well-known SOM algorithm. Reference [10] defined a dissimilarity measure for GO knowledge and combined GO knowledge with gene expression data. They introduced their combined dissimilarity measure into the Partitioning Around Medoids (PAM) [11] algorithm. In this paper we propose a general framework to combine gene expression data with heterogeneous genomic knowledge resources in order to enhance the biological relevance of clustering gene expression data.
2.1. Protein-Protein Interactions Databases
Protein-protein interactions assemble the molecular ma-
chines of the cell and underlie the dynamics of virtually
all cellular responses, while genetic interactions reveal
functional relationships between and within regulatory
modules [6]. The sum of all such interactions defines the
global regulatory network of the cell. Proteomic and
functional genomics platform technologies now generate
large datasets of protein and genetic interactions, but
these datasets vary widely in coverage, data quality, annotation and availability. As a result, there are many different interaction databases, such as BioGRID [12], DIP [13], BIND [14], MINT [15] and MIPS [16]. We use the BioGRID data for our research. BioGRID is a general repository for interaction datasets. The recent version
2.0.34 release of BioGRID is a fully integrated cross-
species database that supports most major model organ-
isms, with increased data content and improved func-
2.2. Chip-Chip Data
Chip-chip data are transcription factor-DNA binding
data that are generated as a result of the chip-chip assay
technique. This technique is used to determine whether
proteins including transcription factors will bind to par-
ticular regions of the chromatin within living cells. Two
genome-wide chip-chip data sources produced by the
technique are now available. One contains information
regarding the binding of 106 yeast transcription factors
[17]. The other one is a similar yeast dataset for a larger
number of transcription factors [18]. Both the data
sources are represented in the form of a confidence value
(statistical p-value) of a transcription factor attaching to
the promoter region of a gene. We use the chip-chip data
source with a larger number of yeast regulators for our
research, which contains information regarding the
binding of 203 regulators to their respective target genes
in rich medium. Besides rich medium, 84 regulators are
profiled in at least one environmental condition other
than rich medium.
3.1. Similarity Measure between Genes with
Protein-Protein Interactions
First, we convert the protein-protein interactions into an interaction matrix. For the whole gene set, which has $N$ genes, the interaction matrix is an $N \times N$ matrix $M$. If there is an interaction between two different genes, gene $i$ and gene $j$, then $M_{ij}=1$; if there is no interaction between them, $M_{ij}=0$. A gene $i$ can always be considered as having an interaction with itself, so $M_{ii}=1$.
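The construction of the interaction matrix can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_interaction_matrix` and the representation of interactions as pairs of integer gene indices are assumptions made for the example, and interactions are treated as undirected.

```python
def build_interaction_matrix(n_genes, interactions):
    """Return an n_genes x n_genes 0/1 interaction matrix as a list of lists.

    `interactions` is an iterable of (i, j) index pairs (hypothetical input
    format); each pair is treated as an undirected interaction.
    """
    m = [[0] * n_genes for _ in range(n_genes)]
    for i in range(n_genes):
        m[i][i] = 1  # a gene is considered to interact with itself
    for i, j in interactions:
        m[i][j] = 1
        m[j][i] = 1  # keep the matrix symmetric
    return m


# Toy example with 4 genes and 2 interactions.
M = build_interaction_matrix(4, [(0, 1), (1, 2)])
```

A dense list of lists is enough for a toy example; for a gene set of realistic size a sparse representation would likely be preferable.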
And then the similarity between genes with protein-
protein interactions could be calculated as follows:
),( (1)
3.2. Similarity Measure of Gene Expression
There are many ways to measure similarity between gene expression profiles. We choose the Pearson correlation. For two gene expression profiles $u$ and $v$, the Pearson correlation, $r_{expr}$, is expressed as follows:
$$ r_{expr}(u, v) = \frac{\sum_{i=1}^{M} (u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\sum_{i=1}^{M} (u_i - \bar{u})^2 \, \sum_{i=1}^{M} (v_i - \bar{v})^2}} \qquad (2) $$
Here, $u_i$ and $v_i$ denote the expression levels of gene expression profiles $u$ and $v$ at experimental condition $i$, respectively. $M$ is the total number of experimental conditions. $\bar{u}$ and $\bar{v}$ are the mean values of the expression levels of $u$ and $v$, respectively.
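As an illustration, the Pearson correlation between two expression profiles can be computed as follows. This is a minimal sketch using only the Python standard library; the function name `pearson` and the choice to reject constant profiles (for which the correlation is undefined) are assumptions of the example.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equally long expression profiles."""
    if len(u) != len(v) or len(u) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_u * var_v)


r_same = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])      # perfectly correlated
r_opposite = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # perfectly anti-correlated
```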
3.3. Similarity Measure of Chip-Chip Data
Suppose the chip-chip data represent the binding affinity of $N$ transcription factors to the promoters of each gene. If $p_{ij}$ is the p-value for rejecting the null hypothesis that transcription factor $j$ does not bind the promoter of gene $i$, we define the binding score of transcription factor $j$ to the promoter of gene $i$ as $x_{ij} = \log(p_{ij}) / \log(p_{\min})$, where $p_{\min}$ is the minimum of all p-values. The vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{iN})$ denotes the complete binding profile for gene $i$. By doing so, the chip-chip data are presented in the same form as gene expression data. Thus we use the Pearson correlation as the similarity measure for the chip-chip data.
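The conversion of p-values into binding profiles can be sketched as below. The sketch assumes that the binding score is the ratio of the logarithm of each p-value to the logarithm of the smallest p-value, which maps p-values in (0, 1] onto scores in [0, 1]; this form, the function name `binding_profiles` and the nested-list input layout are assumptions of the example rather than details confirmed by the text.

```python
import math


def binding_profiles(p_values):
    """Convert p-values into binding scores, one profile (row) per gene.

    p_values[i][j] is the p-value for transcription factor j binding the
    promoter of gene i (hypothetical layout). Assumed score:
    log(p_ij) / log(p_min), where p_min is the smallest p-value overall.
    """
    flat = [p for row in p_values for p in row]
    if any(not 0.0 < p <= 1.0 for p in flat):
        raise ValueError("p-values must lie in (0, 1]")
    p_min = min(flat)
    if p_min == 1.0:
        raise ValueError("all p-values equal 1; scores are undefined")
    log_min = math.log(p_min)
    return [[math.log(p) / log_min for p in row] for row in p_values]


# Two genes, two transcription factors.
profiles = binding_profiles([[0.001, 0.5], [0.1, 1.0]])
```

The resulting rows have the same shape as expression profiles, so any correlation-based similarity defined for expression data can be applied to them directly.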
3.4. Combined Dissimilarity Measure
Once a similarity measure has been defined, a dissimilarity measure can be calculated from it. In this paper we calculate the dissimilarity measure as follows:
$$ \mathrm{dis} = 1 - \mathrm{sim} \qquad (3) $$
Here, $\mathrm{dis}$ and $\mathrm{sim}$ denote the dissimilarity and similarity measures, respectively. We propose to combine gene expression data with protein-protein interaction data and chip-chip data. The combined dissimilarity measure can therefore be expressed as follows:
$$ \mathrm{dis}_{com} = \lambda_1 \mathrm{dis}_{expr} + \lambda_2 \mathrm{dis}_{chip} + \lambda_3 \mathrm{dis}_{PIP} \qquad (4) $$
Here, $\mathrm{dis}_{expr}$, $\mathrm{dis}_{chip}$ and $\mathrm{dis}_{PIP}$ are the dissimilarity measures for gene expression data, chip-chip data and protein-protein interaction data, respectively. They are all assumed to take values in [0,1]. Coefficients $\lambda_1$, $\lambda_2$ and $\lambda_3$ represent the weights of the different dissimilarity measures in the combined dissimilarity measure. Thus they should satisfy $\lambda_1 + \lambda_2 + \lambda_3 = 1$, and they are tuning parameters in [0,1]. Coefficients $\lambda_2$ and $\lambda_3$ quantify the influence of chip-chip and protein-protein interaction knowledge on the clustering process. Note that $\lambda_1=1$, $\lambda_2=0$ and $\lambda_3=0$ means that the clustering is decided only by gene expression data.
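The weighted combination for a single pair of genes can be illustrated with a short Python sketch. The function names and the tolerance used to check that the weights sum to one are assumptions of the example; the three input dissimilarities are taken as already computed and scaled to [0, 1].

```python
def dissimilarity(sim):
    """Turn a similarity value into a dissimilarity value (dis = 1 - sim)."""
    return 1.0 - sim


def combined_dissimilarity(dis_expr, dis_chip, dis_pip, weights):
    """Weighted sum of the three dissimilarities for one pair of genes.

    `weights` is (lambda1, lambda2, lambda3); the weights must be
    non-negative and sum to 1.
    """
    if len(weights) != 3 or min(weights) < 0:
        raise ValueError("three non-negative weights are required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    l1, l2, l3 = weights
    return l1 * dis_expr + l2 * dis_chip + l3 * dis_pip


# Expression only: the combined value equals the expression dissimilarity.
d_expr_only = combined_dissimilarity(0.2, 0.4, 0.9, (1.0, 0.0, 0.0))
# A mixed weighting of the three sources.
d_mixed = combined_dissimilarity(0.2, 0.4, 0.9, (0.5, 0.4, 0.1))
```

Because the weights are non-negative and sum to one, the combined value stays in [0, 1] whenever the three inputs do.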
3.5. Annotation Prediction by Cluster
Annotation prediction of novel genes is one of the initial
and useful applications for gene clustering results. Intui-
tively if an unexpectedly large number of genes in a
cluster belong to a specific functional category ‘F’, then
genes in this cluster are more likely to be relevant to
function ‘F’. Suppose a total of G genes in the genome
are analyzed in the microarray experiment among which
m genes are known to belong to a particular functional
category ‘F’. Within a cluster of size D genes, h genes
belong to the functional category ‘F’. Under the null
hypothesis that annotated genes are randomly distributed
in clusters, h follows a hypergeometric distribution [19].
The p-value (i.e. the probability of observing h or more
annotated genes in the cluster) is calculated as:
$$ P[X \ge h] = 1 - \sum_{i=0}^{h-1} \frac{\binom{m}{i}\binom{G-m}{D-i}}{\binom{G}{D}} \qquad (5) $$
Here, $X$ denotes the number of annotated genes in the cluster under the null hypothesis. Intuitively, an unexpectedly large h results in a small p-value, indicating that the majority of the genes in the cluster might belong to the functional category ‘F’. Given a pre-defined threshold Δ, which is determined after multiple comparison correction, all genes in the cluster are assigned (predicted) to ‘F’ if the p-value of the cluster is less than Δ. Note that a cluster can be annotated to more than one functional category by this procedure.
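The hypergeometric p-value can be computed exactly with integer binomial coefficients. The sketch below is illustrative: the function name `enrichment_p_value` is an assumption, and it sums the upper tail directly, which is mathematically equal to one minus the lower-tail sum but avoids cancellation when the p-value is very small.

```python
from math import comb


def enrichment_p_value(G, m, D, h):
    """Probability of seeing h or more annotated genes in a cluster.

    G: genes analysed, m: genes annotated to category F,
    D: cluster size, h: annotated genes observed in the cluster.
    """
    if not (0 <= m <= G and 0 <= D <= G and 0 <= h <= min(m, D)):
        raise ValueError("inconsistent counts")
    # Upper tail of the hypergeometric distribution, computed exactly.
    upper = sum(comb(m, i) * comb(G - m, D - i) for i in range(h, min(m, D) + 1))
    return upper / comb(G, D)


p_toy = enrichment_p_value(G=10, m=4, D=3, h=2)  # 40 / 120
p_all = enrichment_p_value(G=10, m=4, D=3, h=0)  # always 1
```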
3.6. Knowledge-Based Cluster Evaluation
Many methods for cluster evaluation have been proposed
in the recent years. Reference [20] presented a survey of
cluster validation techniques including internal and exter-
nal indices and also explained specific biases intrinsic for
different evaluation measures. Among these methods there
are some utilizing annotations as the basis of knowl-
edge-based assessment. Reference [21] introduced a
knowledge-driven cluster evaluation method based on va-
lidity indices that incorporate similarity knowledge origi-
nating from the GO. Reference [22] also gave a cluster
evaluation method on the basis of pooling results from
various cluster number K and calculating functional pre-
diction accuracy according to the functional annotated
genes belonging to M known distinct functional categories.
In this paper, we use the knowledge-based cluster evaluation method of [22], which avoids the problem of estimating the cluster number K.
As described in [22], the prediction accuracy of a
clustering method for a given p-value threshold Δ is de-
fined as:
$$ A(\Delta) = \sum_{K} VP_K(\Delta) \Big/ \sum_{K} PM_K(\Delta) \qquad (6) $$
Here, $A(\Delta)$ is the prediction accuracy under the given p-value threshold Δ; $VP_K(\Delta)$ is the total number of functionally annotated genes, under a given K, which are assigned to the correct functional category and whose p-values are smaller than Δ; and $PM_K(\Delta)$ is the total number of genes, under a given K, in the clusters which have functionally annotated genes assigned to the correct functional category with p-values smaller than Δ.
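The pooling over cluster numbers can be sketched as follows. The function name `prediction_accuracy` and the dictionaries keyed by K are hypothetical conveniences for the example; the counts themselves would come from the annotation prediction step for a fixed threshold.

```python
def prediction_accuracy(vp_by_k, pm_by_k):
    """Pooled accuracy: sum of VP_K over K divided by sum of PM_K over K.

    vp_by_k and pm_by_k map each cluster number K to the corresponding
    count for one fixed p-value threshold (hypothetical input layout).
    """
    if set(vp_by_k) != set(pm_by_k):
        raise ValueError("both inputs must cover the same values of K")
    total_pm = sum(pm_by_k.values())
    if total_pm == 0:
        raise ValueError("no predictions were made at this threshold")
    return sum(vp_by_k.values()) / total_pm


# Toy counts for K = 2 and K = 3.
accuracy = prediction_accuracy({2: 3, 3: 5}, {2: 10, 3: 10})
```

Pooling the sums, rather than averaging per-K ratios, weights each K by the number of predictions it contributes.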
4.1. Dataset
We used the yeast cell-cycle data [23] which had listed
six known disjoint functional categories containing 104
genes. After we preprocessed and filtered the cell-cycle
data, we had 1703 genes left in which 86 genes belong to
the six known disjoint functional categories. Missing
values of the 1703 genes were imputed by LSimpute
method [24] and then the data of 1703 genes were nor-
malized. The yeast protein-protein interactions database
was downloaded from BrioGRID (http://www.thebiogrid.
org/downloads.php). The yeast database has 93122 pro-
tein-protein interactions. We neglected the differences
between one directional interaction and two directional
interactions and discarded any interactions involving
protein with itself. Then 50434 protein-protein interactions
were left. Among the 50434 protein-protein interactions,
only those interactions, whose two interacting genes were
both among the 1703 genes, were retained. Finally 3604
protein-protein interactions were obtained. The chip-chip
data was downloaded form the website [25].
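The interaction filtering steps described above can be sketched as below. The function name `filter_interactions` and the gene identifiers in the toy example are hypothetical, and the actual BioGRID file parsing is omitted.

```python
def filter_interactions(raw_pairs, retained_genes):
    """Filter raw interaction pairs down to an undirected, deduplicated set.

    Self-interactions are dropped, (a, b) and (b, a) are merged, and only
    pairs with both genes in `retained_genes` are kept.
    """
    retained = set(retained_genes)
    kept = set()
    for a, b in raw_pairs:
        if a == b:
            continue  # discard interactions of a protein with itself
        if a in retained and b in retained:
            kept.add(frozenset((a, b)))  # direction is ignored
    return kept


# Toy example with placeholder gene identifiers.
raw = [("g1", "g2"), ("g2", "g1"), ("g1", "g1"), ("g1", "g9")]
kept = filter_interactions(raw, ["g1", "g2", "g3"])
```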
4.2. Algorithms and Clustering Methods
We replaced the dissimilarity measure usually used in the K-means algorithm with our combined dissimilarity measure (we call this method ‘Impkmeans’ in the rest of this paper) and compared it with three other clustering methods, hierarchical clustering (HC), PAM and SOM, each using only the gene expression data.
4.3. Results
The four clustering methods are implemented with K varying from 2 to 20 for a pooled analysis of functional prediction evaluation. The prediction accuracy is calculated by the method described in Section 3.6. Figure 1 presents the curves of prediction accuracy (y-axis) of all four clustering methods versus the total number of predictions (x-axis) for varying p-value threshold Δ = (0.03, 0.02, 0.015, 0.01, 0.005, 0.001, 0.0005, 0.0001). Among the four clustering methods, the predictive performance of Impkmeans with $\lambda_1=\lambda_2=\lambda_3=1/3$ is the best. PAM performs worse, and HC and SOM perform the worst.
Figure 1. Prediction accuracy curves of different clustering methods (x-axis: total number of predictions; y-axis: prediction accuracy).
Figure 2. Prediction accuracy curves with different coefficient values (x-axis: p-value threshold Δ; y-axis: prediction accuracy). Cases, given as ($\lambda_1$, $\lambda_2$, $\lambda_3$): (a) 1, 0, 0; (b) 0, 1, 0; (c) 0.5, 0.5, 0; (d) 0.5, 0, 0.5; (e) 1/3, 1/3, 1/3.
Table 1 shows the minimum p-values with K from 2 to 20 when $\lambda_1=0$, $\lambda_2=0$ and $\lambda_3=1$. Figure 2 presents five different curves of prediction accuracy (y-axis) versus p-value threshold (x-axis) Δ = (0.03, 0.02, 0.015, 0.01, 0.005, 0.001, 0.0005, 0.0001) for varying coefficients $\lambda_1$, $\lambda_2$ and $\lambda_3$ among five cases: (a) $\lambda_1=1$, $\lambda_2=0$ and $\lambda_3=0$; (b) $\lambda_1=0$, $\lambda_2=1$ and $\lambda_3=0$; (c) $\lambda_1=0.5$, $\lambda_2=0.5$ and $\lambda_3=0$; (d) $\lambda_1=0.5$, $\lambda_2=0$ and $\lambda_3=0.5$; and (e) $\lambda_1=\lambda_2=\lambda_3=1/3$. The predictive performance of case (e) is the best, followed by cases (c) and (d). Case (a) performs worse, and case (b) performs the worst.
Figure 3 also presents four different curves of prediction accuracy (y-axis) versus p-value threshold (x-axis) Δ = (0.03, 0.02, 0.015, 0.01, 0.005, 0.001, 0.0005, 0.0001) for varying coefficients $\lambda_1$, $\lambda_2$ and $\lambda_3$ among four cases: (a) $\lambda_1=\lambda_2=\lambda_3=1/3$ (the equal-weight setting, case (e) of Figure 2); (f) $\lambda_1=0.6$, $\lambda_2=0.3$ and $\lambda_3=0.1$; (g) $\lambda_1=0.5$, $\lambda_2=0.3$ and $\lambda_3=0.2$; and (h) $\lambda_1=0.5$, $\lambda_2=0.4$ and $\lambda_3=0.1$. The predictive performance of case (h) is the best, followed by case (g). Case (f) performs worse, and case (a) performs the worst.
DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes. Clustering analysis seeks to partition the expression data into groups based on specified features so that the co-expression patterns hidden in the expression data can be found. However, the final purpose of gene clustering analysis is not to find co-expression patterns but to find patterns with more biological meaning. Although pure clustering methods that do not incorporate prior knowledge have proven useful for identifying biologically relevant patterns under some special conditions, they perform poorly under other conditions. It is therefore necessary to incorporate prior knowledge in clustering analysis. In our study, we incorporated chip-chip data and protein-protein interaction knowledge with gene expression data, proposed the Impkmeans method and compared it with other commonly used methods. Figure 1 shows that the Impkmeans method has the best performance, which means that it is practicable to construct a combined dissimilarity measure by combining the dissimilarity measures of other biological knowledge with the dissimilarity measure of gene expression data. Introducing the combined dissimilarity into K-means gave the effective Impkmeans method. The combined dissimilarity could also easily be extended to other clustering methods such as PAM and HC, which may yield other effective methods and better results.
Table 1. Minimum p-values with varying K.
K     p-value
2     0.96506
3     0.94805
4     0.93756
5     0.92456
6     0.91487
7     0.90789
8     0.89874
9     0.88726
10    0.86937
11    0.85162
12    0.82931
13    0.80737
14    0.78972
15    0.76531
16    0.73622
17    0.69944
18    0.64871
19    0.59468
20    0.53274
Figure 3. Prediction accuracy curves with different coefficient values (x-axis: p-value threshold Δ; y-axis: prediction accuracy). Cases, given as ($\lambda_1$, $\lambda_2$, $\lambda_3$): (a) 1/3, 1/3, 1/3; (f) 0.6, 0.3, 0.1; (g) 0.5, 0.3, 0.2; (h) 0.5, 0.4, 0.1.
As we can see in Table 1, the minimum p-values with K varying from 2 to 20 are all far larger than the p-value thresholds Δ used in our study. This shows that when $\lambda_1=0$, $\lambda_2=0$ and $\lambda_3=1$, all 86 annotated genes are clustered in one cluster, and this result has no biological significance. This is partially because the interactions between proteins are complicated and one protein may have many different interactions with different proteins.
As shown in Figure 2, the performance of case (a) is worse than those of cases (c), (d) and (e), which confirms that using only gene expression data is not enough to infer more biologically meaningful patterns. That is to say, incorporating prior knowledge in clustering gene expression data helps to obtain more biologically meaningful clusters. The performances of cases (c) and (d) are worse than that of case (e), which indicates that the more heterogeneous biological data sources are incorporated in clustering gene expression data, the better the result is. The prediction accuracy of case (a) is higher than that of case (b), which implies that the value of coefficient $\lambda_1$ should be larger than that of $\lambda_2$ when chip-chip data are combined with gene expression data. The prediction accuracy of case (c) is higher than that of case (d), which implies that the value of coefficient $\lambda_2$ should be larger than that of $\lambda_3$ when chip-chip data and protein-protein interaction data are incorporated in clustering gene expression data.
As shown in Figure 3, the prediction accuracy changes with the proportions of coefficients $\lambda_1$, $\lambda_2$ and $\lambda_3$. The performance of case (a) is worse than that of the other cases in Figure 3, which indicates that the three data sources are not equally important for constructing the combined dissimilarity measure, and that better results are obtained when the ratios of coefficient $\lambda_1$ to the other coefficients are higher than 1. However, larger ratios of coefficient $\lambda_1$ to the other coefficients do not always give better results. This is shown by the better results of case (g) compared with case (f): the ratios of $\lambda_1$ to $\lambda_2$ and $\lambda_3$ in case (f) are 2 and 6 respectively, while the ratios of $\lambda_1$ to $\lambda_2$ and $\lambda_3$ in case (g) are 5/3 and 2.5. That is to say, the influence of the other biological data sources on detecting biologically meaningful clusters should not be neglected, although the gene expression data are the most important. Moreover, the performance of case (h) is better than that of case (g), which shows that different biological data sources have different levels of importance and that chip-chip data are more important than protein-protein interaction data. This is partially because the chip-chip data are a genome-wide data source, while the protein-protein interaction data are a partial data source.
In general, it is helpful and meaningful to incorporate heterogeneous data sources in clustering gene expression data, and the coefficients for genome-wide or complete data sources should be given larger values when constructing the combined dissimilarity measure.
The similarity measures for protein-protein interaction and chip-chip data were constructed by transforming the representation of these data. They were combined with the similarity measure for gene expression data under a general framework. The combined measure was used to enhance the K-means method, which then produced gene patterns with more biological meaning. This shows that our framework is useful for incorporating heterogeneous biological data sources in clustering gene expression data.
Moreover, our general framework can be extended. As we can see in (4), the combined dissimilarity measure could be extended from the sum of three different dissimilarity measures to the sum of $M$ different dissimilarity measures:
$$ \mathrm{dis}_{comb}(g_i, g_j) = \sum_{m=1}^{M} \lambda_m \, \mathrm{dis}_m(g_i, g_j) $$
Here, $\mathrm{dis}_m$ refers to the m-th dissimilarity measure, and the coefficients $\{\lambda_m\}_{m=1,\ldots,M}$ satisfy $\lambda_1 + \ldots + \lambda_M = 1$.
Each $\mathrm{dis}_m$ may be a dissimilarity measure of some other gene expression data, or a user-defined dissimilarity measure of some other prior knowledge such as GO, protein binding site knowledge or protein-DNA interaction knowledge. By choosing gene expression data and prior knowledge suited to the research purpose, and fusing all this information together when carrying out the clustering, we can expect to obtain more biologically meaningful results.
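The extension to an arbitrary number of data sources can be illustrated on precomputed pairwise dissimilarity matrices. This is a sketch only: the function name `combine_dissimilarity_matrices`, the list-of-lists matrix layout and the toy values are assumptions of the example.

```python
def combine_dissimilarity_matrices(matrices, weights):
    """Weighted sum of several n x n pairwise dissimilarity matrices.

    matrices: list of square list-of-lists matrices, one per data source.
    weights:  one non-negative weight per matrix; the weights sum to 1.
    """
    if not matrices or len(matrices) != len(weights):
        raise ValueError("one weight per matrix is required")
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(matrices[0])
    if any(len(mat) != n or any(len(row) != n for row in mat) for mat in matrices):
        raise ValueError("all matrices must be n x n")
    return [
        [sum(w * mat[i][j] for w, mat in zip(weights, matrices)) for j in range(n)]
        for i in range(n)
    ]


# Two toy sources for two genes, weighted equally.
source_a = [[0.0, 0.2], [0.2, 0.0]]
source_b = [[0.0, 0.8], [0.8, 0.0]]
combined = combine_dissimilarity_matrices([source_a, source_b], [0.5, 0.5])
```

Any clustering method that accepts a precomputed dissimilarity matrix could then take `combined` as its input.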
The authors thank Dr. Yaohua Du for valuable discussions. The authors are also very grateful to the anonymous referees, whose helpful comments improved the quality of this paper. This work is partly supported by the National Natural Science Foundation of China under Grant No. 60835005.
[1] D. Lockhart and E. Winzeler, (2000) Genomics, gene
expression and DNA arrays, Nature, 405, 827–836.
[2] M. B. Eisen, P. T. Spellman, P. O. Brown, and D.
Botstein, (1998) Cluster analysis and display of genome-
wide expression patterns, Proc. Natl Acad. Sci. USA, 95,
[3] S. Tavazoie, J. D. Hughes, M. J. Campbell, R. J. Cho,
and G. M. Church, (1999) Systematic determination of
genetic network architecture, Nature Genetics, 22,
[4] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S.
Kitareewan, E. Dmitrovsky, E. S. Lander, and T. R.
Golub, (1999) Interpreting patterns of gene expression
with self-organizing maps: Methods and application to
hematopoietic differentiation, Proc. Natl Acad. Sci. USA,
96, 2907–2912.
[5] S. Raychaudhuri, J. M. Stuart, and R. B. Altman, (2000)
Principal component analysis to summarize microarray
experiments: Application to sporulation time series, Pac.
Symp. Biocomput., 455–466.
[6] M. P. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W.
Sugnet, T. S. Furey, M. Jr. Ares, and D. Haussler, (2000)
Knowledge-based analysis of microarray gene expression
data by using support vector machines, Proc. Natl Acad.
Sci, 97, 262–267.
[7] H. Xia, A. Panaye, and B. T. Fan, (2007) Nonlinear SVM
approaches to QSPR/QSAR studies and drug design,
Current Computer-Aided Drug Design, 3, 341–352.
[8] D. Hanisch, A. Zien, R. Zimmer, and T. Lengauer, (2002)
Coclustering of biological networks and gene expression
data, Bioinformatics, 18, 145–154.
[9] J. Kasturi and R. Acharya, (2005) Clustering of diverse
genomic data using information fusion, Bioinformatics,
21, 423–429.
[10] K. Rafal and Z. Adam, (2006) Incorporating gene
ontology in clustering gene expression data, CBMS’06.
[11] L. Kaufman and P. Rousseeuw, (1990) Finding groups in
data: An introduction to cluster analysis, Wiley, New York.
[12] S. Chris, B. Bobby-Joe, R. Teresa, B. Lorrie, B. Ashton,
and T. Mike, (2006) BioGRID: A general repository for
interaction datasets, Nucleic Acids Research, Database
issue, 34, D535–D539.
[13] X. Ioannis, F. Esteban, S. Lukasz, D. Xiaoqun, T.
Michael, M. Edward, and E. David, (2001) DIP: The
database of interacting proteins: 2001 update, Nucleic
Acids Research, 29, 239–241.
[14] C. Alfarano, C. E. Andrade, K. Anthony, N. Bahroos, M.
Bajec, K. Bantoft, D. Betel, B. Bobechko, K. Boutilier,
and E. Burgess, (2005) The biomolecular interaction
network database and related tools: 2005 update, Nucleic
Acids Res., 33, D418–D424.
[15] A. Zanzoni, L. Montecchi-Palazzi, M. Quondam, G.
Ausiello, M. Helmer-Citterich, and G. Cesareni, (2002)
MINT: A molecular INTeraction database, FEBS Lett.,
513, 135–140.
[16] H. W. Mewes, C. Amid, R. Arnold, D. Frishman, V.
Guldener, G. Mannhaupt, M. Munsterkotter, P. Pagel, N.
Strack, and V. Stumpflen, (2004) MIPS: Analysis and
annotation of proteins from whole genomes, Nucleic
Acids Res, 32, 41–44.
[17] C. T. Harbison, D. B. Gordon, T. I. Lee, N. J. Rinaldi, K.
D. Macisaac, T. W. Danford, N. M. Hannett, J. B. Tagne,
D. B. Reynolds, J. Yoo, et al., (2004) Transcriptional
regulatory code of a eukaryotic genome, Nature, 431,
[18] T. I. Lee, N. J. Rinaldi, F. Robert, D. T. Odom, Z.
Bar-Joseph, G. K. Gerber, N. M. Hannett, C. T. Harbison,
C. M. Thompson, I. Simon, et al., (2002) Transcriptional
regulatory networks in Saccharomyces cerevisiae,
Science, 298, 799–804.
[19] S. Tavazoie, J. D. Hughes, M. J. Campbell, R. J. Cho,
and G. M. Church, (1999) Systematic determination of
genetic network architecture, Nature Genetics, 22,
[20] J. Handl, J. Knowles, and D. Kell, (2005) Computational
cluster validation in post-genomic data analysis,
Bioinformatics, 21, 3201–3212.
[21] N. Bolshakova, F. Azuaje, and P. Cunningham, (2005) A
knowledge-driven approach to cluster validity
assessment, Bioinformatics, 21, 2546–2547.
[22] A. Thalamuthu, M. Indranil, X. J. Zeng, and G. C. Tseng,
(2006) Evaluation and comparison of gene clustering
methods in microarray analysis, Bioinformatics, 22,
[23] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K.
Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B.
Futcher, (1998) Comprehensive identification of cell
cycle-regulated genes of the yeast Saccharomyces
cerevisiae by microarray hybridization, Mol. Biol. Cell, 9,
[24] B. Trond, D. Bjarte, and J. Inge, (2004) LSimpute:
accurate estimation of missing values in microarray data
with least squares methods, Nucleic Acids Research,
[25] Young Lab,