Fuzzy clustering of time series gene expression data with cubic-spline

doi:10.4236/jbm.2013.13004

Paper Menu >>

Journal Menu >>

Journal of Biosciences and Medicines, 2013, 1, 16-21 JBM

http://dx.doi.org/10.4236/jbm.2013.13004 Published Online December 2013 (http://www.scirp.org/journal/jbm/)

OPEN ACCESS

Fuzzy clustering of time seri es gene expression data with

cubic-spline

Yu Wang, Maia Angelova, Akhtar Ali

Mathematical and Intelligent Modeling Lab, Northumbria University, Newcastle upon Tyne, UK

Email: yu.wang@northumbria.ac.uk, maia.angelova@northumbria.ac.uk, akhtar.ali@northumbria.ac.uk

Received September 2013

ABSTRACT

Data clustering techniques have been applied to ex-

tract information from gene expression data for two

decades. A large volume of novel clustering algo-

rithms have been developed and achieved great suc-

cess. However, due to the diverse structures and in-

tensive nois e, there is no reliable clustering approach

can be applied to all gene expression data. In this pa-

per, we aim to the feature of high nois e and propose a

cubic smoothing spline fitted for the time course ex-

pression profile, by which noise can be filtered and

then groups genes into clusters by applying fuzzy c-

means clustering on the resulting splines (FCMS).

The discrete values of radius of curvature are used to

compute the similarity between spline cu rves. Resu lts

on gene expression data show that the FCMS has

better perform a nce than the original fuzzy c-means

on reliability and noise robustness.

Keywords: Fuzzy c-Means; Cubic Spline; Noise; Radius

of Curvature

1. INTRODUCTION

With the development of DNA sequencing techniques

and microarray technology, genomic research has achi-

eved a great success. A wealth of biological data has

been extracted from microarrays. Analysis of these data

on the molecular level is revolutionary in medicine be-

cause they are highly informative. Innovative models are

needed instead of straightforward adaptations of existing

methodologies. Clustering of gene expression data pro-

vides an efficient way to extract information from these

la rg e -scale datasets [1]. The underlying assumption in

clustering gene expression data is that co-expression in-

dicates co-regulation, thus clustering should identify

genes that share similar functions.

Generally, there are two categories of gene expression

data: static and time series [2,3]. In static expression ex-

periments, a snapshot of the expression of genes in dif-

ferent samples is measured, while in time series expres-

sion experiments, a temporal process is measured. An-

other important difference between these two types of

data is that while static data from a sample population

are assumed to be independent identically distributed,

while time series data exhibit a strong autocorrelation

between successive points. Most previous works analyz-

ing time series expression used methods developed orig-

inally for static data by neglecting the time series cha-

racteristics [1]. More recently several new algorithms

specifically targeting time series expression data were

presented. A very popular procedure in time-series anal-

ysis is smoothing the data, which removes random varia-

tion and shows trends and cyclic components [4]. Bar-

Joseph et al. [5] used statistical spline estimation to

represent time-series gene expression profiles, however,

the method requires data that has been sampled at a suf-

ficiently high rate. In addition, cubic splines are used for

gene expression time-series, however no appropriate

similarity metric is adopted [6]. Later, Luan and Li [7]

proposed a mixed-effects model using cubic B-splines

for gene expression time-series. However, it is not al-

ways possible to define equally spaced knots if the series

are unevenly sampled.

In this paper, we focus on the time series characteris-

tics and proposed an integrate approach for clustering

time series microarray data. The approach was composed

of two steps: the first one is modeling gene expression

profiles by cubic spline curves. By tuning the smoothing

parameter, gene expression data can be smoothed with

statistical consideration. The second step is fuzzy c-

means on the radius of curvature of the smoothed curves

after discretization.

2. BACKGROUND

2.1. Fuzzy c-Means Algorithm

Large volumes of clustering algorithms have been ap-

plied to the analysis of gene expression data, such as

k-means [8], hierarchical clustering [9] and SOM [10].

However, most of these algorithms are restricted to a one

Y. Wang et al. / Journal of Biosciences and Medicines 1 (2013) 16-21

to one mapping strategy: one gene belongs to exactly one

cluster. In biology, genes can participate in more than

one genetic network and are frequently coordinated by a

variety of r egulatory mechanisms. To address this f eature,

fuzzy clustering is applied for gene expression data

analysis [11].

The fuzzy c-means clustering algorithm (FCM) is ac-

tually a variation of the k-means clustering algorithm,

which allows one object belong to more than one cluster

[12]. The FCM assigns a membership degree to each

object in the data [13]. The centroids are computed based

on the degree of memberships of data points. The algo-

rithm is an iterative optimization that minimizes the co st

function defined as follows:

(;)

CN m

ij ijji

Judx v

= =

∑∑

(1)

where C and N denote the number of clusters and data

points respectively,

represent the membership of

point j in the i th cluster,

is the i th cluster center,

which satisfy:

u≤≤

and

∑

is the dis-

tance between feature vector

and prototypes

The original formulation of FCM uses prototypes and

inner-product induced norm metric for

given by:

(;) ()()

ijji jijiji

dxvxvxvAxv=−=−−

(2)

The parameter m controls the fuzziness of the resulting

partition. If

1m→

, the fuzzy clustering will turn into

hard clustering. The prototypes are simply the means of

the clusters. If

m→∞

, the partition approaches max-

imal fuzziness, and a gene will be assigned to all clusters

equally.

In the FCM, we solve the optimization problem using

Lagrange multipliers [12]:

11 1

(; )(1)

CN C

ij ijjiij

ij i

Judx vu

= ==

= +−

∑∑ ∑

(3)

The problem is solved by iteratively updating degrees

of membership with ﬁxed centers and updating centers

with ﬁxed degrees of membership. The closed-form for-

mulas for updates are derived by taking the partial deriv-

atives with respect to both and setting them to zero. The

partition matrix and the cluster center of FCM are esti-

mated by (4) and (5).

21/( 1)

(1 /(,))

ijj i

ij cm

ijj i

d xv

−

∑

(4)

ij j

iNm

∑

(5)

2.2. Cubic Spline

In numerical analysis, Cubic-spine has been widely used

in image processing and computer graphics. A cubic

spline is a piecewise third-order polynomial which is

smooth in the first derivative and continuous in the

second derivative. It is further called natural if the se-

cond derivatives at its boundaries are enforced to be both

zero. Given a set of coordinates

( ,),.....,(,)

xy xy

, we

seek a natural cubic spline function f(x) that.

() 0

yfxi n= ≤≤

(6)

Let us define

(, )

is one interval, the second de-

rivatives of the function f(x) at

and

are

()

′′

and

()

′′

respectively. The cubic spline on the in-

terval can be rewrite as a cubic polynomial,

()()( )()( )

iii i

fxAfx BxCfx Dfx

′′ ′′′

=++ +

(7)

[, ]

x xx

∈

where

Axx

−

=−

Bxx

−

=−

1( )()

CA Axx

=−−

1( )()

DBBxx

=−−

3. FUZZY c-MEANS WITH CUBIC

SPLINE (FCMS)

This integrated clustering algorithm includes two steps:

the first one is modeling gene expression data by cubic

spline, by which noise and variation can be filtered and

the clustering algorithm can be more effective. The se-

cond step is to par tition the data based on the spline, due

to the continuity of the curve, we introduce a geometry

term, radius of curvature, to extract the feature of the

curve.

3.1. Modeling Gene Expression with Cubic

Spline

Due to biological and experimental factors, noises are

intensive in gene expression measurements [14,15].

However, the FCM is sensitive to the noise, which caus-

es improper positioning of the prototypes as the noise

“attract” the prototypes. Theref ore, the effective way is

denoising and recovering the original values of the data.

Many denoising methods are borrowed from signal sys-

tem and image processing to deal with noise, such as low

pass filtering [16], wavelet denoising [17], etc. these me-

Y. Wang et al. / Journal of Biosciences and Medicines 1 (2013) 16-21

thods only focus on the signal characteristic of the noise

without consideration of time series attributes, therefore

no significant results are reported so far. According to

Dejean et al. [18], cubic spline can represent time series

gene expression appropriately. However, interpolating

cubic splines to time series expression data may inadver-

tently attribute significance to measurements dominated

by noise due to over-fitting.

To infer meaningful gene expression trends over time,

we wish to fit natural cubic splines to expression data in

a smooth fashion. Define each gene expression.

()

ij ij

y ft

= +

(8)

where

denotes the observation for the i th gene at

time

, f is a continuous and differentiable function,

and

are independent and identically distributed ran-

dom variables satisfying classical assu mption s

()0, ()

ij ij

E Var

ε εσ

= =

(9)

A standard curve fitting process is to minimize the resi-

dual sum of squares (RSS) in Figure 1:

RSS=(( ))

y ft

−

∑

(10)

Simultaneously, in order to avoid over-fitting (the

curve passes through all the data points), parameterizing

f(t) by a set of pre-specified basis functions. A more

flexible strategy is to impose a smoothness condition,

which is also scientifically desirable here. Here, we

adopt a standard constraint used in the statistics litera-

ture, i.e.

() dft t

′′ <

∫

(11)

where

is a specific constant. We seek a cubic

smoothing spline

()

for each gene (18), which shall

be both reasonably smooth and also reasonably close to

its observation value

. As a standard practice for

spline smoothing, a cubic smoothing spline (Figure 1)

can be found by minimizing the following combined

function

=(())(1)() d

Lyftf tt

λλ

′′

− +−

∑∫

(12)

The first term of Eq.12 is residual sum of squares,

which quantifies the closeness to gene expression data

points, and the second term, is the integrated squared

second derivative, which quantifies the smoothness of

the fitted spline. The smoothing parameter,

[0,1]

∈

, is

used to control the trade-off between the above two con-

tradictory criteria. If setting

= 0, a straight line is

generated from an ordinary linear least-squares regres-

sion. If setting

= 1, it leads to a cubic interpolating

spline by passing through all values. The selection of

can be fou nd in [18].

Figure 1. Cubic spline modeling gene expression profiles.

3.2. Similarity Metric

The similarity metric is generally required in every clus-

tering method, which has a crucial influence on the clus-

tering result. Conventional choices are Euclidean dis-

tance, Pearson correlation [19]. However, both of the two

metrics cannot be applied in continuous vectors. Moreo-

ver, Euclidean computed the magnitude change by neg-

lecting the meaningful shape information. Pearson cor-

relation coefficient is not capable of uncover nonlinear

pattern. In this p aper, we propose a new similarity metric

to calculate the similarity between gene profiles. After

cubic spline smoothing, gene expression profiles are

transformed from discrete values to continuous curves.

Following [15], the discrete values of radius of curvature

can be considered as one important feature vector for the

curve. Similarity between two curves is calculated by

normalizing the dot product of the vectors.

For curves, the radius of curvature at a given point is

the radius of a circle that mathematically best fits the

curve at that point. It can be seen from Figure 2 that the

radiuses of the two cycles represent the radius of curva-

ture at the two different points in the curve. However, a

cubic spline curve sometimes contains inflection points.

At these points, curvature value becomes zero and the

radius of curvature value becomes infinity. Therefore,

curvature is used for computation instead of the radius of

curvature

In geometry, the radius of curvature is the inverse of

the curvature. In the case of a plane curve, the radius of

curvatu re can be computed by [20],

2 32

(1 )y

′

=′′

(13)

where

′=

′′ =

To adjust the various sizes of the curves, the total

Y. Wang et al. / Journal of Biosciences and Medicines 1 (2013) 16-21

Figure 2. Radius of curvature of a curve.

length of radius of curvature is rescaled to 1. The radius

of curvature must be calculated according to the knot

sequence of the knot vector. Similarity between curve A

and curve B is evaluated by normalizing the dot product

of the vectors.

Sab

=

(14)

By definition, the curvature of a curve is nonnegative,

which limits its application to gene expression similarity

metric. Figure 3(a) shows an example of two curves

having the same radius of curvature and the same coun-

ter-clockwise direction while F i gu re 3 (b) shows that two

curves have the same radius of curvature and the same

clockwise direction. However, the two curves have to-

tally different expression trend. In many cases it is useful

to ascribe a sign to the curve. The choice of the sign is

usually connected with the tang ent rotation the curvature

of the curve is positive when its tangent rotates counter-

clockwise. The curvature of the curve is negative when

its tangent rotates clockwise. However, this simple use of

the direction cannot capture the gene profiles similarity.

We modify the radius of curvature by adding a sign func-

tion of the first derivative of the curve.

2 32

(1 )

sgn( )y

′

=′′



(15 )

The Algorithm for the simulations is given below

Step 1: construct the cubic spline to modeling the time series gen e

expression data according to Eq.12.

Step 2: select N points of the cubic spline and calculate the radius

of curvature in these points by Eq.13.

Step 3: Compute the similarity according to equation Eq.14.

Step 4: Run Fuzzy c-means Clustering algorithms based on the new

similarity metric.

Step 5: Evaluate the result by validity measures.

4. DATA, EXPERIMENTS AND RESULT

4.1. Data

The complete yeast gene expression profiles include

6200 genes measured every 10 min during two cell

cycles in 17 hybridization experiments. Cho et al. [14]

(a)

(b)

Figure 3. Comparison of the directions of curves.

selected 384 genes whose expression levels peak at dif-

ferent time points corresponding to the five phases (G1,

S, G2/M, M/G1 and S/G2) of cell cycle. Further, Yeung

et al. [21] extract 237 genes from the yeast cell cycle

data which correspond to four categories: DNA synthesis

and replication, organization of centrosome, nitrogen and

sulphur metabolism, and ribosomal proteins

4.2. Results

Adjusted Rand index (ARI) is a measure of agreement

between two partitions: one is the clustering result and

the other is the standard partition. The value of ARI va-

ries from 0 to 1 and higher value means that the cluster-

ing result is more similar to the standard partitions. Sup -

pose T is the true clustering of a gene expression data set

based on domain knowledge and C a clustering result

given by some clustering algorithm. Let a denote the

number of gene pairs belonging to the same cluster in

both T and C, b is the number of pairs belonging to the

same cluster in T but to different clusters in C, c is the

number of pairs belonging to different clusters in T but to

the same cluster in C and d is the number of pairs be-

longing to different clusters in both T and C.

2(- )

ARI( ,)(+)(+)+(+)(+)

ad bc

TC abbd accd

(16)

Silhouette width index (SWI) is a measur e of tightness

and separation of clusters, which is used to assess the

level of statistical significance of cluster s. The Silhouette

width for the

sample in cluster

is de fined as,

{ }

() ()

() max(), ()

bi ai

si ai bi

−

(17)

Y. Wang et al. / Journal of Biosciences and Medicines 1 (2013) 16-21

where

()ai

is the average distance between the

sample and all of the samples included in

, and

()bi

is the minimum average distance between the

sam-

ple and all of the sample and all of the samples clustered

(

1,...., ;kck j= ≠

). When

()si

is close to 1,

one may infer that the

sample has been well clustered.

Thus, for a given cluster

(

1,....,jc=

), it is possible

to calculate a cluster Silhouette

, which characterizes

the heterogeneity and isolation properties of such a clus-

ter,

1()

S si

∑

(18)

Before experiments, the data was log2 transformed to

make symmetry between negative and positive fold

change and normalized to obtain a mean expression val-

ue of one for each gene. This ensures that genes which

share the same expression pattern have similar gene ex-

pression vectors.

= 0.8. Each algorithm run 10 times

with randomly initialization, results are obtained by the

averages value.

It can be seen from Table 1 that the clustering results

by FCMS outperforms FCM on ARI and SWI. This not

only indicates that clusters generated by FCMS are better

in intra compactness and in ter separateness, but also illu-

strates that clusters include more biological significance.

Heatmap [22] is used to graphically represent multi-

dimensional gene expression data which have been sub-

jected to clustering algor ithms. Fi gu re 4 shows the qual-

ity of clusters of Yeast 2945 [23]. It can be seen in Fig-

ure 4 that the FCMS shows better separated and homo-

geneous clusters than FCM.

5. CONCLUSION

Conventional partition clustering methods are frequently

used for gene expression analysis without consideration

of the noise and variations in expression that do not ﬁt

into any global pattern. In this paper, we present an inte-

grated fuzzy clustering approach, FCMS, uses spline

estimation to represent gene time-series expression pro-

files as continuous curves, by which noise can be filtered

and meaningful data will be preserved. Similarity is an

crucial factor for clustering, we introduce a new geome-

try term of radius of curvature, which can capture the

similarity between curves. Results demonstrate that our

Table 1. Comparison of FCM and FCMS.

Clusters

number ARI SWI

Yeast 384 FCM 5 0.5563 0.3545

FCMS 5 0.5681 0.4012

Yeast 237 FCM 4 0.4658 0.3242

FCMS 4 0.4821 0.3869

(a) (b)

Figure 4. Cluster structure plot generated by

GEDAS [22]. (a) Cluste r structure of FCM; (b)

Cluster structure of FCMS.

clustering method has substantial advantages over FCM

for time-series gene expression data.

6. ACKNOWLEDGEMENTS

This work was partly supported by the European project FP7-Marie

Curie Actions-IRSES-MARSIQEL.

REFERENCES

[1] Belacel, N., Wang, Q. and Cuperlovic-Culf, M. (2006)

Clustering methods for microarray gene expression data.

Omics: A Journal of Integrative Biology, 10, 507-531.

[2] Tsai, T.H., Milhorn, D.M. and Huang, S.K. (2006) Mi-

croarray and gene-clustering analysis. Methods in Mole-

cular Biology, 315, 165-174.

Y. Wang et al. / Journal of Biosciences and Medicines 1 (2013) 16-21

[3] Tang, R. and Muller, H.G. (2009) Time-synchronized

clustering of gene expression trajectories. Biostatistics,

10, 32-45.

http://dx.doi.org/10.1093/biostatistics/kxn011

[4] Song, J.J., Lee, H.J., Morris, J.S. and Kang, S. (2007)

Clustering of time-course gene expression data using

functional data analysis. Computational Biology and Che-

mistry, 31, 265-274.

http://dx.doi.org/10.1016/j.compbiolchem.2007.05.006

[5] Bar-Joseph, Z. (2004) Analyzing time series gene expres-

sion data. Bioinformatics 20, 2493-2503.

http://dx.doi.org/10.1093/bioinformatics/bth283

[6] Bar-Joseph, Z., Gerber, G.K., Gifford, D.K., Jaakkola,

T.S. and Simon, I. (2003) Continuous representations of

time-series gene expression data. Journal of Computa-

tional Biology, 10, 341-356.

http://dx.doi.org/10.1089/10665270360688057

[7] Luan, Y.H. and Li, H.Z. (2003) Clustering of time-course

gene expression data using a mixed-effects model with

B-splines. Bioinformatics, 19, 474-482.

http://dx.doi.org/10.1093/bioinformatics/btg014

[8] Gasch, A.P. and Eisen, M.B. (2002) Exploring the condi-

tional coregulation of yeast gene expression through

fuzzy k-means clustering. Genome Biology, 3, RE-

SEARCH0059.

[9] Boratyn, G.M., Datta, S. and Datta, S. (2006) Biological-

ly supervised hierarchical clustering algorithms for gene

expression data. 28th Annual International Conference of

the IEEE Engineering in Medicine and Biology Society

(EMBS’06), New York, 30 August-3 September 2006,

5515-5518.

[10] Wang, J., Delabie, J., Aasheim, H., Smeland, E. and

Myklebost, O. (2002) Clustering of the SOM easily re-

veals distinct gene expression patterns: Results of a rea-

nalysis of lymphoma study. BMC Bioinformatics, 3, 36.

http://dx.doi.org/10.1093/bioinformatics/btg014

[11] Zhang, M., Adamu, B., Lin, C.C. and Yang, P. (2011)

Gene expression analysis with integrated fuzzy C-means

and pathway analysis. Annual International Conference

of the IEEE Engineering in Medicine and Biology Society

(EMBC 2011), Boston, 30 August-3 September 2011,

936-939.

[12] Dembele, D. and Kastner, P. (2003) Fuzzy C-means me-

thod for clustering microarray data. Bioinformatics, 19,

973-980. http://dx.doi.org/10.1093/bioinformatics/btg119

[13] Du, P., Gong, J., Syrkin Wurtele, E. and Dickerson, J.A.

(2005) Modeling gene expression networks using fuzzy

logic. IEEE Transactions on Systems, Man, and Cyber-

netics, Part B: Cybernetics, 35, 1351-1359.

http://dx.doi.org/10.1109/TSMCB.2005.855590

[14] Yeung, K.Y., Fraley, C., Murua, A., Raftery, A.E. and

Ruzzo, W.L. (2001) Model-based clustering and data

transformations for gene expression data. Bioinformatics,

17, 977-987.

http://dx.doi.org/10.1093/bioinformatics/17.10.977

[15] Yang, C.M., Wan, B.K. and Gao, X.F. (2003) Data pre-

processing in cluster analysis of gene expression. Chinese

Physics Letters, 20, 774-777.

http://dx.doi.org/10.1093/bioinformatics/17.10.977

[16] Futschik, M.E. and Carlisle, B. (2005) Noise-robust soft

clustering of gene expression time-course data. Journal of

Bioinformatics and Computational Biology, 3, 965-988.

http://dx.doi.org/10.1093/bioinformatics/17.10.977

[17] Hu, X., Yoo, I., Zhang, X., Nanavati, P. and Debjit, D.

(2005) Wavelet transformation and cluster ensemble for

gene expression analysis. International Journal of Bioin-

formatics Research and Applications, 1, 447-460.

http://dx.doi.org/10.1504/IJBRA.2005.008447

[18] Dejean, S., Martin, P.G., Baccini, A. and Besse, P. (2007)

Clustering time-series gene expression data using

smoothing spline derivatives. EURASIP Journal on Bio-

informatics & Systems Biology, 2007, Article ID: 70561.

[19] Yin, L., Huang, C.H. and Ni, J. (2006) Clustering of gene

expression data: Performance and similarity analysis.

BMC Bioinformatics, 7, S19.

http://dx.doi.org/10.1186/1471-2105-7-S4-S19

[20] Kuragano, T. and Kasono, K. (2008) Curve generation

and modification based on radius of curvature smoothing.

Proceedings of the 10th WSEAS International Conference

on Mathematical and Computational Methods in Science

and Engineering (MACMESE’08), Bucharest, 7-9 No-

vember 2008, 80-87.

[21] Yeung, K.Y., Fraley, C., Murua, A., Raftery, A.E. and

Ruzzo, W.L. (2001) Model-based clustering and data

transformations for gene expression data. Bioinformatics,

17, 977-987.

http://dx.doi.org/10.1093/bioinformatics/17.10.977

[22] Fu, L. and Medico, E. (2007) FLAME, a novel fuzzy

clustering method for the analysis of DNA microarray

data. BMC Bioinformatics, 8, 3.

http://dx.doi.org/10.1038/10343

[23] Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J. and

Church, G.M. (1999) Systematic determination of genetic

network architecture. Nature genetics, 22, 281-285.

http://dx.doi.org/10.1038/10343