A Projection Clustering Technique Based on Projection

doi:10.4236/jssm.2009.24043

Paper Menu >>

Journal Menu >>

J. Service Science & Management, 2009, 2: 362-367

doi:10.4236/jssm.2009.24043 Published Online December 2009 (www.SciRP.org/journal/jssm)

A Projection Clustering Technique Based on

Projection

Xiyu LIU1, Xinjiang XIE 2, Wenping WANG1

1School of Management and Economics, Shandong Normal University, Jinan, China; 2Department of Computer Science and Tech-

nology, Shandong Agricultural Administrators College, Jinan, China.

Email: xyliu@sdnu.edu.cn, wenping216@126.com, xjxie@163.com

Received July 24, 2009; revised September 10, 2009; accepted October 20, 2009.

ABSTRACT

Projection clustering is an important cluster problem. Although there are extensive studies with proposed algorithms

and applications , one of the basic computing arch itectures is that they are all at the level of data objects. The purp ose

of this paper is to propose a new clustering technique based on grid architecture. Our new technique integrates mini-

mum spanning tree and grid clustering together. By this integration of projection clustering with grid technique, the

complexity of compu ting is lowered to .

O(NlogN)

Keywords: Cluster Analysis, Projection Clustering, Grid Clustering

1. Introduction

Cluster analysis is as an important area in data mining

which can explore the hidden structures of business da-

tabases [1]. Traditionally, cluster analysis can be catego-

rized as three classes. Partitioning method works by con-

structing various partitions and then evaluating them by

some criterion. Hierarchy method creates a hierarchical

decomposition of the set of data (or objects) using some

criterion. Density-based method is based on connectivity

and density functions. Grid-based method is based on a

multiple-level granularity structure. Model-based method

is to construct a model and to find the best fit model.

Along these lines, many techniques and algorithms

have been proposed in the literature. For example, Ester

et al. [2] present the density-based clustering algorithm

which uses an Eps-neighborhood for a point containing

at least a minimum number of points. Raphael Bar-Or

and Christiaan van Woudenberg [3,4] present a grav-

ity-based clustering method intended to find clusters of

data in n-space. The most classical clustering technique

is due to Raymond T. Ng and Jiawei Han [5] who devel-

oped a CLARANS which aims to use randomized search

to facilitate the clustering of a large number of objects.

More recent work include agglomerative fuzzy K-Means

clustering algorithm by introducing a penalty term to the

objective function to make the clustering process insensi-

tive to the initial cluster centers [6].

Among all these clustering techniques, one of the basic

measurements is the Euclidean distance. It requires simi-

lar objects to have close values in all dimensions. When

similarity between objects in high dimensional space is

absent, this kind of technique is often invalid. To solve

this problem, dimension reduction and manifold learning

is applied [7–9]. Another method for this skewed data is

the projection clustering [10]. The main idea of projected

clustering is that different clusters may distribute along

part of the dimensions. A projected cluster is a subset of

data points, together with a subspace of dimensions, so

that the points are closely clustered in the subspace.

Different with the above clustering approaches, graph

clustering works by transfo rming the initial working d ata

into a kind of graph. Then graph clustering techniques

can be app lied to ob tain the final clustering. One of these

techniques is the Minimum Spanning Tree (MST) based

clustering. A lthough the f irst MST-b ased clustering algo-

rithms have been studied for many years, due to its com-

putational efficiency for large databases, it attracts new

researches frequently. In a more recent work [11], the

authors present a more efficient method based on the

divide and conquer approach that can quickly identify the

longest edges in an MST so as to save some computa-

tions. The experimental results show that their MST in-

spired clustering algorithm is very effective and stable

when applied to various clustering problems. The authors

also expect that their algorithms have a

computing time. (log)ON N

In this paper, we propose a new projection clustering

technique by Minimum Spanning Tree based on the grid

A Projection Clustering Technique Based on Projection363

clustering approach. Basically, our MST-inspired clus-

tering technique works on cells rather than data points

directly. This will significantly reduce the size of graph

and MST. Due to this reason, our technique has no spe-

cific requirements on the dimensionality of the data sets.

This is different from some typical projection clustering

algorithms [10].

The rest of this paper is organized as follows: Section

2 presents the basic idea of projection clustering. In Sec-

tion 3, we summarize some basics on MST-based clus-

tering techniques. In Section 4, we propose a flexible grid

clustering approach. Clustering is also discussed as an

optimization process in this section. Section 5 contains a

brief clusteri ng behavior and tim e complexity analysis.

2. Projection Cell Clustering

Suppose the raw data set is 1

{, ,}

xx

(,, )

. Each point

has n components by

xx. It is contained in a

rectangle D0 in Rn. Generally, we will not cluster the

original data set X. Instead, we will consider a derived

data set X composed of data cells. However, after we

transforme the data set into cells, each cell can be considered

as a new data point represented by its center. The nu mber

of data points in the cell is called the mass of the cell.

More precisely, a cell (grid) is a polyhedron as a com-

plex x = (D, norm, c, b, p), where D is the polyhedron,

norm is a unit vector indicating normal vector of one of

its edges, c is the center, b is a boolean value indicating

whether the grid is dense or sparse, and p is the number

of data points covered by the grid.

In order to simplify symbols, we will use as cell

data object and xX

[]

X as the data points defined by x

in the or iginal data space X. For two objects x, y, the dis-

tance is defined as the minimum distance of the two cells

[], []

(, )min(,)

pxqy

ydpq





 (1)

The diameter of a data object is measurement of its

size

,[]

()max(,)

2pqx





 (2)

Let N(x) be the set of k-nearest neighbors (KNN) of x

including itself, then the number of object in N(x) is k+1.

The sparseness or thickness of the data object can be

measured by the relative location of its k-nearest

neighbors

()

() (,)

zNx





 (3)

Suppose there is a mass function defined on by

; (total number of points).

Then we define the density of a data point as

:mR



X() #[]mxx p

()

() ()

()

xNx





 (4)

Suppose



. We use i



to denote the pr ojection

operator in the i-th component, i.e., () i





Respectively, (,dx ) ()

y dy,x





. For ,xy



X, de-

fine {:

ii[]}

zz x





(,)

, ( ,)

yxy





, and

() (







. Then we consider projection into the

i-th component, the KNN neighbor set N(x) is replaced

by 1

{ ,

xy()Nx ,,: ii





 are the k-nearest points

to i



}. The corresponding definition of sparseness and

density are

()

() (,)

zNx





,

()

() ()

() i

iixN x





 (5)

Now we describe the process of projected clustering.

The main idea is that distance between data objects is

restricted to subsets of dimensions where object values

are dense [10]. This means that we only consider contri-

butions of relevant dimensions when computing the dis-

tance between data point and the cluster center.

Different from [10], we use a fixed threshold value to

determine the dense and sparse dimensions. Now let



X to be a cell data object. Suppose 00



 is a

positive threshold determined in the process of griding

process. Define a matrix []

ijN n







 by1

1,if( );1,,

0, else



















n

(6)

By this index matrix we obtain a projected cell dis-

tance as follows

(, )((, ))

xj yjj

xy xy





















 (7)

3. Minimum Spanning Trees

Let G=(V, E) be a connected, undirected edge-weighted

graph with N nodes as before. W = [wij] is the weight

matrix. For any , we use Q = G-P to denote the

subgraph generated by the vertices V\P called a partition

of nodes. A spanning subgraph is a subgraph that contains

all the vertices of the original graph. A minimal spanning

tree of graph G is a spanning graph with no circuits

whose weight is minimum among all spanning t rees of G.

PV

For a partition P, Q of G, define (,)PQ



as the

smallest weight among all edges from the cut-set C (P,

Q), which is the set of edges connecting P and Q. A link

is any edge in C(P, Q) with weight (,)PQ



. The link

set is denoted by (,)PQ



[12].

1Notice that the size of this matrix maybe less than N since we are using

cells instead of data points.

A Projection Clustering Technique Based on Projection

364

There are several ways to build Minimum Spanning

Tree (MST) from the graph [11]. Two popular ways to

implement the algorithms are the agglomerative and the

divisive procedures.

The well-known agglomerative procedures are the

Kruskal and Prim’s algorithms. The first one works by

constructing the tree from initial N isolated vertices of

the original graph. All the edge s are so rted into a non-d e-

creasing order by their weights. For each edge which is

not in the tree, if this edge does not form a cycle with the

current tree, then we can add this edge to the tree. In the

Prim’s algorithm, the tree construction starts with a root

node. At each step, among all the edges between the

nodes in the tree T and those which are not in the tree yet,

the node and the edge associated with the smallest weight

to the tree T are added.

The second kind of algorithm is the divisive one called

the reverse delete algorithm starting with the full graph.

Edges are deleted in order of nonincreasing weights bas-

ed on the cycle property as long as keeping the connec-

tivity of the graph.

Some well-known properties of MST are summarized

in the following theorem [12].

Theorem 3.1. The minimum spanning tree T(G) of a

graph G has the following properties.

1) T contains at least one edge from (,)PQ



for each

partition P, Q.

2) Each edge of T is a link of some partition of G.

3) Let (C1,C2) be a partitio n of G. If 12

(,)( ,)PQC C





for each partition (P, Q) of C1, then T (C1) forms a con-

nected subtree of T (G).

Once we have the MST, we can obtain the desired

clustering by removing inconsistent edges of MST. The

simplest way to define inconsistent edges is using weight

measure ratio of the edge with average weight of nearby

edges in the tree [12]. If the ratio is larger than a thresh-

old, then it is incon sistent. We can determine a stop crite-

ria by the number of clusters, or a minimum size of any

cluster by removing edges which can result in two clusters

whose sizes are larger than the minimum cluster size.

If we know the number of clusters k, then clustering

can start by removing k-1 arbitrary edges from the tree,

creating a k-partition. Then we can minimize the change

of the total weight of the current clusters to obtain the

final clustering.

To reduce comp utation co mplex ity, Xiaochun Wang et

al. proposed a divide and conquer approach [11]. Given a

loose estimate of minimum and maximum numbers of

data items in each cluster, they propose an iterative ap-

proach for MST clustering algorithm in five steps: 1) Sta rt

with a spanning tree built by the sequential initialization

(SI). 2) Calculate the mean and the standard deviation of

the edge weights in the current distance array and use

their sum as the threshold. Partially refine the spanning

tree by running Divisive Hierarchical Clustering Algo-

rithm (DHCA) multiple times until the percentage

threshold difference between two consecutively updated

distance arrays is below 6



. 3) Identify and verify the

longest edge candidates by running MDHCA until two

consecutive longest edge distances converge to the same

value at the same places. 4) Remove this longest edge. 5)

If the number of clusters in the data set is preset or if the

difference between two consecutively removed longest

edges has a percentage decrement larger than 50 percent

of the previous one, we sto p. Ot herwise go to Step 3.

However, if the graph size is not large, we can directly

get clustering from the graph. When we use the flexible

grids technique to obtain the graph, this is often th e case.

Anyway, the technique of [11] can be applied to further

reduce the computing time.

4. Grid Based Spatial Clustering

The grid based clustering uses a multi-resolution grid

structure which contains the data objects and acts as op-

erands of clustering performance [1]. For example, the

authors [13] propose a gravity based grid which ap-

proximates the cell influence by gravity centers. The au-

thors claim that the proposed technique can reduce

memory usage and simplify computational complexity

with minor loses of the clustering accuracy.

Traditional grids are regular hypercubic grid. This re-

quires the grid construction cover all the data space with

the same precision. The second method uses flexible

grids, i.e. multi-resolution grids with hypercubic or hy-

per-rectangular cells having randomly oriented borders

[14]. The main clustering technique is a tree based

searching with a similarity measure composed of both

the density and distance differences [15].

Suppose the data set is 1

[, ,}n

xxR



. It co-

ntains in a rectangle in . A grid is a graph G

where each node is a complex v = ( D, norm, c, is-

Crowded, p), where D is the polyhedra, norm is a unit

vector indicating normal vector of previous cutting plane,

c is a point which lies in the grid acting as its center, is-

Crowded is 1 or 0 indicating whether the grid is highly

populated or not, and p is the number of da ta points cov-

ered by the grid. The initial grid is D0 with an arbitrary

normal vector. In each step, we can define the center of

the grid as its geometrical center.

For two nodes (

vD , ,,,)

ii i

norm c isCrowdedpi



, 1, 2i



xists a connecting edge between them. The weight

value on this edge is defined as

there e

(8)

12 12

0,if min{,}0

(, )(,), else

DD vv













A Projection Clustering Technique Based on Projection

365

The graph is constructed in an iterative way. We start

with an initial candidate node

where D is the origi

graph into clusters. A commonly used technique to deal

with this problem is the hierarchical clustering [1]. Ag-

glomerative methods iteratively connect vertices that are

close to each other with edges. Small clusters are merged

with each other building up the hierarchical structure to

find the desired larger clustering structure. Divisive

methods on the contrary are based on network flows.

This is done by iteratively identifying the intercluster

edges in a top-down approach.

(,vD

nal rectangle, norm0

p is the total popula-

,,normcis

,)Crowded p

tion number. T

is a random selected unit vector, c is the geometrical

center of D0, is Crowded = 1, and

hen at each step, the cell containing more

number of points (controlled by a threshold value p



, or

larger enough by diameter) controlled by another thresh-

old value d



, is split into two subcells by a hyperplane

which is orthogonal to the current normal vector.si-

tion of the hyperplane is random. A cell is called

crowded if its population is larg er than p



. Otherwise it

is called sparse. If we reach a sparse cell, then add this

cell to the node set of the graph. If we reach a cell with

diameter less than d



, then add this celle node set.

This step continues until each cell has a population

less than p

to th



, or its diameter is smaller than d



. Table 1

gives the algorithmr the graph construction process.

Once we have completed the graph construction, those

nodes in the graph which are not crowded will corre-

spond to vacant area or outliers. Therefore, in order to

reduce computing complexity, we first remove all sparse

graph nodes with corresponding edges. The resulting

graph is G = (V, E) where V is the set of vertices, E the

set of weighted edges. An example is shown in Figure 1

with part of its edge s.

Now we use C(X) = {Xq: q = 1, 2,…k} to denote a

clustering of the data set X where O is the set o f outliers.

Then

ithph e

By this algorithm, we can generate a hierarchical grid

()CCX





together w a resulting graph. When the gra is gener-

ated, the clustering will become grouping nodes of th

s construction algorithm

Algorithm: Construction of flexible grids

(9)

Table 1. Flexible grid

Inputs

1, ,}

x x dataset of N points in n

{X

D: hyper-rectangle co n ta in in g X



: population threshold value.



: cell diameter threshold value

Outputs

{, , }

Vv v set of vertices

Begin

00000

(0){(,, ,1,)}VvDncp . Let t = 0.

while

for each v = (D, n, c, isCrowded, p)

()Vt



V(t)

Generate a cutting hyperplane L passing c and with normal

vector parallel to n. Cut the current cell v into two subcells

,vv. For each new node, if p < p



or diam(D) < d



add this new cell to the node set V. Else add it to V(t + 1).

Let the new norms be orthogonal to n.

end

t + +;

end

End

A Projection Clustering Technique Based on Projection

366

Figure 1. An example clustering area three clusters

For given population threshold value and diameter

threshold value, we can generate the flexible grid and

obtain a graph. Then we can generate a minimum span-

ning tree. Then we can get the final clustering. Figure 1

shows an example of minimum spanning tree corre-

sponding to the data set in Figure 1.

Consequently, for fixed

with



and p



, the data set X is

clustered as (, ,

CX )



. We define||

as e

number of cells in Xq. Define the energy of clustering as

the sum of intra-cluster variation measure and in-

ter-cluster distance as

(,,)

|| ()()

min (,)

ij q

qvv qi

vv X

kXdiam Ddiam D





















(10)

By this grid based method, the final clustering can be

treated without direct computation to the data points and

reduce the number of nodes significantly. The only p-

ral

ne way to choose these two parameters is to optimize

the energy of clustering.

5. A Performance Study

e twe want to cluster a data set with

N objects. By the graph construction al the

ious section, the data

meters we need to determine beforehand is the cel

wded threshold value and the minimum cell diameter. cro

Supposhat n

R

gorithm in



set is split inier-

chy. A minimum spanning tree is constructed associ-

ated with the cell grap h.

make things simpler, we will assume that the cut-

ting planes are perpendicular to one of the axis in this

section. Moreover, we assume that each cutting plane

through the geometrical center of the current cell.

Therefore, all the cells are rectangles. Then we can easily

count the total number of nodes of the graph.

se the original data cube D0 has a diameter

to a cell h

passes

Suppo 0



and the initial popu lation threshold is p



. For some

large number M, let

suf-

. L etficiently 0

2,1,2,,

iiM







be another decreasing

priate two sequences, we c

For pecific populion

rameters ,

sequence. Bhoosing apy cpro-

an optimize the clustering.

sat and diameter threshold pa-





, let the induced graph be G(,





) with

ng tree T(minimum spanni,





). A cluste of T is

denoted b

ring

y C(,





cide

) is a set disjoint subtree whose

node with thriginal truse

e o

ee. We s set coin

(,)





C to denote the clustof the data

Evidently we have the fo llowing properties.

ering set.

Theorem 5.1. Clusters have two properties.

1) Anti-joint property. If 12



, then two disjoint

clusters in 1

(, )





C are disjoint in 2

(, )





2) Monotonicity property. If 12



, then a cluster in

(,)





C cannot be disintegrated in 2

(,)





ulatNow we assume the popion throld esh



is a con-

stant which determines thgranularity of the problem.

Therefore we usee

()



C tog denote the clus. Let us

split the energy into two parts, the intra-cluster energy

()

terin



and the intercluster energy ()



as follows

(,)

|| ()()

ij q

qvv qi

vv X





diamDdiamD







 (11)

(,)min (,)







 (12)

Theorem 5.2. Inter-cluster energy is monotone. That is

to say, if 12



, then 12

(, )(,)

ie ie

EX EX



.

Proof: It is clear to see that when 12



, a cell

maybe split into small cells. Therefo re, either the set i

will be smaller which means that the distance

(, )





 will become larger.

In a recent work [11] the authors propose a divide and

A Projection Clustering Technique Based on Projection

367

g of two phases. The

tial initialization and the

se uses

conquer based algorithm consistin

first phase includes the sequen

spanning tree updating, and the second pha some

technique to locate the longest edges and partitions the

obtained approximate minum spanning tree to form

sensible clusters. The authors expect the first phase has

(log)OfNNwhere

is constant. The average time

complexity of the second phase is (log)OeNN where e

is constant. Therefore their expectation of time complex-

ity is(log)ONN . However, our algorithm do provide a

time complexity of (log)ONN .

Theorem 5.3. Suppose thresholds are p





and





. Then the time complexity of the flexible gs

construction algorithm is (log)ONN .

Proof: At each stage, if a,d

) is sparse, i.e., isCrowded =0, then the cell is a node

in the graph. Otherwise, a cutting hyperplane L passing c

rid

cell

ll v

(,,,vD n c isCrowde

and with normal vector parallel to n is generated. The

current ce is cut into two subcells 12

,vv r each

new node, if p

. Fo



 or () d

d Diam



, add this new

cell to the node.

In this process, the computation of isCrowded and p

e a time of ()

ON where i

N is the nu

set V

both havmber of

ceta points in the new cell. Ideally, the plane L passes the

nter of the cell v. Hence /2



for i = 1; 2. If not

so xpect th. Let thotal time

co we

2T that T(

d arc

structio re rela-

cated. The main ingredient here is the application of grid

clustering to pr oject i o n cl ust er i ng.

Apparently, this research will lead to efficie

rithms. In future work, we will give experimental study

on the new technique. This will be lengthy, for the clus-

tering is essentially an optimization process. The best

population threshold

, the plane is randomly placed by a uniform distribution

which we ee same propertye t

mplexity be T(N). Thenhave T(N) =

(N=2)+O(N). Hence we knowN) =O(N log N).

Conclusions

In this paper we present a new projection clustering tech-

nique based on grihitecture and minimum spanning

tree. The effective using of minimum spanning tree can

possibly reduce computing complexity although the con-

n of the graph and the tree atively compli

nt algo-



is to be determin

timizes the clustering energy presented in Section 3. For

th , oith an application.

ar”

Research is e Natural Science

Foundation of Che Natural Science

techniqu d

er, and X. Xu

, OR, pp. 226.

[3] novel

me 1.

[4] enberg, “Gene expression

, Septemctober 2002.

. M. Cheung, and J. Huang, “Agglom-

mixed typ

09, 2005.

Y. Zha, “Principal manifolds and

ation approach

ing algorithms,” Journal of

pp. 1267–1274, 2003.

-based clustering,” J. N. Kok et

the Science and Technology Project of Shandong Educa-

tion Bureau.

REFERENCES

[1] J. W. Han and M. Kamber, “Data mining concepts and

es,” 2nd Eition, Elsevier, Singapore, 2006.

[2] M. Ester, H. P. Kriegel, J. Sand, “A den-

sity-based algorithm for discovering clusters in large spa-

tial databases with noise,” in Proceedings 2nd Interna-

tional Conference on Knowledge Discovery and Data

Mining, Portland–231, 1996

R. Bar-Or and C. van Woudenberg, “Agrav-

ity-based clustering method, technical report,” Depart-

nt of Applied Mathematics, University of Colorado,

Denver, CO 80202, 200

R. Bar-Or and C. van Woud

analysis with a novel gravity-based clustering method,”

pp.1–46, December 2001.

[5] R. T. Ng and J. Han, “Clarans: A method for clustering

objects for spatial data mining,” IEEE Trans. on Knowl-

edge and Data Engineering, Vol. 14, No. 5, pp.

1003–1016 ber/O

[6] M. Li, M. K. Ng, Y

erative fuzzy K-Means clustering algorithm with selection

of number of clusters,” IEEE Trans on Knowledge and

Data Engineering, Vol. 20, No. 11, pp. 1519–1534, 2008.

[7] Z. He, X. Xu, and S. Deng, “Scalable algorithms for clus-

tering large datasets withe attributes,” Interna-

tional Journal of Intelligent Systems, Vol. 20, pp.

1077–1089, 2005.

[8] Z. F. He and F. L. Xiong, “A constrained partition model

and K-Means algorithm,” Journal of Software, Vol.16,

No.5, pp. 799–8

[9] Z. Y. Zhang and H. nonlinear dimensionality reduction via tangent space

alignment,” SIAM Journal of Scientific Computing, Vol.

26, No. 1, pp. 313–338, 2004.

[10] M. Bouguessa and S. R. Wang, “Mining projected clus-

ters in high-dimensional spaces,” IEEE Transaction on

Knowledge and Data Engineering, Vol. 21, No. 4, pp.

507–522, 2009.

[11] X. C. Wang, X. L. Wang, and D. M. Wilkes, “A di-

vide-and-conquer approach for minimum spanning tree-

based clustering,” IEEE Trans on Knowledge and Data

Engineering, Vol. 21, No. 7, pp. 945–958, 2009.

[12] C. T. Zahn, “Graph-theoretical methods for detecting and

describing gestalt clusters,” IEEE Trans. Computers, Vol.

20, No. 1, pp. 68–86, January 1971.

[13] C. H. Li and Z. H. Sun, “A mean approxim

ed which op-

is reason, we will present this part of the research in

another papertgether wto a class of grid-based cluster

Software, Vol. 14, No. 7,

Acknowledgments

This project is carried out under the “Taishan Schol

project of Shandong, China.

also supported by th

ina (No.6087305 8), th

[14] Marc-Ismael, Akodj`enou-Jeannin, Kav´e Salamatian, and

P. Gallinari, “Flexible grid

al. (Eds.), LNAI 4702, PKDD, pp. 350–357, 2007.

[15] U. Brandes, M. Gaertler, and D. Wagner, “Experiments on

graph clustering algorithms,” In ESA, pp. 568–579, 2003.

undation of Shandong Province (No. Z2007G03), and