J. Service Science & Management, 2009, 2: 362-367
doi:10.4236/jssm.2009.24043 Published Online December 2009 (www.SciRP.org/journal/jssm)
Copyright © 2009 SciRes JSSM
A Projection Clustering Technique Based on
Xiyu LIU1, Xinjiang XIE 2, Wenping WANG1
1School of Management and Economics, Shandong Normal University, Jinan, China; 2Department of Computer Science and Tech-
nology, Shandong Agricultural Administrators College, Jinan, China.
Email: xyliu@sdnu.edu.cn, wenping216@126.com, xjxie@163.com
Received July 24, 2009; revised September 10, 2009; accepted October 20, 2009.
Projection clustering is an important cluster problem. Although there are extensive studies with proposed algorithms
and applications , one of the basic computing arch itectures is that they are all at the level of data objects. The purp ose
of this paper is to propose a new clustering technique based on grid architecture. Our new technique integrates mini-
mum spanning tree and grid clustering together. By this integration of projection clustering with grid technique, the
complexity of compu ting is lowered to .
Keywords: Cluster Analysis, Projection Clustering, Grid Clustering
1. Introduction
Cluster analysis is as an important area in data mining
which can explore the hidden structures of business da-
tabases [1]. Traditionally, cluster analysis can be catego-
rized as three classes. Partitioning method works by con-
structing various partitions and then evaluating them by
some criterion. Hierarchy method creates a hierarchical
decomposition of the set of data (or objects) using some
criterion. Density-based method is based on connectivity
and density functions. Grid-based method is based on a
multiple-level granularity structure. Model-based method
is to construct a model and to find the best fit model.
Along these lines, many techniques and algorithms
have been proposed in the literature. For example, Ester
et al. [2] present the density-based clustering algorithm
which uses an Eps-neighborhood for a point containing
at least a minimum number of points. Raphael Bar-Or
and Christiaan van Woudenberg [3,4] present a grav-
ity-based clustering method intended to find clusters of
data in n-space. The most classical clustering technique
is due to Raymond T. Ng and Jiawei Han [5] who devel-
oped a CLARANS which aims to use randomized search
to facilitate the clustering of a large number of objects.
More recent work include agglomerative fuzzy K-Means
clustering algorithm by introducing a penalty term to the
objective function to make the clustering process insensi-
tive to the initial cluster centers [6].
Among all these clustering techniques, one of the basic
measurements is the Euclidean distance. It requires simi-
lar objects to have close values in all dimensions. When
similarity between objects in high dimensional space is
absent, this kind of technique is often invalid. To solve
this problem, dimension reduction and manifold learning
is applied [7–9]. Another method for this skewed data is
the projection clustering [10]. The main idea of projected
clustering is that different clusters may distribute along
part of the dimensions. A projected cluster is a subset of
data points, together with a subspace of dimensions, so
that the points are closely clustered in the subspace.
Different with the above clustering approaches, graph
clustering works by transfo rming the initial working d ata
into a kind of graph. Then graph clustering techniques
can be app lied to ob tain the final clustering. One of these
techniques is the Minimum Spanning Tree (MST) based
clustering. A lthough the f irst MST-b ased clustering algo-
rithms have been studied for many years, due to its com-
putational efficiency for large databases, it attracts new
researches frequently. In a more recent work [11], the
authors present a more efficient method based on the
divide and conquer approach that can quickly identify the
longest edges in an MST so as to save some computa-
tions. The experimental results show that their MST in-
spired clustering algorithm is very effective and stable
when applied to various clustering problems. The authors
also expect that their algorithms have a
computing time. (log)ON N
In this paper, we propose a new projection clustering
technique by Minimum Spanning Tree based on the grid
A Projection Clustering Technique Based on Projection363
clustering approach. Basically, our MST-inspired clus-
tering technique works on cells rather than data points
directly. This will significantly reduce the size of graph
and MST. Due to this reason, our technique has no spe-
cific requirements on the dimensionality of the data sets.
This is different from some typical projection clustering
algorithms [10].
The rest of this paper is organized as follows: Section
2 presents the basic idea of projection clustering. In Sec-
tion 3, we summarize some basics on MST-based clus-
tering techniques. In Section 4, we propose a flexible grid
clustering approach. Clustering is also discussed as an
optimization process in this section. Section 5 contains a
brief clusteri ng behavior and tim e complexity analysis.
2. Projection Cell Clustering
Suppose the raw data set is 1
{, ,}
(,, )
. Each point
has n components by
xx. It is contained in a
rectangle D0 in Rn. Generally, we will not cluster the
original data set X. Instead, we will consider a derived
data set X composed of data cells. However, after we
transforme the data set into cells, each cell can be considered
as a new data point represented by its center. The nu mber
of data points in the cell is called the mass of the cell.
More precisely, a cell (grid) is a polyhedron as a com-
plex x = (D, norm, c, b, p), where D is the polyhedron,
norm is a unit vector indicating normal vector of one of
its edges, c is the center, b is a boolean value indicating
whether the grid is dense or sparse, and p is the number
of data points covered by the grid.
In order to simplify symbols, we will use as cell
data object and xX
X as the data points defined by x
in the or iginal data space X. For two objects x, y, the dis-
tance is defined as the minimum distance of the two cells
[], []
(, )min(,)
The diameter of a data object is measurement of its
Let N(x) be the set of k-nearest neighbors (KNN) of x
including itself, then the number of object in N(x) is k+1.
The sparseness or thickness of the data object can be
measured by the relative location of its k-nearest
() (,)
Suppose there is a mass function defined on by
; (total number of points).
Then we define the density of a data point as
X() #[]mxx p
() ()
. We use i
to denote the pr ojection
operator in the i-th component, i.e., () i
Respectively, (,dx ) ()
y dy,x
. For ,xy
X, de-
fine {:
zz x
, ( ,)
, and
() (
. Then we consider projection into the
i-th component, the KNN neighbor set N(x) is replaced
by 1
{ ,
xy()Nx ,,: ii
are the k-nearest points
to i
}. The corresponding definition of sparseness and
density are
() (,)
() ()
() i
iixN x
Now we describe the process of projected clustering.
The main idea is that distance between data objects is
restricted to subsets of dimensions where object values
are dense [10]. This means that we only consider contri-
butions of relevant dimensions when computing the dis-
tance between data point and the cluster center.
Different from [10], we use a fixed threshold value to
determine the dense and sparse dimensions. Now let
X to be a cell data object. Suppose 00
is a
positive threshold determined in the process of griding
process. Define a matrix []
ijN n
1,if( );1,,
0, else
By this index matrix we obtain a projected cell dis-
tance as follows
(, )((, ))
xj yjj
xy xy
3. Minimum Spanning Trees
Let G=(V, E) be a connected, undirected edge-weighted
graph with N nodes as before. W = [wij] is the weight
matrix. For any , we use Q = G-P to denote the
subgraph generated by the vertices V\P called a partition
of nodes. A spanning subgraph is a subgraph that contains
all the vertices of the original graph. A minimal spanning
tree of graph G is a spanning graph with no circuits
whose weight is minimum among all spanning t rees of G.
For a partition P, Q of G, define (,)PQ
as the
smallest weight among all edges from the cut-set C (P,
Q), which is the set of edges connecting P and Q. A link
is any edge in C(P, Q) with weight (,)PQ
. The link
set is denoted by (,)PQ
1Notice that the size of this matrix maybe less than N since we are using
cells instead of data points.
Copyright © 2009 SciRes JSSM
A Projection Clustering Technique Based on Projection
There are several ways to build Minimum Spanning
Tree (MST) from the graph [11]. Two popular ways to
implement the algorithms are the agglomerative and the
divisive procedures.
The well-known agglomerative procedures are the
Kruskal and Prim’s algorithms. The first one works by
constructing the tree from initial N isolated vertices of
the original graph. All the edge s are so rted into a non-d e-
creasing order by their weights. For each edge which is
not in the tree, if this edge does not form a cycle with the
current tree, then we can add this edge to the tree. In the
Prim’s algorithm, the tree construction starts with a root
node. At each step, among all the edges between the
nodes in the tree T and those which are not in the tree yet,
the node and the edge associated with the smallest weight
to the tree T are added.
The second kind of algorithm is the divisive one called
the reverse delete algorithm starting with the full graph.
Edges are deleted in order of nonincreasing weights bas-
ed on the cycle property as long as keeping the connec-
tivity of the graph.
Some well-known properties of MST are summarized
in the following theorem [12].
Theorem 3.1. The minimum spanning tree T(G) of a
graph G has the following properties.
1) T contains at least one edge from (,)PQ
for each
partition P, Q.
2) Each edge of T is a link of some partition of G.
3) Let (C1,C2) be a partitio n of G. If 12
(,)( ,)PQC C
for each partition (P, Q) of C1, then T (C1) forms a con-
nected subtree of T (G).
Once we have the MST, we can obtain the desired
clustering by removing inconsistent edges of MST. The
simplest way to define inconsistent edges is using weight
measure ratio of the edge with average weight of nearby
edges in the tree [12]. If the ratio is larger than a thresh-
old, then it is incon sistent. We can determine a stop crite-
ria by the number of clusters, or a minimum size of any
cluster by removing edges which can result in two clusters
whose sizes are larger than the minimum cluster size.
If we know the number of clusters k, then clustering
can start by removing k-1 arbitrary edges from the tree,
creating a k-partition. Then we can minimize the change
of the total weight of the current clusters to obtain the
final clustering.
To reduce comp utation co mplex ity, Xiaochun Wang et
al. proposed a divide and conquer approach [11]. Given a
loose estimate of minimum and maximum numbers of
data items in each cluster, they propose an iterative ap-
proach for MST clustering algorithm in five steps: 1) Sta rt
with a spanning tree built by the sequential initialization
(SI). 2) Calculate the mean and the standard deviation of
the edge weights in the current distance array and use
their sum as the threshold. Partially refine the spanning
tree by running Divisive Hierarchical Clustering Algo-
rithm (DHCA) multiple times until the percentage
threshold difference between two consecutively updated
distance arrays is below 6
. 3) Identify and verify the
longest edge candidates by running MDHCA until two
consecutive longest edge distances converge to the same
value at the same places. 4) Remove this longest edge. 5)
If the number of clusters in the data set is preset or if the
difference between two consecutively removed longest
edges has a percentage decrement larger than 50 percent
of the previous one, we sto p. Ot herwise go to Step 3.
However, if the graph size is not large, we can directly
get clustering from the graph. When we use the flexible
grids technique to obtain the graph, this is often th e case.
Anyway, the technique of [11] can be applied to further
reduce the computing time.
4. Grid Based Spatial Clustering
The grid based clustering uses a multi-resolution grid
structure which contains the data objects and acts as op-
erands of clustering performance [1]. For example, the
authors [13] propose a gravity based grid which ap-
proximates the cell influence by gravity centers. The au-
thors claim that the proposed technique can reduce
memory usage and simplify computational complexity
with minor loses of the clustering accuracy.
Traditional grids are regular hypercubic grid. This re-
quires the grid construction cover all the data space with
the same precision. The second method uses flexible
grids, i.e. multi-resolution grids with hypercubic or hy-
per-rectangular cells having randomly oriented borders
[14]. The main clustering technique is a tree based
searching with a similarity measure composed of both
the density and distance differences [15].
Suppose the data set is 1
[, ,}n
. It co-
ntains in a rectangle in . A grid is a graph G
where each node is a complex v = ( D, norm, c, is-
Crowded, p), where D is the polyhedra, norm is a unit
vector indicating normal vector of previous cutting plane,
c is a point which lies in the grid acting as its center, is-
Crowded is 1 or 0 indicating whether the grid is highly
populated or not, and p is the number of da ta points cov-
ered by the grid. The initial grid is D0 with an arbitrary
normal vector. In each step, we can define the center of
the grid as its geometrical center.
For two nodes (
vD , ,,,)
ii i
norm c isCrowdedpi
, 1, 2i
xists a connecting edge between them. The weight
value on this edge is defined as
there e
12 12
0,if min{,}0
(, )(,), else
DD vv
Copyright © 2009 SciRes JSSM
A Projection Clustering Technique Based on Projection
Copyright © 2009 SciRes JSSM
The graph is constructed in an iterative way. We start
with an initial candidate node
where D is the origi
graph into clusters. A commonly used technique to deal
with this problem is the hierarchical clustering [1]. Ag-
glomerative methods iteratively connect vertices that are
close to each other with edges. Small clusters are merged
with each other building up the hierarchical structure to
find the desired larger clustering structure. Divisive
methods on the contrary are based on network flows.
This is done by iteratively identifying the intercluster
edges in a top-down approach.
nal rectangle, norm0
p is the total popula-
,)Crowded p
tion number. T
is a random selected unit vector, c is the geometrical
center of D0, is Crowded = 1, and
hen at each step, the cell containing more
number of points (controlled by a threshold value p
, or
larger enough by diameter) controlled by another thresh-
old value d
, is split into two subcells by a hyperplane
which is orthogonal to the current normal vector.si-
tion of the hyperplane is random. A cell is called
crowded if its population is larg er than p
. Otherwise it
is called sparse. If we reach a sparse cell, then add this
cell to the node set of the graph. If we reach a cell with
diameter less than d
, then add this celle node set.
This step continues until each cell has a population
less than p
to th
, or its diameter is smaller than d
. Table 1
gives the algorithmr the graph construction process.
Once we have completed the graph construction, those
nodes in the graph which are not crowded will corre-
spond to vacant area or outliers. Therefore, in order to
reduce computing complexity, we first remove all sparse
graph nodes with corresponding edges. The resulting
graph is G = (V, E) where V is the set of vertices, E the
set of weighted edges. An example is shown in Figure 1
with part of its edge s.
Now we use C(X) = {Xq: q = 1, 2,…k} to denote a
clustering of the data set X where O is the set o f outliers.
ithph e
By this algorithm, we can generate a hierarchical grid
together w a resulting graph. When the gra is gener-
ated, the clustering will become grouping nodes of th
s construction algorithm
Algorithm: Construction of flexible grids
Table 1. Flexible grid
1, ,}
x x dataset of N points in n
D: hyper-rectangle co n ta in in g X
: population threshold value.
: cell diameter threshold value
{, , }
Vv v set of vertices
(0){(,, ,1,)}VvDncp . Let t = 0.
for each v = (D, n, c, isCrowded, p)
Generate a cutting hyperplane L passing c and with normal
vector parallel to n. Cut the current cell v into two subcells
,vv. For each new node, if p < p
or diam(D) < d
add this new cell to the node set V. Else add it to V(t + 1).
Let the new norms be orthogonal to n.
t + +;
A Projection Clustering Technique Based on Projection
Figure 1. An example clustering area three clusters
For given population threshold value and diameter
threshold value, we can generate the flexible grid and
obtain a graph. Then we can generate a minimum span-
ning tree. Then we can get the final clustering. Figure 1
shows an example of minimum spanning tree corre-
sponding to the data set in Figure 1.
Consequently, for fixed
and p
, the data set X is
clustered as (, ,
CX )
. We define||
as e
number of cells in Xq. Define the energy of clustering as
the sum of intra-cluster variation measure and in-
ter-cluster distance as
|| ()()
min (,)
ij q
qvv qi
vv X
kXdiam Ddiam D
By this grid based method, the final clustering can be
treated without direct computation to the data points and
reduce the number of nodes significantly. The only p-
ne way to choose these two parameters is to optimize
the energy of clustering.
5. A Performance Study
e twe want to cluster a data set with
N objects. By the graph construction al the
ious section, the data
meters we need to determine beforehand is the cel
wded threshold value and the minimum cell diameter. cro
Supposhat n
gorithm in
set is split inier-
chy. A minimum spanning tree is constructed associ-
ated with the cell grap h.
make things simpler, we will assume that the cut-
ting planes are perpendicular to one of the axis in this
section. Moreover, we assume that each cutting plane
through the geometrical center of the current cell.
Therefore, all the cells are rectangles. Then we can easily
count the total number of nodes of the graph.
se the original data cube D0 has a diameter
to a cell h
Suppo 0
and the initial popu lation threshold is p
. For some
large number M, let
. L etficiently 0
be another decreasing
priate two sequences, we c
For pecific populion
rameters ,
sequence. Bhoosing apy cpro-
an optimize the clustering.
sat and diameter threshold pa-
, let the induced graph be G(,
) with
ng tree T(minimum spanni,
). A cluste of T is
denoted b
y C(,
) is a set disjoint subtree whose
node with thriginal truse
e o
ee. We s set coin
C to denote the clustof the data
Evidently we have the fo llowing properties.
ering set.
Theorem 5.1. Clusters have two properties.
1) Anti-joint property. If 12
, then two disjoint
clusters in 1
(, )
C are disjoint in 2
(, )
2) Monotonicity property. If 12
, then a cluster in
C cannot be disintegrated in 2
ulatNow we assume the popion throld esh
is a con-
stant which determines thgranularity of the problem.
Therefore we usee
C tog denote the clus. Let us
split the energy into two parts, the intra-cluster energy
and the intercluster energy ()
as follows
|| ()()
ij q
qvv qi
vv X
 (11)
(,)min (,)
Theorem 5.2. Inter-cluster energy is monotone. That is
to say, if 12
, then 12
(, )(,)
ie ie
Proof: It is clear to see that when 12
, a cell
maybe split into small cells. Therefo re, either the set i
will be smaller which means that the distance
(, )
will become larger.
In a recent work [11] the authors propose a divide and
Copyright © 2009 SciRes JSSM
A Projection Clustering Technique Based on Projection
Copyright © 2009 SciRes JSSM
g of two phases. The
tial initialization and the
se uses
conquer based algorithm consistin
first phase includes the sequen
spanning tree updating, and the second pha some
technique to locate the longest edges and partitions the
obtained approximate minum spanning tree to form
sensible clusters. The authors expect the first phase has
is constant. The average time
complexity of the second phase is (log)OeNN where e
is constant. Therefore their expectation of time complex-
ity is(log)ONN . However, our algorithm do provide a
time complexity of (log)ONN .
Theorem 5.3. Suppose thresholds are p
. Then the time complexity of the flexible gs
construction algorithm is (log)ONN .
Proof: At each stage, if a,d
) is sparse, i.e., isCrowded =0, then the cell is a node
in the graph. Otherwise, a cutting hyperplane L passing c
ll v
(,,,vD n c isCrowde
and with normal vector parallel to n is generated. The
current ce is cut into two subcells 12
,vv r each
new node, if p
. Fo
or () d
d Diam
, add this new
cell to the node.
In this process, the computation of isCrowded and p
e a time of ()
ON where i
N is the nu
set V
both havmber of
ceta points in the new cell. Ideally, the plane L passes the
nter of the cell v. Hence /2
for i = 1; 2. If not
so xpect th. Let thotal time
co we
2T that T(
d arc
structio re rela-
cated. The main ingredient here is the application of grid
clustering to pr oject i o n cl ust er i ng.
Apparently, this research will lead to efficie
rithms. In future work, we will give experimental study
on the new technique. This will be lengthy, for the clus-
tering is essentially an optimization process. The best
population threshold
, the plane is randomly placed by a uniform distribution
which we ee same propertye t
mplexity be T(N). Thenhave T(N) =
(N=2)+O(N). Hence we knowN) =O(N log N).
In this paper we present a new projection clustering tech-
nique based on grihitecture and minimum spanning
tree. The effective using of minimum spanning tree can
possibly reduce computing complexity although the con-
n of the graph and the tree atively compli
nt algo-
is to be determin
timizes the clustering energy presented in Section 3. For
th , oith an application.
Research is e Natural Science
Foundation of Che Natural Science
techniqu d
er, and X. Xu
, OR, pp. 226.
[3] novel
me 1.
[4] enberg, “Gene expression
, Septemctober 2002.
. M. Cheung, and J. Huang, “Agglom-
mixed typ
09, 2005.
Y. Zha, “Principal manifolds and
ation approach
ing algorithms,” Journal of
pp. 1267–1274, 2003.
-based clustering,” J. N. Kok et
the Science and Technology Project of Shandong Educa-
tion Bureau.
[1] J. W. Han and M. Kamber, “Data mining concepts and
es,” 2nd Eition, Elsevier, Singapore, 2006.
[2] M. Ester, H. P. Kriegel, J. Sand, “A den-
sity-based algorithm for discovering clusters in large spa-
tial databases with noise,” in Proceedings 2nd Interna-
tional Conference on Knowledge Discovery and Data
Mining, Portland–231, 1996
R. Bar-Or and C. van Woudenberg, “Agrav-
ity-based clustering method, technical report,” Depart-
nt of Applied Mathematics, University of Colorado,
Denver, CO 80202, 200
R. Bar-Or and C. van Woud
analysis with a novel gravity-based clustering method,”
pp.1–46, December 2001.
[5] R. T. Ng and J. Han, “Clarans: A method for clustering
objects for spatial data mining,” IEEE Trans. on Knowl-
edge and Data Engineering, Vol. 14, No. 5, pp.
1003–1016 ber/O
[6] M. Li, M. K. Ng, Y
erative fuzzy K-Means clustering algorithm with selection
of number of clusters,” IEEE Trans on Knowledge and
Data Engineering, Vol. 20, No. 11, pp. 1519–1534, 2008.
[7] Z. He, X. Xu, and S. Deng, “Scalable algorithms for clus-
tering large datasets withe attributes,” Interna-
tional Journal of Intelligent Systems, Vol. 20, pp.
1077–1089, 2005.
[8] Z. F. He and F. L. Xiong, “A constrained partition model
and K-Means algorithm,” Journal of Software, Vol.16,
No.5, pp. 799–8
[9] Z. Y. Zhang and H. nonlinear dimensionality reduction via tangent space
alignment,” SIAM Journal of Scientific Computing, Vol.
26, No. 1, pp. 313–338, 2004.
[10] M. Bouguessa and S. R. Wang, “Mining projected clus-
ters in high-dimensional spaces,” IEEE Transaction on
Knowledge and Data Engineering, Vol. 21, No. 4, pp.
507–522, 2009.
[11] X. C. Wang, X. L. Wang, and D. M. Wilkes, “A di-
vide-and-conquer approach for minimum spanning tree-
based clustering,” IEEE Trans on Knowledge and Data
Engineering, Vol. 21, No. 7, pp. 945–958, 2009.
[12] C. T. Zahn, “Graph-theoretical methods for detecting and
describing gestalt clusters,” IEEE Trans. Computers, Vol.
20, No. 1, pp. 68–86, January 1971.
[13] C. H. Li and Z. H. Sun, “A mean approxim
ed which op-
is reason, we will present this part of the research in
another papertgether wto a class of grid-based cluster
Software, Vol. 14, No. 7,
This project is carried out under the “Taishan Schol
project of Shandong, China.
also supported by th
ina (No.6087305 8), th
[14] Marc-Ismael, Akodj`enou-Jeannin, Kav´e Salamatian, and
P. Gallinari, “Flexible grid
al. (Eds.), LNAI 4702, PKDD, pp. 350–357, 2007.
[15] U. Brandes, M. Gaertler, and D. Wagner, “Experiments on
graph clustering algorithms,” In ESA, pp. 568–579, 2003.
undation of Shandong Province (No. Z2007G03), and