### Paper Menu >>

### Journal Menu >>

J. Service Science & Management, 2009, 2: 362-367 doi:10.4236/jssm.2009.24043 Published Online December 2009 (www.SciRP.org/journal/jssm) Copyright © 2009 SciRes JSSM A Projection Clustering Technique Based on Projection Xiyu LIU1, Xinjiang XIE 2, Wenping WANG1 1School of Management and Economics, Shandong Normal University, Jinan, China; 2Department of Computer Science and Tech- nology, Shandong Agricultural Administrators College, Jinan, China. Email: xyliu@sdnu.edu.cn, wenping216@126.com, xjxie@163.com Received July 24, 2009; revised September 10, 2009; accepted October 20, 2009. ABSTRACT Projection clustering is an important cluster problem. Although there are extensive studies with proposed algorithms and applications , one of the basic computing arch itectures is that they are all at the level of data objects. The purp ose of this paper is to propose a new clustering technique based on grid architecture. Our new technique integrates mini- mum spanning tree and grid clustering together. By this integration of projection clustering with grid technique, the complexity of compu ting is lowered to . O(NlogN) Keywords: Cluster Analysis, Projection Clustering, Grid Clustering 1. Introduction Cluster analysis is as an important area in data mining which can explore the hidden structures of business da- tabases [1]. Traditionally, cluster analysis can be catego- rized as three classes. Partitioning method works by con- structing various partitions and then evaluating them by some criterion. Hierarchy method creates a hierarchical decomposition of the set of data (or objects) using some criterion. Density-based method is based on connectivity and density functions. Grid-based method is based on a multiple-level granularity structure. Model-based method is to construct a model and to find the best fit model. Along these lines, many techniques and algorithms have been proposed in the literature. For example, Ester et al. [2] present the density-based clustering algorithm which uses an Eps-neighborhood for a point containing at least a minimum number of points. Raphael Bar-Or and Christiaan van Woudenberg [3,4] present a grav- ity-based clustering method intended to find clusters of data in n-space. The most classical clustering technique is due to Raymond T. Ng and Jiawei Han [5] who devel- oped a CLARANS which aims to use randomized search to facilitate the clustering of a large number of objects. More recent work include agglomerative fuzzy K-Means clustering algorithm by introducing a penalty term to the objective function to make the clustering process insensi- tive to the initial cluster centers [6]. Among all these clustering techniques, one of the basic measurements is the Euclidean distance. It requires simi- lar objects to have close values in all dimensions. When similarity between objects in high dimensional space is absent, this kind of technique is often invalid. To solve this problem, dimension reduction and manifold learning is applied [7–9]. Another method for this skewed data is the projection clustering [10]. The main idea of projected clustering is that different clusters may distribute along part of the dimensions. A projected cluster is a subset of data points, together with a subspace of dimensions, so that the points are closely clustered in the subspace. Different with the above clustering approaches, graph clustering works by transfo rming the initial working d ata into a kind of graph. Then graph clustering techniques can be app lied to ob tain the final clustering. One of these techniques is the Minimum Spanning Tree (MST) based clustering. A lthough the f irst MST-b ased clustering algo- rithms have been studied for many years, due to its com- putational efficiency for large databases, it attracts new researches frequently. In a more recent work [11], the authors present a more efficient method based on the divide and conquer approach that can quickly identify the longest edges in an MST so as to save some computa- tions. The experimental results show that their MST in- spired clustering algorithm is very effective and stable when applied to various clustering problems. The authors also expect that their algorithms have a computing time. (log)ON N In this paper, we propose a new projection clustering technique by Minimum Spanning Tree based on the grid A Projection Clustering Technique Based on Projection363 clustering approach. Basically, our MST-inspired clus- tering technique works on cells rather than data points directly. This will significantly reduce the size of graph and MST. Due to this reason, our technique has no spe- cific requirements on the dimensionality of the data sets. This is different from some typical projection clustering algorithms [10]. The rest of this paper is organized as follows: Section 2 presents the basic idea of projection clustering. In Sec- tion 3, we summarize some basics on MST-based clus- tering techniques. In Section 4, we propose a flexible grid clustering approach. Clustering is also discussed as an optimization process in this section. Section 5 contains a brief clusteri ng behavior and tim e complexity analysis. 2. Projection Cell Clustering Suppose the raw data set is 1 {, ,} N X xx 1 (,, ) n . Each point has n components by x xx. It is contained in a rectangle D0 in Rn. Generally, we will not cluster the original data set X. Instead, we will consider a derived data set X composed of data cells. However, after we transforme the data set into cells, each cell can be considered as a new data point represented by its center. The nu mber of data points in the cell is called the mass of the cell. More precisely, a cell (grid) is a polyhedron as a com- plex x = (D, norm, c, b, p), where D is the polyhedron, norm is a unit vector indicating normal vector of one of its edges, c is the center, b is a boolean value indicating whether the grid is dense or sparse, and p is the number of data points covered by the grid. In order to simplify symbols, we will use as cell data object and xX [] x X as the data points defined by x in the or iginal data space X. For two objects x, y, the dis- tance is defined as the minimum distance of the two cells [], [] (, )min(,) pxqy x ydpq (1) The diameter of a data object is measurement of its size ,[] 1 ()max(,) 2pqx x pq (2) Let N(x) be the set of k-nearest neighbors (KNN) of x including itself, then the number of object in N(x) is k+1. The sparseness or thickness of the data object can be measured by the relative location of its k-nearest neighbors () 1 () (,) zNx x zx k (3) Suppose there is a mass function defined on by ; (total number of points). Then we define the density of a data point as X :mR X() #[]mxx p () 1 () () () xNx x mx x (4) Suppose x X . We use i to denote the pr ojection operator in the i-th component, i.e., () i ii x xx i . Respectively, (,dx ) () ii y dy,x . For ,xy X, de- fine {: ii[]} x zz x (,) i , ( ,) ii x yxy , and () ( i) i x x . Then we consider projection into the i-th component, the KNN neighbor set N(x) is replaced by 1 { , ii xy()Nx ,,: ii y k y are the k-nearest points to i x }. The corresponding definition of sparseness and density are () 1 () (,) i ii zNx x zx k , () 1 () () () i iixN x x mx x (5) Now we describe the process of projected clustering. The main idea is that distance between data objects is restricted to subsets of dimensions where object values are dense [10]. This means that we only consider contri- butions of relevant dimensions when computing the dis- tance between data point and the cluster center. Different from [10], we use a fixed threshold value to determine the dense and sparse dimensions. Now let x X to be a cell data object. Suppose 00 is a positive threshold determined in the process of griding process. Define a matrix [] ijN n by1 0 , 1,if( );1,, 0, else j xj xj n (6) By this index matrix we obtain a projected cell dis- tance as follows 1 2 2 ,, 1 (, )((, )) n xj yjj j xy xy (7) 3. Minimum Spanning Trees Let G=(V, E) be a connected, undirected edge-weighted graph with N nodes as before. W = [wij] is the weight matrix. For any , we use Q = G-P to denote the subgraph generated by the vertices V\P called a partition of nodes. A spanning subgraph is a subgraph that contains all the vertices of the original graph. A minimal spanning tree of graph G is a spanning graph with no circuits whose weight is minimum among all spanning t rees of G. PV For a partition P, Q of G, define (,)PQ as the smallest weight among all edges from the cut-set C (P, Q), which is the set of edges connecting P and Q. A link is any edge in C(P, Q) with weight (,)PQ . The link set is denoted by (,)PQ [12]. 1Notice that the size of this matrix maybe less than N since we are using cells instead of data points. Copyright © 2009 SciRes JSSM A Projection Clustering Technique Based on Projection 364 There are several ways to build Minimum Spanning Tree (MST) from the graph [11]. Two popular ways to implement the algorithms are the agglomerative and the divisive procedures. The well-known agglomerative procedures are the Kruskal and Prim’s algorithms. The first one works by constructing the tree from initial N isolated vertices of the original graph. All the edge s are so rted into a non-d e- creasing order by their weights. For each edge which is not in the tree, if this edge does not form a cycle with the current tree, then we can add this edge to the tree. In the Prim’s algorithm, the tree construction starts with a root node. At each step, among all the edges between the nodes in the tree T and those which are not in the tree yet, the node and the edge associated with the smallest weight to the tree T are added. The second kind of algorithm is the divisive one called the reverse delete algorithm starting with the full graph. Edges are deleted in order of nonincreasing weights bas- ed on the cycle property as long as keeping the connec- tivity of the graph. Some well-known properties of MST are summarized in the following theorem [12]. Theorem 3.1. The minimum spanning tree T(G) of a graph G has the following properties. 1) T contains at least one edge from (,)PQ for each partition P, Q. 2) Each edge of T is a link of some partition of G. 3) Let (C1,C2) be a partitio n of G. If 12 (,)( ,)PQC C for each partition (P, Q) of C1, then T (C1) forms a con- nected subtree of T (G). Once we have the MST, we can obtain the desired clustering by removing inconsistent edges of MST. The simplest way to define inconsistent edges is using weight measure ratio of the edge with average weight of nearby edges in the tree [12]. If the ratio is larger than a thresh- old, then it is incon sistent. We can determine a stop crite- ria by the number of clusters, or a minimum size of any cluster by removing edges which can result in two clusters whose sizes are larger than the minimum cluster size. If we know the number of clusters k, then clustering can start by removing k-1 arbitrary edges from the tree, creating a k-partition. Then we can minimize the change of the total weight of the current clusters to obtain the final clustering. To reduce comp utation co mplex ity, Xiaochun Wang et al. proposed a divide and conquer approach [11]. Given a loose estimate of minimum and maximum numbers of data items in each cluster, they propose an iterative ap- proach for MST clustering algorithm in five steps: 1) Sta rt with a spanning tree built by the sequential initialization (SI). 2) Calculate the mean and the standard deviation of the edge weights in the current distance array and use their sum as the threshold. Partially refine the spanning tree by running Divisive Hierarchical Clustering Algo- rithm (DHCA) multiple times until the percentage threshold difference between two consecutively updated distance arrays is below 6 10 . 3) Identify and verify the longest edge candidates by running MDHCA until two consecutive longest edge distances converge to the same value at the same places. 4) Remove this longest edge. 5) If the number of clusters in the data set is preset or if the difference between two consecutively removed longest edges has a percentage decrement larger than 50 percent of the previous one, we sto p. Ot herwise go to Step 3. However, if the graph size is not large, we can directly get clustering from the graph. When we use the flexible grids technique to obtain the graph, this is often th e case. Anyway, the technique of [11] can be applied to further reduce the computing time. 4. Grid Based Spatial Clustering The grid based clustering uses a multi-resolution grid structure which contains the data objects and acts as op- erands of clustering performance [1]. For example, the authors [13] propose a gravity based grid which ap- proximates the cell influence by gravity centers. The au- thors claim that the proposed technique can reduce memory usage and simplify computational complexity with minor loses of the clustering accuracy. Traditional grids are regular hypercubic grid. This re- quires the grid construction cover all the data space with the same precision. The second method uses flexible grids, i.e. multi-resolution grids with hypercubic or hy- per-rectangular cells having randomly oriented borders [14]. The main clustering technique is a tree based searching with a similarity measure composed of both the density and distance differences [15]. Suppose the data set is 1 [, ,}n N X xxR n R . It co- ntains in a rectangle in . A grid is a graph G where each node is a complex v = ( D, norm, c, is- Crowded, p), where D is the polyhedra, norm is a unit vector indicating normal vector of previous cutting plane, c is a point which lies in the grid acting as its center, is- Crowded is 1 or 0 indicating whether the grid is highly populated or not, and p is the number of da ta points cov- ered by the grid. The initial grid is D0 with an arbitrary normal vector. In each step, we can define the center of the grid as its geometrical center. 0 D For two nodes ( ii vD , ,,,) ii i norm c isCrowdedpi , 1, 2i xists a connecting edge between them. The weight value on this edge is defined as there e (8) 12 12 12 0,if min{,}0 (, )(,), else pp DD vv Copyright © 2009 SciRes JSSM A Projection Clustering Technique Based on Projection Copyright © 2009 SciRes JSSM 365 The graph is constructed in an iterative way. We start with an initial candidate node where D is the origi graph into clusters. A commonly used technique to deal with this problem is the hierarchical clustering [1]. Ag- glomerative methods iteratively connect vertices that are close to each other with edges. Small clusters are merged with each other building up the hierarchical structure to find the desired larger clustering structure. Divisive methods on the contrary are based on network flows. This is done by iteratively identifying the intercluster edges in a top-down approach. 00 (,vD nal rectangle, norm0 p is the total popula- 0 ,,normcis ,)Crowded p tion number. T 0 is a random selected unit vector, c is the geometrical center of D0, is Crowded = 1, and hen at each step, the cell containing more number of points (controlled by a threshold value p , or larger enough by diameter) controlled by another thresh- old value d , is split into two subcells by a hyperplane which is orthogonal to the current normal vector.si- tion of the hyperplane is random. A cell is called crowded if its population is larg er than p Po . Otherwise it is called sparse. If we reach a sparse cell, then add this cell to the node set of the graph. If we reach a cell with diameter less than d , then add this celle node set. This step continues until each cell has a population less than p to th , or its diameter is smaller than d . Table 1 gives the algorithmr the graph construction process. Once we have completed the graph construction, those nodes in the graph which are not crowded will corre- spond to vacant area or outliers. Therefore, in order to reduce computing complexity, we first remove all sparse graph nodes with corresponding edges. The resulting graph is G = (V, E) where V is the set of vertices, E the set of weighted edges. An example is shown in Figure 1 with part of its edge s. Now we use C(X) = {Xq: q = 1, 2,…k} to denote a clustering of the data set X where O is the set o f outliers. Then fo ithph e By this algorithm, we can generate a hierarchical grid ()CCX X O together w a resulting graph. When the gra is gener- ated, the clustering will become grouping nodes of th C s construction algorithm Algorithm: Construction of flexible grids (9) Table 1. Flexible grid Inputs 1, ,} N x x dataset of N points in n {X R . D: hyper-rectangle co n ta in in g X p : population threshold value. d : cell diameter threshold value Outputs 1 {, , } N Vv v set of vertices Begin 00000 (0){(,, ,1,)}VvDncp . Let t = 0. while for each v = (D, n, c, isCrowded, p) ()Vt V(t) Generate a cutting hyperplane L passing c and with normal vector parallel to n. Cut the current cell v into two subcells 12 ,vv. For each new node, if p < p or diam(D) < d , add this new cell to the node set V. Else add it to V(t + 1). Let the new norms be orthogonal to n. end t + +; end End A Projection Clustering Technique Based on Projection 366 Figure 1. An example clustering area three clusters For given population threshold value and diameter threshold value, we can generate the flexible grid and obtain a graph. Then we can generate a minimum span- ning tree. Then we can get the final clustering. Figure 1 shows an example of minimum spanning tree corre- sponding to the data set in Figure 1. Consequently, for fixed with d and p , the data set X is clustered as (, , dp CX ) . We define|| q X as e number of cells in Xq. Define the energy of clustering as the sum of intra-cluster variation measure and in- ter-cluster distance as th 2 1, 1 (,,) 11 || ()() 1 min (,) ij ij q pd kj i qvv qi vv X ij ji ik EX p p kXdiam Ddiam D XX j (10) By this grid based method, the final clustering can be treated without direct computation to the data points and reduce the number of nodes significantly. The only p- ral ne way to choose these two parameters is to optimize the energy of clustering. 5. A Performance Study e twe want to cluster a data set with N objects. By the graph construction al the ious section, the data a meters we need to determine beforehand is the cel wded threshold value and the minimum cell diameter. cro O Supposhat n R gorithm in prev set is split inier- chy. A minimum spanning tree is constructed associ- ated with the cell grap h. make things simpler, we will assume that the cut- ting planes are perpendicular to one of the axis in this section. Moreover, we assume that each cutting plane through the geometrical center of the current cell. Therefore, all the cells are rectangles. Then we can easily count the total number of nodes of the graph. se the original data cube D0 has a diameter to a cell h ar To passes Suppo 0 , and the initial popu lation threshold is p . For some large number M, let suf- . L etficiently 0 2,1,2,, i iiM i be another decreasing priate two sequences, we c For pecific populion rameters , sequence. Bhoosing apy cpro- an optimize the clustering. sat and diameter threshold pa- , let the induced graph be G(, ) with ng tree T(minimum spanni, ). A cluste of T is denoted b ring y C(, cide ) is a set disjoint subtree whose node with thriginal truse of e o s ee. We s set coin (,) C to denote the clustof the data Evidently we have the fo llowing properties. ering set. Theorem 5.1. Clusters have two properties. 1) Anti-joint property. If 12 , then two disjoint clusters in 1 (, ) C are disjoint in 2 (, ) C. 2) Monotonicity property. If 12 , then a cluster in 1 (,) C cannot be disintegrated in 2 (,) C. ulatNow we assume the popion throld esh is a con- stant which determines thgranularity of the problem. Therefore we usee () C tog denote the clus. Let us split the energy into two parts, the intra-cluster energy () ia E terin and the intercluster energy () ie E as follows 2 1, (,) 11 || ()() ij ij q ia kj i qvv qi vv X EX j p p kX diamDdiamD (11) 1 1 (,)min (,) ie ij ji ik EX X X (12) Theorem 5.2. Inter-cluster energy is monotone. That is to say, if 12 , then 12 (, )(,) ie ie EX EX . Proof: It is clear to see that when 12 , a cell maybe split into small cells. Therefo re, either the set i X will be smaller which means that the distance (, ) ij ji X X will become larger. In a recent work [11] the authors propose a divide and Copyright © 2009 SciRes JSSM A Projection Clustering Technique Based on Projection Copyright © 2009 SciRes JSSM 367 g of two phases. The tial initialization and the se uses im conquer based algorithm consistin first phase includes the sequen spanning tree updating, and the second pha some technique to locate the longest edges and partitions the obtained approximate minum spanning tree to form sensible clusters. The authors expect the first phase has (log)OfNNwhere f is constant. The average time complexity of the second phase is (log)OeNN where e is constant. Therefore their expectation of time complex- ity is(log)ONN . However, our algorithm do provide a time complexity of (log)ONN . Theorem 5.3. Suppose thresholds are p and d . Then the time complexity of the flexible gs construction algorithm is (log)ONN . Proof: At each stage, if a,d ) is sparse, i.e., isCrowded =0, then the cell is a node in the graph. Otherwise, a cutting hyperplane L passing c rid cell ll v (,,,vD n c isCrowde p and with normal vector parallel to n is generated. The current ce is cut into two subcells 12 ,vv r each new node, if p p . Fo or () d d Diam , add this new cell to the node. In this process, the computation of isCrowded and p e a time of () i ON where i N is the nu set V both havmber of da ceta points in the new cell. Ideally, the plane L passes the nter of the cell v. Hence /2 i Np for i = 1; 2. If not so xpect th. Let thotal time co we 2T that T( 6. d arc structio re rela- cated. The main ingredient here is the application of grid clustering to pr oject i o n cl ust er i ng. Apparently, this research will lead to efficie rithms. In future work, we will give experimental study on the new technique. This will be lengthy, for the clus- tering is essentially an optimization process. The best population threshold , the plane is randomly placed by a uniform distribution which we ee same propertye t mplexity be T(N). Thenhave T(N) = (N=2)+O(N). Hence we knowN) =O(N log N). Conclusions In this paper we present a new projection clustering tech- nique based on grihitecture and minimum spanning tree. The effective using of minimum spanning tree can possibly reduce computing complexity although the con- n of the graph and the tree atively compli nt algo- p is to be determin timizes the clustering energy presented in Section 3. For th , oith an application. 7. ar” Research is e Natural Science Foundation of Che Natural Science Fo techniqu d er, and X. Xu , OR, pp. 226. [3] novel me 1. [4] enberg, “Gene expression , Septemctober 2002. . M. Cheung, and J. Huang, “Agglom- mixed typ 09, 2005. Y. Zha, “Principal manifolds and ation approach ing algorithms,” Journal of pp. 1267–1274, 2003. -based clustering,” J. N. Kok et the Science and Technology Project of Shandong Educa- tion Bureau. REFERENCES [1] J. W. Han and M. Kamber, “Data mining concepts and es,” 2nd Eition, Elsevier, Singapore, 2006. [2] M. Ester, H. P. Kriegel, J. Sand, “A den- sity-based algorithm for discovering clusters in large spa- tial databases with noise,” in Proceedings 2nd Interna- tional Conference on Knowledge Discovery and Data Mining, Portland–231, 1996 R. Bar-Or and C. van Woudenberg, “Agrav- ity-based clustering method, technical report,” Depart- nt of Applied Mathematics, University of Colorado, Denver, CO 80202, 200 R. Bar-Or and C. van Woud analysis with a novel gravity-based clustering method,” pp.1–46, December 2001. [5] R. T. Ng and J. Han, “Clarans: A method for clustering objects for spatial data mining,” IEEE Trans. on Knowl- edge and Data Engineering, Vol. 14, No. 5, pp. 1003–1016 ber/O [6] M. Li, M. K. Ng, Y erative fuzzy K-Means clustering algorithm with selection of number of clusters,” IEEE Trans on Knowledge and Data Engineering, Vol. 20, No. 11, pp. 1519–1534, 2008. [7] Z. He, X. Xu, and S. Deng, “Scalable algorithms for clus- tering large datasets withe attributes,” Interna- tional Journal of Intelligent Systems, Vol. 20, pp. 1077–1089, 2005. [8] Z. F. He and F. L. Xiong, “A constrained partition model and K-Means algorithm,” Journal of Software, Vol.16, No.5, pp. 799–8 [9] Z. Y. Zhang and H. nonlinear dimensionality reduction via tangent space alignment,” SIAM Journal of Scientific Computing, Vol. 26, No. 1, pp. 313–338, 2004. [10] M. Bouguessa and S. R. Wang, “Mining projected clus- ters in high-dimensional spaces,” IEEE Transaction on Knowledge and Data Engineering, Vol. 21, No. 4, pp. 507–522, 2009. [11] X. C. Wang, X. L. Wang, and D. M. Wilkes, “A di- vide-and-conquer approach for minimum spanning tree- based clustering,” IEEE Trans on Knowledge and Data Engineering, Vol. 21, No. 7, pp. 945–958, 2009. [12] C. T. Zahn, “Graph-theoretical methods for detecting and describing gestalt clusters,” IEEE Trans. Computers, Vol. 20, No. 1, pp. 68–86, January 1971. [13] C. H. Li and Z. H. Sun, “A mean approxim ed which op- is reason, we will present this part of the research in another papertgether wto a class of grid-based cluster Software, Vol. 14, No. 7, Acknowledgments This project is carried out under the “Taishan Schol project of Shandong, China. also supported by th ina (No.6087305 8), th [14] Marc-Ismael, Akodj`enou-Jeannin, Kav´e Salamatian, and P. Gallinari, “Flexible grid al. (Eds.), LNAI 4702, PKDD, pp. 350–357, 2007. [15] U. Brandes, M. Gaertler, and D. Wagner, “Experiments on graph clustering algorithms,” In ESA, pp. 568–579, 2003. undation of Shandong Province (No. Z2007G03), and |