Paper Menu >>
Journal Menu >>
J. Software Engineering & Applications, 2010, 3: 141-149 doi:10.4236/jsea.2010.32018 Published Online February 2010 (http://www.SciRP.org/journal/jsea) Copyright © 2010 SciRes JSEA 141 A Novel Spatial Clustering Algorithm Based on Delaunay Triangulation Xiankun Yang1,2, Weihong Cui1 1Institute of Remote Sensing Application, Chinese Academy of Sciences, Beijing, China; 2Graduate University of Chinese Academy of Sciences, Beijing, China. Email: xiankungis@163.com. Received August 17th, 2009; revised September 16th, 2009; accepted November 13th, 2009. ABSTRACT Exploratory data analysis is increasingly more necessary as larger spatial data is managed in electro-magnetic media. Spatial clustering is one of the very important spatial data mining techniques which is the discovery of interesting rela- tionships and characteristics that may exist implicitly in spatial databases. So far, a lot of spatial clustering algorithms have been proposed in many applications such as pattern recognition, data analysis, and image processing and so forth. However most of the well-known clustering algorithms have some drawbacks which will be presented later when ap- plied in large spatial databases. To overcome these limitations, in this paper we propose a robust spatial clustering algorithm named NSCABDT (Novel Spatial Clustering Algorithm Based on Delaunay Triangulation). Delaunay dia- gram is used for determining neighborhoods based on the neighborhood notion, spatial association rules and colloca- tions being defined. NSCABDT demonstrates several important advantages over the previous works. Firstly, it even discovers arbitrary shape of cluster distribution. Secondly, in order to execute NSCABDT, we do not need to know any priori nature of distribution. Third, like DBSCAN, Experiments show that NSCABDT does not require so much CPU processing time. Finally it handles efficiently outliers. Keywords: Spatial Data Mining, Delaunay Triangulation, Spatial Clustering 1. Introduction Data mining is a process to extract implicit, nontrivial, previously unknown and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases [1,2]. The explosive growth in data and databases used in business managements, government administration, and scientific data analysis has created a need for tools that can automatically transform the proc- essed data into useful information and knowledge [3]. Spatial data mining as a subfield of d ata mining refers to the extraction from spatial databases of implicit knowl- edge, spatial relations or significant features or patterns that are not explicitly stored in spatial d atabases [4]. It is concerned with the discovery of spatial relationships and intrinsic relationships between spatial and non-spatial data. With the large amount of spatial data obtained from satellite images and geographic information systems (GIS), it is an inevitable task for humans to explore spa- tial data in detail. Spatial datasets and patterns are abun- dant in many application domains related to the Envi- ronmental Protection Agency, the National Institute of standards and Technology, and the Department of Transportation. Challenges in spatial data mining arise from the following issues [3,5]. Firstly, classical data mining is designed to process numbers and categories. In contrast, spatial data is more complex and includes ex- tended objects such as points, lines and polygons. Sec- ondly, classical data mining works with explicit inputs, whereas, spatial predicates and attributes are often im- plicit. Finally, classical data mining treats each input independently o f oth er inputs, while sp atial patterns of ten exhibit continuity and high autocorrelation among nearby features. Clustering is the process of grouping a set of objects into classes or clusters so that objects within a cluster have similarity in comparison to one another, but are dissimilar to objects in other clusters. So far, many clus- tering algorithms have been proposed. They differ in their capabilities, applicability and computational re- quirements. Based on a general definition, they can be categorized into five broad categories, i.e., hierarchical, partitional, density-based, grid-based and model-based [4]. 1) Partitional clustering methods [6], for example, A Novel Spatial Clustering Algorithm Based on Delaunay Triangulation 142 CLARANS. It classifies data into some groups, which together satisfy the following requirements: firstly, each group must contain at least one object; secondly, each object must belong to exactly on group. It is noticed that the second requirement can be relaxed in some fuzzy partitioning techniques. 2) Hierarchical clustering me- thods [7,8], such as DIANA [9] and BIRCH [8]. A hier- archical method creates a hierarchical de comp o sition of a given set of data objects. Hierarchical methods can be classified as agglomerative (bottom-up) or divisive (top- down), based on how the hierarchical decomposition is formed. 3) Density-based clustering methods. Their gen- eral idea is to continue growing a g iven cluster as long as the density (the number of objects or data points) in the “neighborhood” exceeds a threshold. Such a method is able to filter out noises (outliers) an d discover clusters of arbitrary shape. Examples of density-based clustering methods include DBSCAN [10], OPTICS [11], GDB- SCAN [12] and DBRS [13]. 4) Grid-based clustering methods, such as STING [14] and WaveCluster [15]. Grid-based methods quantize the object space into a fi- nite number of cells that form a grid structure. All of the clustering operations are performed on the grid structure (i.e., on the quantized space). The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension in the quantized space. 5) Model-based clustering methods. For example, COBWEB. It is clearly that no particular clus- tering method has been shown to be superior to all its competitors in all aspects. Typically, the problem is that clusters identified with one meth od can no t be detected by other methods [16]. This is because that many clustering methods need user-specified arguments or prior knowl- edge to produce their best results. Such information needs are supplied as density threshold values, merge/ split conditions, number of parts, prior probabilities, as- sumptions about the distribution of continuous attributes within classes, and/or kern el size for intensity testing, for example, grid size for raster-like clustering [17] and ra- dius for vector-like clustering [18]. This parame- ter-tuning is expensive and inefficient for huge data sets because it demands several trial and error steps. Clustering based on Delaunay triangulation is not a new and has been described in some papers [16, 19, 20, 21]. Kang et al [14] proposed a clustering algorithm that utilizes a Delaunay triangulation; however, there is a need in the algorithm to provide a global argument as a threshold to discriminate perimeter values or edges lengths. As a result, the algorithm is not able to detect local variations. The first non-parametric clustering algo- rithm based on the Delaunay diagram, called AMOEBA, has been proposed in Estivill-Castro and Lee [16]. It overcomes some of the problems of the static approaches that required a distance threshold to be specified, but still fails to find relatively sparse clusters in certain situations. An upgraded version of AMOEBA, called AUTO- CLUST, has been proposed by the same authors in Estiv- ill-Castro and Lee [21]. But these methods also have some drawbacks. For example, if two clusters are mixed or connected by bridges, this methods described above cannot detect all the two clusters as shown in Figure 1. In this paper we propose a robust spatial clustering algorithm named NSCABDT (Novel Spatial Clustering Algorithm Based on Delaunay Triangulation). NSCABDT uses the Delau- nay triangulation as analysis source, because Delaunay triangulation is a structure that is linear in the size of the data set and implicitly contains vast amounts of prox- imity information. That is to say, we can use the graph information of Delaunay triangulation and metric infor- mation to obtain remarkable robust clustering. In this study, we first constru ct a graph, and record the informa- tion of the graph as presented in Section 3. In the graph, vertices represent data points and edges connect pairs of points to model spatial proximity or interactions and all clustering operations are performed on the graph infor- mation. The remainder of the paper is organized as follows: In Section 2, we will give an introduction to data preproc- essing for NSCABDT. And Section 3 presents the NSCABDT algorithm. Section 4 reports the experimental evaluation. Finally, Section 5 concludes the paper. 2. Definitions and Notions 2.1 Spatial Clustering methods Geographic data often show properties of spatial de- pendency and spatial heterogeneity [22]. Spatial de- pendency is a tendency of observations located close to one another in the geographical space to show a higher degree of similarity or dissimilarity (depending on the phenomenon under study). Closeness can be defined very generally—through distance, direction and/or topology. Spatial heterogeneity or inconsistency of the process with respect to its location is often visible, while many geograph ic Figure 1. Three very dense clusters (C1, C2, C3), but most of clustering methods cannot detect the cluster C2. Because of the bridges between cluster C4 and cluster C5, the two clusters often are incorrectly thought to be one cluster Copyright © 2010 SciRes JSEA A Novel Spatial Clustering Algorithm Based on Delaunay Triangulation143 processes have a local character. Spatial dependency and heterogeneity can reflect the nature of the geographic process. Central to spatial data mining is clustering, which seeks to identify subsets o f the data having similar characteristics. Two-Dimensional clustering is the non- trivial process of grouping geographically closer points into the cluster. Therefore, a model of spatial proximity for a discrete point-data set must pro- vide robust answers to which are the neighbors of a point and how far the neighbors relative to the context of the entire data set },,{ 1n ppP i p P . A cluster is a group of objects, which are homogeneous among themselves. Clustering has been identified as one of the fundamental problems in the area of knowledge discovery and data mining, and it is of particular importance for spatial data sets. A dis- tinct characteristic of spatial clustering for data mining applications is the huge size of the data files involved [23]. As Tobler’s famous proposition [24] states: “Eve- rything is related to everything else, but near things are more related than distant things.” Thus proximity is pretty critical to spatial analysis and in spatial settings; clustering criteria almost invariably makes use of some notions of proximity, usually based on the Euclidean metric, as it captures the essence of spatial autocorrela- tion and spatial asso ciation [14]. 22 22 2 11 )()()( ),( jpipjiji jj xxxxxx ZXd (1) d(Xj, Zj)is the Euclidean metric. And and are twodimensions data objects. ),,,(21 ipii xxx )2 ),,,( 21 jpjj xxx (pp We assume that is a set of data items in the dimensional real space . A cluster is a collection of that is similar to one an- other within the same cluster and is dissimilar to the ob- jects in other clusters. A cluster of data objects can be treated collectively as one group. We assume that is a cluster ofand is the collection of , then: },,,{ 1210 n ppppS m S },,,{ 21 k CCCC nm k C S i)1( kCi p ij j CS (2) j CØ( ) (3) kj ,...,2,1 Ø ( ) (4) ji CC jijji ;,...,2,1, If Zj is the clustering center of , then: )1(kjC j k jCs jj jj ZpdJ 1 ),( (5) where J should be minimal. 2.2 Delaunay Triangulations In mathematics, and computational geometry, a Delau- nay triangulation for a set S of points in the plane is a triangulation D(S) such that no point in S is inside the circumcircle of any triangle in D(S). Delaunay triangula- tions maximize the minimum angle of all the angles of the triangles in the triangulation; they tend to avoid skinny triangles. The triangu lation was invented by Boris Delaunay [25]. Delaunay triangulations have been widely used in a variety of applications in geographical informa- tion systems (GIS). Using Delaunay triangulations, it is simpler to tackle the problems associated with spatial topology automated contouring, two-and-half dimen- sional (2.5-D) visualization, surface characterization and reconstructions, and site visibility analyses on terrain surfaces. Given the set of data points in the plane, the Voronoi region of is the locus of points (not necessarily data points) which haveas a nearest neighbor; that is,{. Taken together, the n Voronoi regions of S form the Vo- ronoi diagram of S (also called the Dirichlet tessellation or the proximity map). The regions are (possibly un- bounded) convex polygons, and their interiors are dis- joint [23]. Based on Delaunay’s definition [25], the cir- cumcircle of a triangle formed by three points from the original point set is empty if it does not contain vertices other than the three that define it (other points are per- mitted only on the very perimeter, not inside). },,,{ 110 n pppS Spi i p ,(),(,| 2 ji pxdpxdij )}x The Delaunay triangulation D(S) of S is a planar graph embedding defined as follows: the nodes of D(S) consist of the data points of S, and two nodes pi, pj are joined by an edge if the boundaries of the corresponding Voronoi regions share a line segment. Delaunay triangulations capture in a very compact form the proximity relationships among the points of. They have many useful properties, the most relevant to our application being the following: S 1) If pj is the nearest neighbor of pj from among the data points of S, thenis an edge in D(S). That is to say, the 1-nearest-neighb or digraph is a subgraph of the Delaunay triangulation. ji pp , 2) The number of edges in D(S) is at most 3n -6. 3) The average number of neighbors of a site in D(S) is less than six. i s Copyright © 2010 SciRes JSEA A Novel Spatial Clustering Algorithm Based on Delaunay Triangulation 144 Figure 2. A data set (n=15) and its Delaunay triangulation 4) The Delaunay triangulation is the most well propor- tioned over all triangulations of S, in that the size of the minimum angle over all its triangles is the maximum possible. 5) If pi and pj form a triangle in D(S), th en the inter ior of this triangle contains no other point of . S 6) The triangulation D(S) can be robustly computed in time. )log(nnO 7) The minimum spanning tree is a subgraph of the Delaunay triangulation, and, in fact, a single-linkage clustering (or dendrogram) can be found in time from D(S). )log( nnO Figure 2 shows a set of 15 data points and its corre- sponding Delaunay triangulation. More information re- garding Delaunay triangulations and Voronoi diagrams can be found in other literature. From Figure 2, we can conclude that, in a proximity graph like Delaunay trian- gulation (Delaunay diagram); the points are connected by edges, if and only if they seem to be close by some proximity measure [26]. By applying to this rule, if two points are connected by a small enough Delaunay edge, the two points belong to the same cluster. 3. Initialization Using the Delaunay Triangulations 3.1 Data Preprocessing Given a set of data pointsin the plane (as shown in Figure 2), .The triangula- tions were computed by Bowyer-Watson algorithm in time. In the creation process of Delaunay triangulation, we recorded node, edge and surface infor- mation of Delaunay triangulation for clustering later. This nodes, edges and surfaces information was stored in Oracle database. Oracle database includes numerous data structures to improve the speed of SQL queries. Taking advantage of the low cost of disk storage, Oracle in- cludes many new indexing algorithms that dramatically increase the speed with which Oracle queries are ser- viced. And, Oracle database includes so many statistical functions which include descriptive statistics, hypothesis testing, and correlations analysis, for distribution fit and so forth. The statistical functions in the database can be used in a variety of ways, for example, we can call Ora- cle’s DBMS_STAT_FUNCS functions to obtain basic cont, mean, max, min and standard deviation information of Delaunay triangulation edges. For Figure 2, we got three tables as follows: },,,{ 110 n pppS 15n )log(nnO In the Table 1, the first column is the index of the points in S, the second column is X coordinate and the third column is Y coordinate respectively. The degree denotes the number of Delaunay edges which incident to a point. The “ClassType” column represents the category number after clustering process, and after clustering process if it is -1, we think the point is an outlier or noise. In the Table 2, the second column is the index of the edge’s starting point, and the third column is the index of the edge’s end point. The length of edges is represented by the fourth column. In our algorithm, every edge is needed to be compute d o nly once. The chart illustrates the table structure and relation- ships of the three tables. The Delaunay triangulation node table contains all the spatial objects (points); the Table 1. Delaunay triangulation nodes table IndexX Y Degree ClassType 1 3853964.924-803305.9261 6 -1 2 3853985.696-803331.6837 4 -1 3 3853998.714-803330.2989 6 -1 4 3853994.005-803335.8381 5 -1 5 3853992.066-803329.7449 4 -1 6 3854013.393-803354.3946 6 -1 7 3853989.297-803371.0124 6 -1 8 3853990.128-803377.6595 4 -1 9 3853996.221-803375.7208 5 -1 10 3854049.121-803327.2523 5 -1 11 3854045.52 -803337.4999 7 -1 12 3854058.538-803336.669 4 -1 13 3854051.337-803334.4533 3 -1 14 3854039.981-803371.8433 4 -1 15 3854047.459-803375.7208 4 -1 Copyright © 2010 SciRes JSEA A Novel Spatial Clustering Algorithm Based on Delaunay Triangulation Copyright © 2010 SciRes JSEA 145 Table 2. Delaunay triangulation edge table Index Start End Length 1 1 8 76.03228669 2 8 7 6.69884313 3 7 1 69.50014083 4 1 7 69.50014083 5 7 2 39.49321264 6 2 1 33.08972562 7 1 2 33.08972562 8 2 5 6.65851675 9 5 1 36.11126413 10 1 5 36.11126413 …… …… …… …… Table 3. Delaunay triangulation surface table In- dex Node1 Node2 Node3 Edge1 Edge2 Edge3 1 1 8 7 1 2 3 2 1 7 2 4 5 6 …… …… …… …… …… …… …… Figure 3. Table structure and relations in the database Delaunay triangulation edge table includes all the Delaunay edges and the relationships with the Delaunay triangula- tion node table. And the Delaunay triangulation surface table records all the Delaunay triangulation surfaces and the relationships with Delaunay triang ulation nodes table as well as Delaunay edge table. 3.2 Some Definitions and Notions in NSCABDT Given a set of data pointsin the plane (as shown in Figure 2), after data preprocessing, we got the nodes, edges and surfaces information of the Delau- nay triangulation. Given a set of edges, for each edge },,,{ 110 n pppS ,{0 eE},, 11 n ee )10( nkekin E is a record of Delaunay triangulation edge table. For each edge , jik ppe , )10( nk i p i pj p is its starting point, and is its end point. Both and belong to . And denotes a set of edges which in cident to. j p )( i pN i p S i p Definition 1 (Local_Mean): We denote by mean () the mean length of edges in . )N(i p )( )( ) i(_ MeanLocal ) i i (d j d p1 p p i p S j eLen (6) where denotes the degree of in graph theory; and denotes the length of the Delaunay edge . )(i pd )( j eLen j e Definition 2 (Global_Mean): We denote by mean the mean length of edges in E . )( )EsumLen E)(ej )(_ SMean )E ( 1 sum j Global (sum (7) where is the number of edges in E . Definition 3 (Global_Sta_Dev): We denote by global standard deviation of the lengths of all edges. That is, n eS i 2 )))( ) Len( Re Mean (p Global SDevStaGlobal n i 0_( (__ _Local (8) Definition 4 (Relative_Mean): We let denote the ratio of andGlobal . That is, )(i pMean )(SMean _lative _ ) i Mean (9) )(S_)_Re Meanplative i (Mean_)( LocalpMean iGlobal Definition 5 (Positive Edge): If the length of a Delau- nay edge is less than the given criterion function , the edge is a positive edge. Positive edges and points incident to them form a new proximity graph and the newly created graph is subgraph of the Delaunay graph (Delaunay Triangulation). )(i pF Definition 6 (Positive path): A path in current prox- imity graph where every edge in the path is a positive edge; and all the points connected by actives paths be- long to a cluster. Finally, this edge analysis is captured in a criterion function . The cut-off value for edges incident in is defined as follows: (F i p ) i p )(_ )(_ )(_ )(_ (_ 1 i i pMean SMean SMean pMeanlative SMeanGlobal )(_ Re (_)( i Local Global SStaGlobal DevGlobalpF _ )S_ Dev Sta) Global (10) A Novel Spatial Clustering Algorithm Based on Delaunay Triangulation 146 Figure 4. NSCABDT procedure Definition 7 (Effective Region): For a point p in , we call S },,|||{ 2 RxSpxpxp (11) The effective region of point p with respect to radius . As the union of the effective region of each point, we define the effective region of a random point set. For a point set S, we call pS Sp (12) The effective region of point set with respect to radius S . Definition 8 (γ – boundary): We define the boundary set as )}({\, DmVVVVVV (13) where γ>1 is a constant and is the set of all points in the circle with the radius }|||{ xxDm . We call the γ – boundary of the point set S. V Definition 9 (γ – curve): The principal boundary of a random point set is the principal manifold of the point in the γ – boundary of a point set. We also call this principal manifold extracted from point set S the γ – curve of. S If the edges which incident toare greater than or equal to, the edges are eliminated and the edges that are less than the criterion function survive. For each definition above, it is not necessary to iteratively calcu- late the results by programming; because we can get the results by using oracle statistical functions. For example, for Definition 1, a SQL statement can be created as fol- lows: SELECT avg(length) FROM edgetable WHERE start =, where “edgetable” is the table which restores the edges information of Delaunay triangulation. i p )( i pF i p We now present the algorithm of NSCABDT: Initialize the points of a data points setas being as- signed to no cluster; Initialize an empty data set; S C 1) Create Delaunay triangulation and record the in- formation of Delaunay triangulation in Oracle da- tabase. 2) For each nodei pin Delaunay triangulation, extract edges )( i pN incident to nodei pvia SQL queries and calculate)( as well as)( i pF . _i pMeanLocal 3) For each edgeein )( i pN , if)() , the edge will be deleted. (i pFeLen 4) After 3, if0)( i pd , the nodei pwill be deleted, otherwise, the nodei pis added toC. 5) Using the same method, iteratively calculate all the nodes which co nnect withi p. 6) Extract the boundary of the clusterCand eliminate the bridges. 7) If all the points have been not processed, end the process. Otherwise, initialize a new empty data setC, go to next un -processed node. Phase 1 of NSCABDT is th e construction o f Delaunay triangulation. Then, recursively, all points in a connected component are reported as a cluster. Thus every edge is tested for the criterion function only once. After elimi- nating no-interesting edges and noises, only positive edges are remaining. According to the positive path, we can iteratively find all the points connected by positive paths and add the points to a cluster. Obviously, it can be seen from the Figure 4 that it con- sists of two phases; the first phase is building Delaunay Triangulations from spatial objects. And on the second phase, we eliminate all edges in the way which we in- troduced above. And then, we got that point 1, point 6, point 14 and point 15 are outliers. In order to eliminate the bridges between two different clusters, a detection of cluster boundary is executed; the algorithm is according to [5 ]. The boundary of a poin t set is extracted by the principal curve analysis. The principal curve analysis is a generalization of principal axis analy- sis, which is a standard method for data analysis in pat- tern recognition. For a clus ter, if we can get two different boundaries, we think there are two smaller clusters in the point set, and bridges exist between the two smaller clusters. If an edge is not in the boundaries, the edge should be delete d. The algorithm for eliminating the bridges between two Copyright © 2010 SciRes JSEA A Novel Spatial Clustering Algorithm Based on Delaunay Triangulation147 different clusters is as follows: 1) For the collection of all edges E, get the median length via SQL queries. And set it as . 2) Get the effective region of random point setV. 3) Get boundary of random point setV. 4) Get curve of random point setV. Although, the construction of Delaunay triangulation using all points in S is a time-consuming process for a large number of points even if we use an optimal algo- rithm, we can use the information stored in the database instead of the construction of Delaunay triangulation again and we also can get the median length via SQL queries. Obviously, it is more efficient. 4. Experimental Results We evaluate NSCABDT according the three major re- quirements for clustering algorithms on large spatial da- tabases as stated above. We compare NSCABDT with the clustering algorithm DBSCAN in terms of effectivity and efficiency. The evaluation is based on an implemen- tation of NSCABDT in .NET 2005. All the experiments were run on Windows Server 2003. 4.1 Discovery of Clusters with Arbitrary Shape Clusters in spatial databases may be of arbitrary shape, e.g. spherical, drawn-out, linear, elongated etc. Further- more, the databases may contain noise [27]. We used visualization to evaluate the quality of the clusterings obtained by the NSCABDT. In order to create readable visualizations without using color, in these experiments we used small databases. Due to space limitation, we only present the results from one typical database which was generated as follows: 1) Draw three polygons of different shape (one of them with a hole) for three clusters. 2) Generate 500, 200 and 200 uniformly distributed points in each polygon respectively. 3) Insert 100 noise points into the database, which is depicted in Figure 5. For NSCABDT, we set 10% noise for the sample da- tabase. NSCABDT discovers all clusters and detects the noise points from the sample database. The clustering result of NSCABDT on this database is shown in Figure 7. Different clusters are depicted using different symbols and noise is represented by crosses. This result shows that NSCABDT assigns nearly all points to the correct clusters. 4.1 Efficiency It has been proved that DBSCAN has better performan- cethan partitioning and hierarchical algorithms for spatial data mining, so we only compare our algorithm with DBSCAN [28]. In the following, we compare NSCA- BDT with DBSCAN with respect to efficiency on syn- Figure 5. A data set and its Delaunay triangulation (n=1000) Figure 6. Extract the boundary of the cluste r and eliminate the bridges Figure 7. Clustering by NSCABDT. Finally we got 3 clus- ters thetic databases. The run time and correct rate for NS- CABDT, DBSCAN on these test databases are listed in Table 4. We generated some large synthetic test databases with 5000, 6000, 7000, 8000, 9000 and 10000 points to test the efficiency and scalability of DBSCAN and NSCA- BDT. We can conclude that NSCABDT is significantly slower than DBSCAN (see Figure 8), but the correct rate of NSCABDT is higher than DBSCAN (see Figure 9). Because our approach does not require any assump- tions or declarations concerning the distribution of the Copyright © 2010 SciRes JSEA A Novel Spatial Clustering Algorithm Based on Delaunay Triangulation 148 Table 4. Run time in seconds Number of Points 5000 6000 7000 8000 9000 10000 Correct rate Run time Correct rate Run time Correct rate Run time Correct rate Run time Correct rate Run time Correct rate Run time DBSCAN 97.8% 36.797.2% 52.8 97.4%72.697.9%93.7 97.5%121.5 97.8% 154.6 NSCABDT 99.7% 48.299.8% 71.4 99.8%93.899.7%117.999.7%151.3 99.7% 189.4 data, the parameters of DBSCAN is difficult to be fixed. If the parameters are not inappropriate, the correct rate will be very low. DBSCAN must continually ask for as- sistance from the user. The reliance of DBSCAN on user input can be eliminated using our approach. 5. Conclusions The application of clustering algorithms to large spatial databases raises the following requirements [27]: 1) minimal number of input parameters, 2) discovery of clusters with arbitrary shape and 3) efficiency on large databases. The w ell-kn own clu ster ing algo rithms off er no solution to the combination of these requirements. In this paper, we introduce the new clustering algo- rithm NSCABDT. Our notion of a cluster is based on the distance of the points of a cluster to their neighbors. The neighboring region formed in our algorithm reflects the neighbor’s distribution. Experimental results demon- strated that our clustering algorithm can provide signifi- cant improvement of accuracy of the cluster detecting, especially for objects with arbitrary and linear distribu- tion. Figure 8. Efficiency: SCABDT VS DBSCAN Figure 9. Correct rate: SCABDT VS DBSCAN REFERENCES [1] G. Piatetsky-Shapiro and W. J. Frawley. “Knowledge discovery in databases,” AAAI/MIN Press, 1999. [2] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. “The KDD process for extracting useful knowledge from vol- umes of data,” Communications of ACM, Vol. 39, 1996. [3] S. Shekhar, C. T. Lu, P. Zhan g, and R. Liu, “Data mining for selective visualization of large spatial datasets,” Proc- essing of 14th IEEE international conference on tools with artificial intelligence (ICTAI’02), 2002. [4] J. Han and M. Kamber, “Data mining: Concepts and Techniques,” Academic Press, 2001. [5] I. Atsushi and T. Ken, “Graph-based clustering of random point set,” Structural, Syntactic and Statistical Pattern Recognition, Springer Berlin, pp. 948–956, 2004. [6] R. T. Ng and J. Han, “CLARANS: A method for cluster- ing objects for spatial data mining,” IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 5, pp. 1003–1016, 2002. [7] S. Guha, R. Rastogi, and K. Shim, “CURE: An efficient clustering algorithm for large databases,” Proceedings of the 1998 ACM SIGMOD international conference on Management of data, ACM Press, pp. 73–84, 1998. [8] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: An efficient data clustering method for very large databases,” Proceedings of the 1996 ACM SIGMOD international conference on Management of data, ACM Press, pp. 103–114, 1996. [9] L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data: An introduction to cluster analysis,” John Wiley & Sons, 1990. [10] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, “Density- based algorithm for discovering clusters in large spatial databases with noise,” Proceedings of the 1996 Knowl- edge Discovery and Data Mining (KDD’96) int ernational conference, AAAI Press, pp. 226–231, 1996. [11] M. Ankerst, M. M. Breunig, H. P. Kriegel, et al., “OP- TICS: Ordering points to identify the clustering struc- ture,” Proceedings of the International Conference on Management of Data (SIGMOD), ACM Press, pp. 49–60, 1999. [12] J. Sander, M. Ester, H. P. Kriegel, and X. Xu, “Den- sity-Based Clustering in Spatial Databases: The algorithm GDBSCAN and its applications,” Data Mining and Knowledge Discovery, Vol. 2, No. 2, pp.169–194, 1998. [13] X. Wang and H. J. Hamilton, “DBRS: A density-based spatial clustering method with random sampling,” Pro- ceedings of the 7th PAKDD, Springer, pp. 563–575, 2003. Copyright © 2010 SciRes JSEA A Novel Spatial Clustering Algorithm Based on Delaunay Triangulation149 [14] R. P. Haining, “Spatial data analysis in the social and environmental sciences,” Cambridge University Press, 1990. [15] J. Han and M. Kamber, “Data Mining: Concepts and techniques,” Second Edition, Morgan Kaufmann, 2006. [16] V. Estivill-Castro and I. Lee, “AMOEBA: Hierarchical clustering based on spatial proximity using delaunay dia- gram,” Proceedings of the 9th international symposium on spatial data handling, pp. 7a. 26–7a. 41, 2000. [17] E. Schikuta and M. Erhart, “The BANG-clustering sys- tem: Grid-based data analysis,” Proceedings of the 2nd international symposium IDA-97, Advances in intelligent data analysis, Springer-Verlag, pp. 513–524, 1997. [18] S. Openshaw, “A mark 1 geographical analysis machine for the automated analysis of point data sets,” Interna- tional Journal of GIS, Vol. 1, No. 4, pp. 335–358, 1987. [19] In-Soo Kang, Tae-wan Kim, and Ki-Joune Li, “A spatial data mining method by delaunay triangulation,” Proceed- ing of 5th ACM Workshop on Geographic Information Systems, Las Vegas, Nevada, pp. 35–39, 1997. [20] C. Eldershaw and M. Hegland, “Cluster analysis using triangulation,” Computational Techniques and Applica- tions (CTAC97), World Scientific, Singapore, pp. 201– 208, 1997. [21] V. Estivill-Castro and I. Lee, “AUTOCLUST: Automatic clustering via boundary extraction for mining massive point-data sets,” Proceedings of the 5th international con- ference on geocomputation, 2000. [22] H. J. Miller, “Geographic data mining and knowledge discovery,” Handbook of geographic information science. Malden, MA: Blackwell, pp. 149–159, 2009. [23] V. Estivill-Castro and M. E. Houle., “Robust Distance- based clustering with applications to spatial data mining,” Algorithmica, Vol. 30, No. 2, pp. 216–242, 2001. [24] W. R. Tobler, “A computer movie simulating urban growth in the Detroit region,” Economic Geography, Vol. 46, No. 2, pp. 234–240, 1970. [25] B. Delaunay, “Sur la sphère vide, Izvestia Akademii Nauk SSSR, Otdelenie Matematicheskikh i Estestven- nykh Nauk,” pp. 793–800, 1934. [26] G. Liotta, “Low degree algorithm for computing and checking gabriel graphs”, Report No. CS–96–28, De- partment of Computer Science in Brown University, Providence, 1996. [27] X. Xu, M. Ester, H. Kriegel, and J. Sander, “A distribu- tion-based clustering algorithm for mining in large spatial databases,” Proceedings of the 14th International Con- ference on Data Engineering (ICDE’98), pp. 324–331, 1998. [28] D. Y. Ma and A. D. Zhang, “An adaptive density-based clustering algorithm for spatial database with noise,” ICDM, Proceedings Fourth IEEE International Confer- ence, pp. 467–470, 2004. Copyright © 2010 SciRes JSEA |