Cluster analysis is one of the major data analysis methods and is widely used in many practical applications in emerging areas of data mining. A good clustering method produces high-quality clusters with high intra-cluster similarity and low inter-cluster similarity. Clustering techniques are applied in different domains to predict future trends from available data and to put those trends to real-world use. This research work compares the performance of two of the most representative partition-based clustering algorithms, namely k-Means and k-Medoids. Both algorithms are implemented and their performance is analyzed in terms of clustering result quality, execution time and other components. Telecommunication data is the source data for this analysis. Connection-oriented broadband data is given as input to assess the clustering quality of the algorithms. The distance between the server locations and their connections is considered for clustering. The execution time of each algorithm is analyzed and the results are compared with one another. The results of the comparison study are satisfactory for the chosen application.
Data Mining (DM) is a convenient way of extracting patterns that represent knowledge implicitly stored in large data sets; it focuses on issues relating to their feasibility, usefulness, effectiveness and scalability. Data mining approaches and technology are used to extract unknown patterns from large sets of data for business and real-time applications. It can be viewed as an essential step in the process of knowledge discovery. Data are normally preprocessed through data cleaning, data integration, data selection and data transformation, and are then prepared for the mining task. Started as little more than a dry extension of DM techniques, DM is now bringing important contributions in crucial fields of investigations. Among the traditional sciences like astronomy, high energy physics, biology and medicine [
Data mining can be performed on various types of databases and information repositories, but the kinds of patterns to be found are specified by various data mining functionalities such as class/concept description, association, correlation analysis, classification, prediction and cluster analysis. Among these, cluster analysis is one of the major data analysis methods widely used for many practical applications in emerging areas [
The remainder of the paper is structured as follows. The next section provides a comprehensive outline of related work via literature survey. Section 3 describes the basic approach and method of both algorithms. An experimental setup of the telecommunication data and the properties of the same data are discussed in Section 4. Section 5 explores the clustering process and obtained results of the algorithms. Finally, Section 6 contains the concluding remarks of the research work.
Data clustering has attracted the attention of many researchers in different disciplines. It is an important and useful technique in data analysis, and a large number of clustering algorithms have been put forward and investigated. The main advantage of clustering is that interesting patterns and structures can be found directly from very large data sets with little or no background knowledge. The cluster results are not subjective, but they are implementation dependent. A variety of data clustering algorithms have been developed and applied to many application domains in the field of data mining, and clustering techniques have been applied to a wide variety of research problems. Hartigan provides an excellent summary of the many published studies reporting the results of cluster analyses [
Bradley P.S and Fayyad describe refining Initial Points for k-Means Clustering in their paper [
A review of the most common partition algorithms in cluster analysis: a comparative study is discussed in a research work by Susana et al., in [
An Enhanced k-means algorithm to improve the Efficiency Using Normal Distribution Data Points is discussed by Napoleon and Ganga Lakshmi in their research work [
A Novel Approach to Medical Image Segmentation is presented by Shanmugam et al., in their paper [
There are a number of research articles that use broadband data for the analysis of various types of networks. Some clustering algorithms have also been used to analyze telecommunication data. One such study, by Sung Suk Kim and Sun Ok Yang, is titled “Wireless sensor gathering data during long time involving both telecommunication data and clustering algorithms”. In this paper [
Clustering is a way to determine patterns by mapping and analyzing the available data set according to the needs and demands of business applications. Clustering belongs to both the data analysis and the machine learning domains. Many methodologies have been proposed to organize, summarize or simplify a dataset into a set of clusters such that data belonging to the same cluster are similar and data from different clusters are dissimilar [
The k-Means is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori [
The objective function

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \qquad (1)$$

where $\| x_i^{(j)} - c_j \|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$, is an indicator of the distance of the n data points from their respective cluster centers. The algorithm is composed of the following steps:
Step 1: Place k points into the space represented by the objects that are being clustered. These points represent initial group centroids.
Step 2: Assign each object to the group that has the closest centroid.
Step 3: When all objects have been assigned, recalculate the positions of the k centroids.
Step 4: Repeat Steps 2 and 3 until the centroids no longer move.
This produces a separation of the objects into groups from which the metric to be minimized can be calculated. Although it can be proved that the procedure will always terminate, the k-Means algorithm does not necessarily find the optimal configuration, corresponding to the global objective function minimum [
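Steps 1 to 4 and the objective in Equation (1) can be sketched in a few lines of Python. This is only an illustrative sketch, not the MATLAB implementation used in this work: the function names, the toy points and the choice of picking the initial centroids at random from the data are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100, seed=0):
    """Plain k-Means following Steps 1-4; returns (centroids, labels, J)."""
    rng = random.Random(seed)
    # Step 1: place k initial centroids (assumption: k distinct data points).
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 3: recalculate each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                dim = len(members[0])
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Objective J of Equation (1): summed squared distance to the own centroid.
    j_value = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, j_value


# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels, j_value = k_means(points, k=2)
```

On data this cleanly separated any starting choice ends in the same partition, but in general a different seed can lead to a different local minimum of J, which is the limitation noted above.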
The k-Means algorithm is sensitive to outliers since an object with an extremely large value may substantially distort the distribution of data [
Input: k: The number of clusters
D: A data set containing n objects
Output: A set of k clusters that minimizes the sum of the dissimilarities of all the objects to their nearest medoid.
Method: Arbitrarily choose k objects in D as the initial representative objects;
Repeat assign each remaining object to the cluster with the nearest medoid;
Randomly select a non-medoid object Orandom;
Compute the total cost S of swapping a medoid object Oj with Orandom;
if S < 0 then swap Oj with Orandom to form the new set of k medoids;
Until no change;
It attempts to determine k partitions for n objects. After an initial random selection of k medoids, the algorithm repeatedly tries to make a better choice of medoids. Because every cluster is represented by one of its own objects, the algorithm is often called a representative-object-based algorithm.
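The swap rule in the pseudocode above (accept a swap only when the cost change S is negative) can be sketched as follows. This is a hedged illustration rather than the implementation used in this work: instead of drawing Orandom at random it tries every non-medoid candidate in turn, so that the result is deterministic, and the function names and toy points are assumptions.

```python
import math


def total_cost(points, medoid_idx):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoid_idx) for p in points)


def k_medoids(points, k):
    """Swap-based k-Medoids; returns (medoid indices, labels, total cost)."""
    medoids = list(range(k))  # arbitrary initial choice: the first k objects
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for candidate in range(len(points)):
                if candidate in medoids:
                    continue
                trial = medoids[:pos] + [candidate] + medoids[pos + 1:]
                cost = total_cost(points, trial)
                # S = cost after the swap minus cost before it; keep if S < 0.
                if cost - best < 0:
                    medoids, best = trial, cost
                    improved = True
    # Final assignment of each object to its nearest medoid.
    labels = [
        min(range(k), key=lambda j: math.dist(p, points[medoids[j]]))
        for p in points
    ]
    return medoids, labels, best


# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
medoids, labels, cost = k_medoids(points, k=2)
```

Each candidate swap re-evaluates the cost over all points, which is one reason a k-Medoids run can take noticeably longer than a k-Means run on the same data.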
Data mining concepts are used in different applications according to the need, demand, nature of the problem and domain. In this research, the clustering process is carried out using a distance method. The clustering process aims to minimize the expenditure of the business application and to increase its benefits. The algorithms are applied to real-time connection-oriented telecommunication data and the results are discussed. In this process, the communication connection structure is evaluated and reconstructed using clustering techniques for effective data distribution. The data distribution process is affected by the connected server, the distance and the number of connections available in the specific server. The distance factor affects the creation of the cable infrastructure, the cost of the cable, manpower, maintenance and the bandwidth-based data distribution. Therefore, the data access points are treated as data points and the network is optimized using clustering concepts. After the clustering process, the number of connections for each server is changed.
The data set was collected from a broadband service provider in Chennai city. The connection-oriented data set contains 285,520 data connection points and 27 servers with their locations. The 27 servers are treated as 27 clusters in this work and are called data centers. The user points are called data access points. There are 12 data sets, one for each month. Each data set contains information about distance, type of connection (single user, multi user), data transfer capacity (256, 512, 1024, 2048), area code and the server number to which the data points are connected. This representation is based on the connections established from January to December. The collected data consist of the connection establishment month, the area and the connected data center, the type of data service and the volume of data used in the year. The data access points are connected according to their geographical location. Each connection is made by the service provider based on the demand of the customer. The total number of connected data points for each month is given in
The total number of data access points (the number of user points before clustering) in each of the 27 servers is given in
Month | Connection Numbers (From) | Connection Numbers (To) | Total Connections |
---|---|---|---|
Jan | 1 | 25,094 | 25,094 |
Feb | 25,095 | 45,416 | 20,322 |
Mar | 45,417 | 66,977 | 21,561 |
Apr | 66,978 | 94,928 | 27,951 |
May | 94,929 | 122,271 | 27,343 |
Jun | 122,272 | 143,345 | 21,074 |
Jul | 143,346 | 168,791 | 25,446 |
Aug | 168,792 | 189,130 | 20,339 |
Sep | 189,131 | 209,432 | 20,302 |
Oct | 209,433 | 231,583 | 22,151 |
Nov | 231,584 | 261,107 | 29,524 |
Dec | 261,108 | 285,520 | 24,413 |
Grand Total | | | 285,520 |
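The monthly totals in the table above follow from the connection number ranges as To − From + 1, and they add up to the grand total. A short check, using only the figures from the table (the variable names are illustrative):

```python
# (From, To) connection numbers per month, copied from the table above.
ranges = {
    "Jan": (1, 25094),
    "Feb": (25095, 45416),
    "Mar": (45417, 66977),
    "Apr": (66978, 94928),
    "May": (94929, 122271),
    "Jun": (122272, 143345),
    "Jul": (143346, 168791),
    "Aug": (168792, 189130),
    "Sep": (189131, 209432),
    "Oct": (209433, 231583),
    "Nov": (231584, 261107),
    "Dec": (261108, 285520),
}

# Both ends of each range are inclusive, hence the "+ 1".
totals = {month: last - first + 1 for month, (first, last) in ranges.items()}
grand_total = sum(totals.values())

print(totals["Feb"], grand_total)
```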
Server | Total | Server | Total | Server | Total |
---|---|---|---|---|---|
1 | 14,101 | 10 | 13,019 | 19 | 12,845 |
2 | 9607 | 11 | 12,922 | 20 | 11,534 |
3 | 9493 | 12 | 12,923 | 21 | 9778 |
4 | 11,024 | 13 | 13,203 | 22 | 8144 |
5 | 12,662 | 14 | 13,040 | 23 | 6697 |
6 | 12,964 | 15 | 12,904 | 24 | 5130 |
7 | 13,085 | 16 | 13,000 | 25 | 3363 |
8 | 13,098 | 17 | 13,120 | 26 | 1819 |
9 | 12,933 | 18 | 12,833 | 27 | 279 |
The k-Means and k-Medoids algorithms are implemented using MATLAB software and the results are discussed in this section. In the implementation process, the data set is processed based on distance. Initially, the first data center and the first month (January) are selected. For the selected data center and month, the distance is reconstructed and stored in the process matrix. The distance is reconstructed using the Pythagorean theorem. After the reconstruction of the distance, the data access points are clustered using one of the two algorithms. In this process, the user points are reassigned among the data centers, so each data center has a new number of data access points after the process. The computational time and the number of connections in each server are stored in tables; that is, the start and end times of the clustering process are recorded. Next, keeping the same first data center, the data access points of the second month (February) are chosen and clustered using the chosen algorithm. This process is repeated up to the data of the last month (December). After the 12th month has been processed for the first server, the second server is chosen, and the process is repeated for all 27 servers. The number of connections in each data center is considered the application impact and the process time is considered the computation impact. The algorithmic steps involved in the clustering process are summarized below.
1) Select the algorithm (k-Means or k-Medoids)
2) Select the data center
3) Calculate the distance between the data access points and the servers based on the selected data center
4) Select the monthly data
5) Run the selected algorithm to cluster the distances
6) Reassign the data points to the data centers according to the resulting clusters
7) Record the start time and completion time of the clustering process
8) Summarize the number of connections in each data center
9) Repeat steps (2) to (8) for all the data sets
10) Represent the data connections according to the newly assigned data centers
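The loop structure of steps 2 to 9 can be sketched as below. The sketch rests on several assumptions that the text does not confirm: that servers and access points are available as planar (x, y) coordinates, so that the Pythagorean theorem gives the distance directly, and that the clustering step can be passed in as a function. The names `servers`, `monthly_points` and `cluster_fn`, and the use of `sorted` as a stand-in for the clustering call, are hypothetical.

```python
import math
import time


def pythagorean_distance(server_xy, point_xy):
    """Straight-line distance from the legs dx and dy (Pythagorean theorem)."""
    dx = point_xy[0] - server_xy[0]
    dy = point_xy[1] - server_xy[1]
    return math.sqrt(dx * dx + dy * dy)


def run_experiment(servers, monthly_points, cluster_fn):
    """Loop over data centers and months, timing each clustering call."""
    timings = {}
    for server_id, server_xy in servers.items():      # step 2
        for month, pts in monthly_points.items():     # step 4
            # Step 3: distance of every access point to the chosen data center.
            distances = [pythagorean_distance(server_xy, p) for p in pts]
            start = time.perf_counter()               # step 7: start time
            cluster_fn(distances)                     # step 5: cluster the distances
            timings[(server_id, month)] = time.perf_counter() - start
    return timings


# Hypothetical toy input: two servers and two months of access points.
servers = {1: (0.0, 0.0), 2: (3.0, 4.0)}
monthly_points = {"Jan": [(3.0, 4.0), (6.0, 8.0)], "Feb": [(0.0, 5.0)]}
# sorted() is only a placeholder for a real k-Means or k-Medoids call.
timings = run_experiment(servers, monthly_points, cluster_fn=sorted)
```

Timing only the clustering call, and not the distance reconstruction, keeps the measured time comparable between the two algorithms.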
Clustering approaches usually yield different kinds of results, depending on the nature of the problem and the chosen application. The next section discusses the results produced by the algorithms and their interpretation.
The k-Means clustering algorithm is implemented as per the discussion above.
Month | k-Means Start Time | k-Means End Time | k-Means Elapsed Time (s) | k-Medoids Start Time | k-Medoids End Time | k-Medoids Elapsed Time (s) |
---|---|---|---|---|---|---|
1 | 29.018 | 45.671 | 16.653 | 41.047 | 15.6 | 34.505 |
2 | 29.363 | 38.942 | 9.579 | 0.460 | 33.7 | 33.205 |
3 | 27.144 | 4.152 | 37.008 | 20.023 | 56.5 | 36.491 |
4 | 43.083 | 1.773 | 18.690 | 38.515 | 18.3 | 39.768 |
5 | 41.498 | 2.019 | 20.521 | 57.559 | 35.8 | 38.246 |
6 | 42.292 | 58.14 | 15.848 | 14.983 | 47.2 | 32.247 |
7 | 37.016 | 53.722 | 16.706 | 25.383 | 56.9 | 31.531 |
8 | 33.336 | 42.165 | 8.829 | 35.442 | 8.4 | 32.958 |
9 | 21.446 | 29.496 | 8.050 | 46.303 | 19.2 | 32.890 |
10 | 8.745 | 24.782 | 16.037 | 57.410 | 34 | 36.543 |
11 | 3.969 | 35.325 | 31.356 | 12.646 | 56.5 | 43.837 |
12 | 16.26 | 32.078 | 15.818 | 36.650 | 13.2 | 36.568 |
Average Process Time | | | 17.925 | | | 35.732 |
The average process time for the 12-month data is available in the last row, which is found to be 17.925 sec. The data points in each data center are clustered (distributed) using the k-Means algorithm based on the neighborhood distance. The minimum time taken by the algorithm is 8.050 seconds and the maximum is 37.008 seconds. To avoid lengthy discussion, the number of data points created by the algorithm is not shown. The first server is chosen as a center point; in the same way the second server is chosen and the clustering process is repeated.
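The k-Means elapsed times in the table above can be reproduced from the start and end columns. The figures are consistent with the start and end values being the seconds part of a clock reading, so that an end value smaller than the start value means the minute rolled over; this reading is an inference from the numbers, not something stated in the text.

```python
# (start, end) readings for the k-Means runs, months 1-12, from the table.
k_means_times = [
    (29.018, 45.671), (29.363, 38.942), (27.144, 4.152), (43.083, 1.773),
    (41.498, 2.019), (42.292, 58.14), (37.016, 53.722), (33.336, 42.165),
    (21.446, 29.496), (8.745, 24.782), (3.969, 35.325), (16.26, 32.078),
]


def elapsed_seconds(start, end):
    """Elapsed time assuming both readings are seconds within a minute."""
    return (end - start) % 60.0


elapsed = [round(elapsed_seconds(s, e), 3) for s, e in k_means_times]
average = round(sum(elapsed) / len(elapsed), 3)

print(min(elapsed), max(elapsed), average)
```

The same rule cannot be checked as precisely for the k-Medoids columns, because their end times are listed with only one decimal place.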
In a similar fashion, the results of the k-Medoids algorithm are also given in the
Run | k-Means | k-Medoids | Run | k-Means | k-Medoids |
---|---|---|---|---|---|
1 | 17.925 | 35.732 | 7 | 15.919 | 39.614 |
2 | 17.142 | 31.924 | 8 | 17.224 | 37.899 |
3 | 16.822 | 33.417 | 9 | 18.700 | 50.904 |
4 | 19.664 | 37.467 | 10 | 18.984 | 57.659 |
5 | 19.796 | 35.511 | 11 | 19.474 | 43.953 |
6 | 16.129 | 38.747 | 12 | 16.243 | 42.728 |
Average | 17.84 | 40.46 |
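The averages in the last row of the table above can be recomputed from the twelve runs, together with the ratio between the two algorithms. The values are taken from the table; the variable names are illustrative.

```python
from statistics import mean

# Average process time (seconds) of runs 1-12, copied from the table above.
k_means_runs = [17.925, 17.142, 16.822, 19.664, 19.796, 16.129,
                15.919, 17.224, 18.700, 18.984, 19.474, 16.243]
k_medoids_runs = [35.732, 31.924, 33.417, 37.467, 35.511, 38.747,
                  39.614, 37.899, 50.904, 57.659, 43.953, 42.728]

mean_k_means = mean(k_means_runs)
mean_k_medoids = mean(k_medoids_runs)
# How many times longer k-Medoids takes on average than k-Means.
ratio = mean_k_medoids / mean_k_means

print(round(mean_k_means, 2), round(mean_k_medoids, 2), round(ratio, 2))
```

On these figures k-Medoids takes a little more than twice as long as k-Means on average.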
Data Center | k-Means | k-Medoids | Data Center | k-Means | k-Medoids |
---|---|---|---|---|---|
1 | 10,543 | 10,333 | 15 | 10,276 | 10,502 |
2 | 11,185 | 11,361 | 16 | 10,840 | 10,746 |
3 | 10,816 | 10,639 | 17 | 10,567 | 10,605 |
4 | 10,702 | 10,544 | 18 | 10,559 | 10,538 |
5 | 10,249 | 11,243 | 19 | 10,663 | 10,303 |
6 | 10,773 | 11,038 | 20 | 10,744 | 9808 |
7 | 10,935 | 10,361 | 21 | 10,089 | 10,438 |
8 | 10,372 | 10,457 | 22 | 10,391 | 10,355 |
9 | 10,586 | 10,877 | 23 | 9853 | 10,363 |
10 | 10,311 | 10,888 | 24 | 10,763 | 10,569 |
11 | 10,506 | 10,375 | 25 | 10,487 | 10,014 |
12 | 10,754 | 10,632 | 26 | 10,472 | 10,593 |
13 | 10,519 | 10,613 | 27 | 10,731 | 10,902 |
14 | 10,833 | 10,424 | Total | 285,520 | 285,520 |
A total of 285,520 data access points, spread over 27 data centers, are clustered by the two chosen algorithms. Based on the distance between the data access points and the data centers, the performance and efficiency of the clustering process are analyzed. The k-Means algorithm assigns a minimum of 9853 data points and a maximum of 11,185 data points to a data center after clustering. The minimum and maximum numbers of data points assigned by the k-Medoids method are 9808 and 11,361 respectively.
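The minimum and maximum quoted above, and how much more even the load becomes after clustering, can be checked against the per-server counts before clustering and the per-data-center counts after it. The lists below are copied from the two tables in order of server number; the helper name `spread` is illustrative.

```python
# Connections per server (1-27) before clustering.
before = [14101, 9607, 9493, 11024, 12662, 12964, 13085, 13098, 12933,
          13019, 12922, 12923, 13203, 13040, 12904, 13000, 13120, 12833,
          12845, 11534, 9778, 8144, 6697, 5130, 3363, 1819, 279]
# Connections per data center (1-27) after clustering.
k_means = [10543, 11185, 10816, 10702, 10249, 10773, 10935, 10372, 10586,
           10311, 10506, 10754, 10519, 10833, 10276, 10840, 10567, 10559,
           10663, 10744, 10089, 10391, 9853, 10763, 10487, 10472, 10731]
k_medoids = [10333, 11361, 10639, 10544, 11243, 11038, 10361, 10457, 10877,
             10888, 10375, 10632, 10613, 10424, 10502, 10746, 10605, 10538,
             10303, 9808, 10438, 10355, 10363, 10569, 10014, 10593, 10902]


def spread(counts):
    """Difference between the most and the least loaded data center."""
    return max(counts) - min(counts)


for name, counts in [("before", before), ("k-Means", k_means), ("k-Medoids", k_medoids)]:
    print(name, min(counts), max(counts), spread(counts))
```

The gap between the most and least loaded data center shrinks from 13,822 connections before clustering to 1332 with k-Means and 1553 with k-Medoids.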
of the algorithms. From this figure, it is easy to identify the difference in performance between the two algorithms. The clustered results are analyzed based on several executions of the two algorithms in the MATLAB software. In terms of efficiency, the k-Means method performs better than the k-Medoids method. It is evident that from the
Cluster analysis is still an active field of development. Many cluster analysis techniques do not have a strong formal basis, and cluster analysis remains a rather ad hoc field. There is a wide variety of clustering techniques, and comparisons among them are difficult. All techniques seem to impose a certain structure on the data, yet few authors describe the type of limitations being imposed. In spite of these problems, clustering is an interesting, useful and challenging problem. It has great potential in applications like object recognition, image segmentation, and information filtering and retrieval. However, this potential can be exploited only after making several carefully chosen design and application decisions. The following results were obtained experimentally, from several executions of the programs for the two algorithms studied in this research work. The execution time usually varies from one processor to another, depending on the speed and the type of the system. The advantage of the k-Means algorithm is its favorable execution time. Its drawback is that the user has to know in advance how many clusters are searched for. From the experimental analysis, the distribution of the number of connections over the servers produced by both algorithms after the clustering process is almost even. The computational time of the k-Means algorithm is less than that of the k-Medoids algorithm, and this shorter execution time is where k-Means shows its superiority. Finally, this work concludes that the k-Means algorithm is better than the k-Medoids algorithm for the chosen connection-oriented telecommunication data.
Velmurugan, T. (2018) A State of Art Analysis of Telecommunication Data by k-Means and k-Medoids Clustering Algorithms. Journal of Computer and Communications, 6, 190-202. https://doi.org/10.4236/jcc.2018.61019