In this study, we propose a data preprocessing algorithm called D-IMPACT inspired by the IMPACT clustering algorithm. D-IMPACT iteratively moves data points based on attraction and density to detect and remove noise and outliers, and separate clusters. Our experimental results on two-dimensional datasets and practical datasets show that this algorithm can produce new datasets such that the performance of the clustering algorithm is improved.

Clustering is the process of dividing a dataset into partitions such that intracluster similarity is maximized. Although it has a long history of development, there remain open problems, such as how to determine the number of clusters, the difficulty in identifying arbitrary shapes of clusters, and the curse of dimensionality [

Data preprocessing is often used to improve the quality of data. In relation to clustering, popular applications of data preprocessing are normalization, removing noisy data points, and feature reduction. Many studies have used Principal Component Analysis (PCA) [

Recent studies have focused on new categories of clustering algorithms which prioritize the application of data preprocessing. Shrinking, a data shrinking process, moves data points along the gradient of the density, generating condensed and widely separated clusters [

These two shrinking algorithms share the following limitations:

The process of shifting toward the median of neighbors can easily fracture the cluster (

The direction of the movement vector is not appropriate in specific cases. For example, if the clusters are adjacent and differ highly in density, the median of the neighbors is likely to be located on another cluster.

In addition to the distance, density [

IMPACT [

Clusters fractured after shrinking. (a) Original dataset; (b) Dataset after shrinking

Attraction in IMPACT can be adjusted by various parameters to handle specific types of data. IMPACT is robust to its input parameters and flexibly detects various types of clusters, as shown in the experimental results. However, several steps of IMPACT can be improved, such as noise removal, attraction computation, and cluster identification. IMPACT also has difficulty clustering high-dimensional data.

In this study, we propose a data preprocessing algorithm named D-IMPACT (Density-IMPACT) to improve the quality of the cluster analysis. It preprocesses the data based on the IMPACT algorithm and the concept of density. An advantage of our algorithm is its flexibility in relation to various types of data; it is possible to select an affinity function suitable for the characteristics of the dataset. This flexibility improves the quality of cluster analysis even if the dataset is high-dimensional and non-linearly distributed, or includes noisy samples.

In this section, we describe the data preprocessing algorithm D-IMPACT, which is based on the concepts underlying the IMPACT algorithm. We aim to improve the accuracy and flexibility of the movement of data points in IMPACT by applying the concept of density to various affinity functions. These improvements are described in the subsequent subsections.

The main difference between D-IMPACT and other algorithms is that the movement of data points can be varied by the density functions, the attraction functions, and an inertia value. This helps D-IMPACT detect different types of clusters and avoid many common clustering problems. In this subsection, we describe the scheme to move data points in D-IMPACT. We assume that the dataset has m samples and each sample is characterized by n features. We also denote the feature vector of the i^{th} sample by x_{i}.

We use two formulae to compute the density of a data point based on its neighbors, which are defined as data points located within a radius Φ. This density is calculated with and without considering the distance from the data point to its neighbors. We define the density δ_{i} for the data point x_{i} as

where

where N_{i} denotes the set of neighbors of x_{i}. Unlike den_{1}, the density function den_{2} considers not only the number of neighbors but also the distances to them, to avoid issues relating to the choice of the threshold value Φ. In practical applications, we scale the density to avoid scale differences arising from specific datasets as follows:
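As a concrete sketch (in Python; the paper's own implementation is in C++), the two density functions and the scaling step might look as follows. The exact forms — den_{1} as a neighbor count, den_{2} weighting each neighbor by 1 − d/Φ, and scaling by the maximum density — are our reading of the description above, not the paper's equations.

```python
import numpy as np

def densities(X, phi):
    """Sketch of den_1 (neighbor count) and den_2 (distance-weighted
    count), both scaled to [0, 1]. The exact formulae are assumptions
    based on the description in the text, not the paper's equations."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # pairwise distances
    neighbors = (D <= phi) & ~np.eye(len(X), dtype=bool)         # within radius phi, excluding self
    den1 = neighbors.sum(axis=1).astype(float)                   # den_1: number of neighbors
    den2 = np.where(neighbors, 1.0 - D / phi, 0.0).sum(axis=1)   # den_2: nearer neighbors weigh more
    scale = lambda d: d / d.max() if d.max() > 0 else d          # remove dataset-specific ranges
    return scale(den1), scale(den2)
```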

In our D-IMPACT algorithm, the data points attract each other and thereby move closer together. We define the attraction of data point x_{i} caused by x_{j} as

where d_{ij} denotes the distance between x_{i} and x_{j}. The affinity between two data points can be computed using the following formulae:

These four formulae have been adopted to improve the quality of the movement process in specific cases. The function aff_{1}, used in IMPACT, considers the distance between two data points only. The function aff_{2} considers the effect of density on the attraction; highly aggregated data points cause stronger attraction between them than sparsely scattered ones. This technique can improve the accuracy of the movement process. The function aff_{3} considers the difference between the densities of two data points; two data points attract each other more strongly if their densities are similar. This can be used in the case where clusters are adjacent but have differing densities. The function aff_{4} is a combination of aff_{2} and aff_{3}. The parameter p is used to adjust the effect of the distance to the affinity. Attraction is the key value affecting the computation of the movement vectors. For each specific problem in clustering, an appropriate attraction computation can help D-IMPACT to correctly separate clusters.
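The four affinity variants can be sketched as below. Since the formulae themselves are not reproduced in the text, the bodies are illustrative forms consistent with the description (distance only, density product, density difference, and their combination), not the paper's exact definitions.

```python
def affinities(d, delta_i, delta_j, p=2):
    """Illustrative forms of aff_1..aff_4 between two points: d is their
    distance, delta_* their scaled densities, and p adjusts the effect
    of distance. These bodies are assumptions, not the paper's exact
    formulae."""
    aff1 = 1.0 / d ** p                                   # aff_1: distance only (as in IMPACT)
    aff2 = (delta_i * delta_j) / d ** p                   # aff_2: denser pairs attract more
    aff3 = (1.0 - abs(delta_i - delta_j)) / d ** p        # aff_3: similar densities attract more
    aff4 = delta_i * delta_j * (1.0 - abs(delta_i - delta_j)) / d ** p  # aff_4: combination
    return aff1, aff2, aff3, aff4
```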

Under the effect of attraction, two data points will move toward each other. This movement is represented by an n-dimensional vector called the affinity vector. We denote by a_{ij} the affinity vector of data point x_{i} caused by data point x_{j}. The k^{th} element of a_{ij} is defined as

The affinity vector is a component used to calculate the movement vector.

To shrink clusters, D-IMPACT moves the data points at the border region of original clusters toward the centroid of the cluster. Highly aggregated data points, usually located around the centroid of the cluster, should not move too far. In contrast, sparsely scattered data points at the border region should move toward the centroid quickly. Hence, we introduce an inertia value to adjust the magnitude of each movement vector. We define the inertia value I_{i} of data point x_{i} based on its density by

D-IMPACT moves a data point based on its corresponding movement vector. The movement vector v_{i} of data point x_{i} is the summation of all affinity vectors that affect the data point x_{i}

where a_{ij} is the affinity vector. The movement vectors are then adjusted by the inertia value and scaled by s, which is a scaling value used to ensure the magnitude does not exceed a value Φ, as in the IMPACT algorithm. This scaling value is given by

Finally, each data point is moved using

where x_{i} on the right-hand side is the position of the data point in the previous iteration, and x_{i} on the left-hand side is its position in this iteration. We propose the algorithm D-IMPACT based on this scheme of moving data points.
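Putting the pieces together, one movement iteration might be sketched as follows. The aff_{1}-style attraction, the inertia form I_{i} = 1 − δ_{i}, and the scaling s = Φ / max‖v‖ are assumptions consistent with the description above, not the paper's exact formulae.

```python
import numpy as np

def move_points(X, phi, p=2):
    """One movement iteration (sketch): sum the affinity vectors, damp
    by an assumed inertia I_i = 1 - density_i, and scale so that no
    step exceeds phi."""
    diff = X[None, :, :] - X[:, None, :]            # diff[i, j] = x_j - x_i
    D = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)                     # no self-attraction
    att = 1.0 / D ** p                              # aff_1-style attraction (assumption)
    a = att[:, :, None] * diff / D[:, :, None]      # affinity vectors pointing toward x_j
    v = a.sum(axis=1)                               # movement vector of each point
    density = (D <= phi).sum(axis=1).astype(float)  # neighbor count within phi
    density /= max(density.max(), 1.0)              # scaled density
    v *= (1.0 - density)[:, None]                   # inertia: dense points move less (assumption)
    longest = np.sqrt((v ** 2).sum(-1)).max()
    s = min(1.0, phi / max(longest, 1e-12))         # keep every step within phi
    return X + s * v
```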

D-IMPACT has two phases. The first phase detects noisy and outlier data points and removes them. The second separates clusters by iteratively moving data points based on the attraction and density functions. We set the default value of Th_{noise} to 0.1, which achieved the best results in our experiments.

First, the distance matrix is calculated. The density of each data point is then calculated by one of the formulae defined in the previous subsection. The threshold used to identify neighbors is computed based on the maximum distance and the input parameter q, and is given by

where
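Assuming the formula simply takes the fraction q of the largest pairwise distance (our reading of the sentence above; the paper's exact expression is not reproduced here), the threshold can be computed as:

```python
import numpy as np

def neighbor_radius(X, q):
    """Phi as the fraction q of the maximum pairwise distance
    (an assumed form of the threshold described in the text)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return q * D.max()
```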

The outline of the D-IMPACT algorithm

The next step is noise and outlier detection. An outlier is a data point significantly distant from the clusters. In this manuscript, we refer to data points which are close to clusters but do not belong to them as noisy points, or noise. Both of these data point types are usually located in sparsely scattered areas, that is, low-density regions. Hence, we can detect them based on density and the distance to clusters. We consider a data point noisy if its density is less than a threshold Th_{noise} and it has at least one neighbor which is noisy or a cluster-point (the latter defined as a data point whose density is larger than Th_{noise}). An outlier is a data point with density less than Th_{noise} that has no neighbor which is noisy or a cluster-point.
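The detection rule can be sketched as follows. As a simplification (an assumption on our part, since the rule in the text refers to the types of the neighbors), a low-density point with at least one neighbor is labelled noise and an isolated low-density point is labelled an outlier.

```python
import numpy as np

def classify_points(D, density, phi, th_noise):
    """Label points 'cluster', 'noise', or 'outlier' (simplified sketch):
    density below th_noise with at least one neighbor within phi -> noise;
    density below th_noise with no neighbor within phi -> outlier."""
    labels = np.array(['cluster'] * len(D), dtype=object)
    for i in np.where(density < th_noise)[0]:
        has_neighbor = any(D[i, j] <= phi for j in range(len(D)) if j != i)
        labels[i] = 'noise' if has_neighbor else 'outlier'
    return labels
```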

Both outliers and noisy points are output and then removed from the dataset. The effectiveness of this removal is shown in

In this phase, the data points are iteratively moved until the termination criterion is met. The distances and densities are calculated first, after which we compute the components used to determine the movement vectors: the attraction, the affinity vectors, and the inertia values. We then employ the movement method described in the previous section to move the data points. The movement shrinks the clusters, increasing their separation from one another. This process is repeated until the termination condition is satisfied. In D-IMPACT, we adopt the following termination criteria:

Termination after a fixed number of iterations controlled by a parameter n_{iter}.

Termination based on the average of the densities of all data points.

Termination when the magnitudes of movement vectors have significantly decreased from the previous iteration.
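Two of the three criteria above (the fixed iteration budget and the drop in movement magnitude) can be combined as in the sketch below; the 50% drop threshold is an assumed value, not taken from the paper.

```python
import numpy as np

def should_stop(iteration, n_iter, move_norms, prev_norms, drop_ratio=0.5):
    """Stop after n_iter iterations, or when the average movement
    magnitude falls below drop_ratio times the previous average
    (drop_ratio is an assumed threshold)."""
    if iteration >= n_iter:
        return True
    if prev_norms is not None and np.mean(move_norms) < drop_ratio * np.mean(prev_norms):
        return True
    return False
```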

When this phase is completed, the preprocessed dataset is output. The new dataset contains separated and shrunk clusters, with noise and outliers removed.

D-IMPACT is a computationally efficient algorithm. The cost of computing the m^{2} affinity vectors is

Illustration of noisy points and outliers

Illustration of the effect of noise removal in D-IMPACT

The complexity of the computation of the movement vectors is

We measured the real processing time of D-IMPACT on 10 synthetic datasets. For each dataset, the data points were randomly located (uniformly distributed). The sizes of the datasets varied from 1000 to 5000 samples. These datasets are included in the supplement to this paper. We compared D-IMPACT with CLUES using these datasets. D-IMPACT was employed with the parameter n_{iter} set to 5. For CLUES, the number of neighbors was set to 5% of the number of samples and the parameter itmax was set to 5. The experiments were executed using a workstation with a T6400 Core 2 Duo central processing unit running at 2.00 GHz with 4 GB of random access memory.

In this section, we compare the effectiveness of D-IMPACT and the shrinking function of CLUES (in short, CLUES) on different types of datasets.

To validate the effectiveness of D-IMPACT, we used different types of datasets: two-dimensional (2D) datasets taken from the Machine Learning Repository (UCI) [

The 2D datasets are DM130, t4.8k, t8.8k, MultiCL, and Planet. They contain clusters with different shapes, densities, and distributions, as well as noisy samples. The DM130 dataset has 130 data points: 100 points are generated randomly (uniformly distributed), and three clusters of ten data points each are then added to the top-left, top-right, and bottom-middle areas of the dataset (marked by red rectangles in

The practical datasets are more complex than the 2D datasets; for example, their high dimensionality can greatly reduce the usefulness of the distance function. We used the Wine, Iris, Water-treatment plant (WTP), and Lung cancer (LC) datasets from UCI, as well as the dataset GSE9712 from the Gene Expression Omnibus [

Processing times of D-IMPACT and CLUES on test datasets

Visualization of 2D datasets. a) DM130; b) MultiCL; c) t4.8k; d) t8.8k; e) Planet

Datasets used for experiments

Dataset | Size of dataset | Number of features | Number of clusters
---|---|---|---
DM130 | 130 | 2 | 3
MultiCL | 8026 | 2 | 143
t4.8k | 8000 | 2 | 8
t8.8k | 8000 | 2 | 8
Planet | 719 | 2 | 2
Iris | 150 | 4 | 3
Wine | 178 | 13 | 3
WTP | 527 | 38 | 13
LC | 32 | 56 | 3
GSE9712 | 12 | 22,283 | 4

The Iris dataset contains three classes (Iris Setosa, Iris Versicolor, and Iris Virginica), each with 50 samples. One class is linearly separable from the other two; the latter are not linearly separable from each other. The Wine dataset (178 samples, 13 attributes), which comprises the results of chemical analyses of wines grown in the same region of Italy but derived from three different cultivars, includes three overlapping clusters. The WTP dataset (527 samples, 38 attributes) records the daily measurements from sensors in an urban waste water-treatment plant. It is an imbalanced dataset; several clusters have only 1 - 4 members, corresponding to days with abnormal situations. The lung cancer (LC) dataset (32 samples, 56 attributes) describes three types of pathological lung cancers. Since the Wine, WTP, and LC datasets have attributes with different ranges, we scale them to avoid the domination of wide-range attributes. The last dataset, GSE9712, is a gene expression dataset containing the expression values of 22,283 genes from 12 radio-resistant and radio-sensitive tumors.

For a fair comparison, we employed CLUES as implemented in R [ and ran D-IMPACT with a default parameter set (aff_{1}, den_{1}, Th_{noise} = 0, n_{iter} = 2) with some modifications. The complete parameter sets are described in

The results of D-IMPACT and CLUES on 2D datasets DM130, MultiCL, t4.8k, t8.8k, and Planet are displayed and analyzed in this section.

Clusters in the dataset DM130 are difficult to recognize since they are neither dense nor well separated. Therefore, we set p to 4 and ran D-IMPACT for more iterations (n_{iter} = 3). The D-IMPACT algorithm shrinks the clusters correctly and retains the structure of the original dataset (

Visualization of the dataset DM130 preprocessed by D-IMPACT and CLUES. a) D-IMPACT; b) CLUES

Parameter sets of D-IMPACT for experiments

Dataset | Parameter set
---|---
DM130 | p = 4, n_{iter} = 3
MultiCL | den_{2}, aff_{2}
t4.8k | q = 0.03, Th_{noise} = 0.1, n_{iter} = 1
t8.8k | q = 0.03, Th_{noise} = 0.1, n_{iter} = 1
Planet | q = 0.05, p = 4, den_{2}, aff_{3}, n_{iter} = 4
Iris | n_{iter} = 5
Wine | 
WTP | Scale = true, aff_{2}
LC | Scale = true
GSE9712 | 

The shrinking process may merge clusters incorrectly since clusters in the dataset MultiCL are dense and closely located. Hence, we used the density function den_{2} and the affinity function aff_{2}, which emphasizes the density, to preserve the clusters. The result is shown in

For the two datasets t4.8k and t8.8k, D-IMPACT and CLUES are expected to remove noise and shrink clusters. We set q = 0.03 and Th_{noise} = 0.1 to carefully detect noise and outliers. The results of D-IMPACT are shown in

To separate adjacent clusters in the dataset Planet, we used the function aff_{3}, which considers the density difference. The parameter q is set to 0.05, since the data points are located near each other. We used den_{2} and p = 4 to emphasize the distance and density. The results are shown in

To avoid the domination of wide-range features, we scaled several datasets (Scale = true). In the case of Wine, we had to modify the inertia value and use p = 4 to emphasize the importance of the nearest neighbors. We used HAC to cluster the original and preprocessed Iris and Wine datasets, and then validated the clustering results with the adjusted Rand Index (aRI); a higher aRI score indicates a better clustering result. The Iris dataset was also preprocessed using a PCA-based denoising technique. However, the distance matrices before and after applying PCA are

Visualization of the dataset MultiCL preprocessed by D-IMPACT and CLUES. a) D-IMPACT; b) CLUES

Visualization of two datasets t4.8k and t8.8k preprocessed by D-IMPACT. a) t4.8k; b) t8.8k

Visualization of the dataset t4.8k preprocessed by CLUES using different values of k based on the size of the dataset. a) k = 80 (1%); b) k = 160 (2%)

Visualization of the dataset Planet preprocessed by D-IMPACT and CLUES. a) Preprocessed by D-IMPACT. Two clusters are separated; b) Preprocessed by CLUES; c) Clustering result using HAC on the dataset in b), indicating that CLUES shrinks clusters incorrectly

nearly the same (using 2, 3, or 4 principal components (PCs)). Therefore, clustering the PCA-preprocessed dataset with HAC gives at best the same result as clustering the original dataset, depending on the number of PCs used (aRI scores ranged from 0.566 to 0.759).

We also performed k-means clustering [

Index scores of clustering results using HAC on the original and preprocessed datasets of Iris and Wine. The best scores are in bold

Dataset | No preprocessing | CLUES | D-IMPACT
---|---|---|---
Iris | 0.759 | 0.732 | **0.835**
Wine | 0.810 | **0.899** | 0.884
GSE9712 | **0.330** | 0.139 | **0.330**

Index scores of clustering results using k-means on the original and preprocessed datasets of Iris and Wine. The best scores are in bold

Dataset | No preprocessing | CLUES | D-IMPACT
---|---|---|---
Iris | 0.730 (0.682) | **0.757** (0.677) | **0.757** (0.686)
Wine | 0.899 (0.859) | **0.915** (0.814) | 0.899 (0.852)
GSE9712 | **0.403** (0.212) | 0.139 (0.224) | **0.403** (0.329)
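The evaluation protocol used above, clustering and then scoring against the known labels with the adjusted Rand Index, can be reproduced with scikit-learn (not used in the paper; shown only to illustrate the metric) on the original Iris dataset:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

# Cluster the unpreprocessed Iris dataset with HAC and score with aRI.
X, y = load_iris(return_X_y=True)
pred = AgglomerativeClustering(n_clusters=3, linkage='average').fit_predict(X)
ari = adjusted_rand_score(y, pred)  # 1.0 = perfect agreement, ~0 = random labeling
print(round(ari, 3))
```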

To clearly show the effectiveness of the two algorithms, we visualized the Iris and Wine datasets preprocessed by D-IMPACT and CLUES as shown in

To validate outlier separability, we tested CLUES and D-IMPACT on the WTP and LC datasets. The WTP dataset has small clusters (1 - 4 samples each). Using aff_{2}, we can reduce the effect of the affinity on these minor clusters. We show the dendrograms of the HAC clustering results (using single linkage) on the original and preprocessed WTP datasets in

The lung cancer (LC) dataset was used by R. Visakh and B. Lakshmipathi to validate the outlier detection ability of a constraint-based cluster ensemble algorithm using spectral clustering, called CCE [

Visualization of the Iris dataset before and after preprocessing by D-IMPACT. Visualization of the original dataset is shown in the bottom-left triangle. Visualization of the dataset optimized by D-IMPACT is shown in the top-right triangle

Visualization of the first four features of the Wine dataset before and after preprocessing by D-IMPACT. Visualization of the original dataset is shown in the bottom-left triangle. Visualization of the dataset preprocessed by D-IMPACT is shown in the top-right triangle

We identified outliers based on the dendrogram. These results were then compared with the reported result of CCE by calculating accuracy and precision values. The results in

In this study, we proposed a data preprocessing algorithm named D-IMPACT inspired by the IMPACT clustering algorithm. D-IMPACT moves data points based on attraction and density to create a new dataset where noisy points and outliers are removed, and clusters are separated. The experimental results with different types of datasets clearly demonstrated the effectiveness of D-IMPACT. The clustering algorithm employed on the datasets preprocessed by D-IMPACT detected clusters and outliers more accurately.

Although D-IMPACT is effective in the detection of noise and outliers, some difficulties remain. In the case of sparse datasets (e.g., microarray data and text data), the density-based approach to noise detection often fails, since most of the data points, including noise and outlier points, have a density equal to 1 under our definition. In addition, the distances between data points do not differ much due to the curse of dimensionality. To overcome this problem, we consider the attraction between two data points: the attraction of a noise or outlier point is usually small since it is far from other data points. Combining density and attraction information may therefore allow these data point types to be detected.

Visualization of the Iris and Wine datasets preprocessed by CLUES. a) Iris; b) Wine

Accuracy and precision values of noise and outlier detection on the lung cancer dataset

Preprocessing algorithm | Linkage | Accuracy | Precision
---|---|---|---
None | Single | 0.718 | 0.5
None | Average | 0.343 | 0.556
None | Complete | 0.125 | 0.222
D-IMPACT | Single | 0.781 | 0.667
D-IMPACT | Average | **0.812** | **1**
D-IMPACT | Complete | **0.812** | **1**
CCE | N/A | 0.75 | 0.6

Dendrograms of the clustering results on the WTP dataset. a) Dendrogram of the original water-treatment dataset; b) Dendrogram of the water-treatment dataset after being preprocessed by D-IMPACT; c) Dendrogram of the water-treatment dataset after being preprocessed by CLUES

The algorithm D-IMPACT is implemented in C++. For readers who are interested in this work, the implementation and datasets are downloadable at [

In this research, the super-computing resource was provided by Human Genome Center, the Institute of Medical Science, The University of Tokyo. Additional computation time was provided by the super computer system in Research Organization of Information and Systems (ROIS), National Institute of Genetics (NIG).