Journal of Biosciences and Medicines, 2014, 2, 1218 Published Online June 2014 in SciRes. http://www.scirp.org/journal/jbm http://dx.doi.org/10.4236/jbm.2014.24003 How to cite this paper: Chen, K.K., et al. (2014) Data Classification with Modified Density Weighted Distance Measure for Diffusion Maps. Journal of Biosciences and Medicines, 2, 1218. http://dx.doi.org/10.4236/jbm.2014.24003 Data Classification with Modified Density Weighted Distance Measure for Diffusion Maps KoKung Chen1, ChihI Hung1, BingWen Soong2, HsiuMei Wu3, YuTe Wu1,4*, PoShan Wang5* 1Department of Biomedical Imaging and Radiological Sciences, National YangMing University, Taip ei, Taiwan 2Department of Radiology, Taipei Veteran General Hospital, Taipei, Taiwan 3Department of Neurology, Taipei Veteran General Hospital, Taipei, Taiwan 4Institute of Biophotonics, National YangMing University, Taipei, Taiwan 5The Neurological Institute, Taipei Municipal GanDau Hospital, New Taipei, Taiwan Email: *ytwu@ym.edu.tw, *b8001071@yahoo.com.tw Received February 2014 Abstract Clinical data analysis is of fundamental importance, as classifications and detailed characteriza tions of diseases help physicians decide suitable management for patients, individually. In our study, we adopt diffusion maps to embed the data into corresponding lower dimensional repre sentation, which integrate the information of potentially nonlinear progressions of the diseases. To deal with nonuniformaity of the data, we also consider an alternative distance measure based on the estimated local density. Performance of this modification is assessed using artificially gen erated data. Another clinical dataset that comprises metabolite concentrations measured with magnetic resonance spectroscopy was also classified. The algorithm shows imp roved resu lts c om pared with conventional Euclidean distance measure. Keywords Diffusion M ap s, Density Estimation, Spinocerebellar Ataxia 1. Introduction Exploring the behavior and patterns of clinical data is crucial, and is mostly done with statistics or linear analy sis. But several factors may make these approaches incapable of dealing the data effectively and efficiently. Sometimes the higher dimensional data may actually lie on a lower dimensional space. Generally, classical sta tistical methods and linear analysis may not provide insightful information on such kind of data. A new ap proach, namely diffusion maps [1], had been proposed to deal with high dimensional data. Based on the stochas *
K.K. Chen et al. tic process on the spectral graph theory, diffusion maps is among the most powerful spectral dimensionality re duction tool to locate intrinsic lower dimensional coordinates of a given multidimensional dataset [2]. Diffusion Maps has been applied to diverse applications [3][7]. The Diffusion Maps uses a distance measure that preserves local information of a given dataset. The distance between a pair of data points is short providing there exist some paths connecting them; that is, the affinity for this pair of points is high. This characteristic is likely in clinical data analysis since the distribution patterns of the patients do not always behave in a linear sense, and the progression of diseases and symptoms mimics the concept of local connectivity providing the data exhibits certain longitudinal behavior. In addition to capable tracking down the nonlinear structure, diffusion maps also reduce the dimensionality of the data, which simpli fies characterization and differentiation of different groups. However, the irregular distribution of the data, along with relative small population size and other factors discussed above, complicate the discerning of the data. Un der these conditions, diffusion maps with a naïve distance measure no longer suit for such task. In this study, we use a selftuning kernel, which is coupled with a density estimator, to adjust the bias intro duced by the underlying distribution of the data. The main advantage of this approach lies in its ability to deal with false and missing cluster induced by nonuniform density. An artificially generated data using Gaussian dis tribution is given in later section to illustrate this phenomenon. Another clinical dataset is also analyzed with the same method, compr ising spinocerebellar ataxia type 3 (SCA3) patients, multiple system atrophy (MSA) pa tients, and normal subjects. 2. Materials and Met ho d s 2.1. Diffusion Maps For a given measure (Χ, μ), a dataset X consists of N samples with underlying distribution μ being included. The data points may be characterized as xi = (yi1, yi2, …, yil, …, yim), i = 1, 2, …, N, l = 1, 2, …, m, A kernel k: X × X → R is defined to measure the pairwise similarities between every pair of data points. The kernel function is nonnegative, and it defines certain notion of connectivity between data points pairwisely. Since the design of the kernel will influence the geometry captured by diffusion maps, the choice of the kernel should be guided by the characteristics of the data or prior knowledge that one bears in mind. A popular choice for distance measure is the Gaussian kernel: () 2 ,2 exp ij ij xx k xx σ −− ≡ (1) Once the choice of kernel is determined, the mass can be defined as: ( ) ()( ) , 0 i Xijj mxk xxdx µ ≡∫ > (2) Then a weighting function can be built by normalizing the kernel using mass: ( )( ) ( ) , , ij ij i kxx pxx mx ≡ (3) Since the weighting function satisfies ∫X p(xi, xj)dμ(xj) = 1, the constructed graph can be viewed as an asymme tric Markov chain built over the data, where the p(xi, xj) is interpreted as the probability for state xi transits to state xj in a single time step. A square matrix P whose elements are p(xi, xj) is then constructed. Taking powers of P, which is equivalent to drive the Markov chain forward, will reveal corresponding intrinsic geometry of the data. If one allow the Markov chain running unceasingly, all the data points will be merged together and re garded as a single cluster. As long as the matrix P is nonsingular, it can be written in quadratic form: where ν is the discrete set of eigenfunctions {ν(i): i = 1, 2, …, N} with corresponding eigenvalues {(λ(i))t: i = 1, 2, …, N}. The sequence of eigenvalues has the property such that . Since the
K.K. Chen et al. sequence of eigenvalues tends to zero, a few largest eigenvalues and their corresponding eigenfunctions can be used to approximate the P with minimal truncation error. The diffusion maps is then defined as: ( ) ( ) ( ) ( ) ( ) ( ) { } j , 1,2,, t tj ii xxj N λν Ψ≡ =… (4) The dimension of the new embedding depends on only the powers of the P, not that of the original space. 2.2. Weighted Kernel Based on Density Estimation As mentioned in previous section, the choice of the kernel should be guided by application itself. The design of the kernel influences the resulting embedding due to the fact that the structure of the constructed Markov chain is altered. Even if the kernel is drawn from one of the known parametric family of distributions, tweaking its parameters may yield quite distinct results, this is especially true if the underlying distribution function of the data is irregular. Different setting of parameters of a global measure, taking the Gaussian kernel for example, leads to embeddings that differ from one another. If the scaling parameter σ is too small, the resulting graph would look like a series of mutually disconnected islands; however, setting σ too large would glue all data points altogether, leaving only a single cluster in the diffusion embedding. While a global setting captures the intrinsic geometry of the data, it would not be able to effectively address the nonuniformity of the intrinsic density distribution. When treating data comprises different groups, it is de sirable that the kernel can be selftuning based on the local statistics of the data points. In our case, in order to compensate the bias and skewness introduced by the distribution of the data, we consider the local density of the data points. Density plays an important role in statistics; it conveys the distribution pattern to be drawn from the data. There’s a vast literature focus on density estimation [6]. In our case, consider any point xi in the original data, let the set ( ) { } :,1, , i jiji xx xxjn ξε ≡−< = being its neighbor. Providing one assume that the data is drawn from the normal distribution N(u,τ2) and the variance of all dimensions are the same, then the ratio that the local variance of data points in the set ξ(xi) to the global variance τ2 should depend on the size of the local set, that is, ni. One may further assume that if this ratio of associated with ξ(xi) surpasses certain predefined value, then data points in the ξ(xi) are actually discernable. Since the density of ξ(xi) is proportional to its sample size, we can use ni as a density estimator to formulate the selftuning kernel. Alternatively, one can use the following integral to estimate the local density as delineated by Silverman [6]: ( ) ( ) ( ) ,id ij xi dxkxxdx ξ ≡∫ (5) where kd is a Gaussian function that takes the same form as (1), except for the fact that σ is replaced by σd (In our case, we assume σd to be the ε). The purpose of kd is to measure the contribution of xj to the local density of the xi. For normal distribution, it can be shown that estimated d(xi) will converge to ni as the sample size N is sufficiently large. Then for every pair estimated densities of samples, the lower one will be incorporated into the Euclidean distance measure to form a weighted version: ( ) ( ) ( ) ( ) { } 2 2 ,min,, 1,, i jijiljl xi Dw xxdxdxyydxlm ξ ≡∫−= … (6) In the same manner as (1), Dw2(xi, xj) is then used as the distance measure of the selftuning kernel: ( )( ) 2 ,2 , exp ij ij Dwx x axx τ − ≡ (7) This approach is more flexible, and can reduce the possibility that samples with different characteristics being identified as the same due to nonuniform density of the data. Furthermore, the method is nonparametric; this feature is desirable since no additional prior or background knowledge is required for clients to obtain meaning ful results. The computation procedure of the aforementioned approach is listed in the following flowchart:
K.K. Chen et al. A result using artificially generated data is given in Figure 1. We assume that the local density of the gray zone is generally higher than a clean group, since it is a mixture of subjects from different clusters; also, if there are multiple clusters with varied density appear at once, it is unlikely that a naïve kernel would be able to treat the data properly. Each dataset is randomly divided into training set and testing set. The training set is fed to train the support vector machine (SVM) first, and then evaluating the performance of SVM with the testing set. The results show that the overall classification ratio has been improved using the modified method. An important characteristic of such spectral clustering techniques is that they are feasible only if the different groups can be separated in the lower dimensional representation [8]. This issue arises in our study of the first clinical dataset, where the classification accuracy of the MSA groups in the diffusion embedding actually de creases in comparison to that using original data. 3. Experimental Re s ults The clinical dataset compris e relative metabolite concentrations measured using magnetic resonance spectros copy (MRS). The MRS is carried out on left and right cerebellum, left and right basal ganglia, and vermis. The relative concentration of three different metabolites, namely Nacetylaspartate (NAA), Choline (Cho), and myo inositol (mI), are measured at all five anatomies; these three concentrations have been normalized using the concentration of creatine (NAA/Cr, Cho/Cr, mI/Cr). The SCA dataset consists of three different groups, namely the 63 SCA3 patients, 98 MSA patients, and 44 normal subjects. While the SCA and MSA share similar clinical symptoms, the MRS has been shown to be a potential modality to differentiate SCA and MSA, particularly mul tiple system atrophycerebellar type (MSAC) [9]. While the MSA patients are readily separable from the other two groups, classifying SCA3 patients and normal subjects is more difficult, as shown in Figure 2(A). Using diffusion maps, we obtain an embedding that separate the original data into three clusters, but all of them are mixture of different groups. This suggests a naïve distance measure is unsuitable. Density estimation is first performed on the original data (Figure 2(B) and Figure 2(C)), then incorporated into the distance measure as selftuning factor. Figure 3 shows embeddings obtained with simple kernel and density based kernel, respec tively. The classification is carried out on four di fferent setting, namely original data, principal component based representation, diffusion embedding, and diffusion embedding with density based kernel built in, respectively. The SVM is performed thirty times for each setting. The classification accuracy is listed in the Table 1. 4. Discussion While the overall classification performance is ameliorated in general, particularly in the case of normal subjects, the accuracy of the MSA group actually drops. The issue may be caused by a variety of factors, and we suspect the most likely source being the overly estimated density, namely d(xi). We illustrate this by considering the
K.K. Chen et al. Figure 1. Demonstration of two different approaches using artificially generated data. The setup for this data constitutes three Gaussian distribution: N((1, 0, 0),1), N((−1, 0, 0),1), and N((0, 0, 0), 2). (A) The original data displayed in 3D. The red group disperses through both the green one and the blue one. One can consider the higher density clusters as typical representative of certain disease, or the gray zone induced by overlapping interval between different groups; (B) The dif fusion coordinate computed using Euclidean distance measure. The red group is incorrectly regarded as the background of the data due to its lower density; (C) Diffusion coordinate obtained with selftuning kernel. The dispersive red cluster is glued altogether and can be easily identified (D) 10 samples randomly generated from normal distribution, originally the distances between every local pair differ a lot, but the differences have been toned down after the dfensity weighted dis tance measure is used. Figure 2. The Intermediate result of density estimation performed on the clinical dataset. (A) The original data in 3D view. The dimensions being chosen as coordinates are: NAA/Cr in both left and right cerebellum, and Cho/Cr in left cerebellum; (B) The contour of bivariate density estimation using NAA/Cr in both left and right cerebellum; (C) The corresponding 3D contour of the estimated density. It can be inferred that not only SCA3 has lower density, but the gray zone induced by both normal subjects and SCA3 group complicate the situation further. Table 1. Classification accuracy using SVM of SCA dataset. SVM Classification Accuracy Original Space PCA Based Naïve Kernel Modified Kernel SCA3 64%  73% 65%  74% 79%  88% 86%  93% MSA 81%  93% 46%  50% 71%  82% 73%  82% Normal 62%  73% 74%  79% 50%  63% 79%  88%
K.K. Chen et al. Figure 3. Diffusion coordinate of SCA dataset obtained with different approaches. (A) Embedding computed with the naïve distance measure. The data roughly is divided into three distinct clusters, all of which are mixture of different groups; (B) Result ob tained using selftuning kernel. The false clusters are now patched altogether. embedding of simulated data in Figure 1(C). It is evident that the perturbations within green and blue groups are both magnified as integrating the red group. Based on this observation, we conjecture that ineffective estimation of local density, which may potentially destroys originally compact and dense structure, can be hazardous; hence correct density estimation is crucial to the efficacy of the selftuning kernel. This issue can be formerly charac terized as correct estimation of certain intrinsic parameters of the distribution function given the data. 5. Conclusion Based on the clustering properties of the diffusion maps, we analyze the clinical data in a lower dimensional space induced by distance measure of the diffusion maps. To adjust the nonuniformity introduced by the under lying distribution of the data, we estimate the local density of the data, and use it as selftuning factor of the dis tance measure. This approach shows satisfactory results on both artificial data and metabolite concentrations obtained with MRS. The results also shows that distance measure with scaling factor based on variance of local mean generally is more capable than a naïve kernel. Acknowledgements This study is supported by National Science Council, Taiwan (NSC 1012221E010004MY2 & NSC 101 2314B733001MY2) and by the project of Center for Dynamical Biomarkers and Translational Medicine, Na tional Central University, Taiwan (NSC 1022911I008001). References [1] Coifman, R.R. and Lafon, S. (2006) Diffusion Maps. Applied and Computational Harmonic Analysis, 21, 530. http://dx.doi.org/10.1016/j.acha.2006.04.006 [2] Si nge r, A. and Coifman, R.R. (2008) No nLinear Independent Component Analysis with Diffusion Maps. Applied and Computational Harmonic Analysis, 25, 226239 . http://dx.doi.org/10.1016/j.acha.2007.11.001 [3] Lafon, S., Keller, Y. and Coifman, R.R. (20 06) Data Fusion and Multicue Data Matching by Diffusion Maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28, 17841797. http://dx.doi.org/10.1109/TPAMI.2006.223 [4] Nadl er, B., Lafon, S., Coifman, R.R. and Kevrekidis, I.G. (2005) Diffusion M aps, Spectral Clustering and Eigenfunc tions of FokkerPlank Operator s. Neural Information Process. System 18, MIT Press, 955962. [5] Si nge r, A., Shkolnisky, Y. and Nadler, B. (2009) Diffusion Interpretati on of Nonlocal Neighborhood Filters for Signal Denosing. SIAM Journal Imaging Science, 2, 118139. http://dx.doi.org/10.1137/070712146 [6] Silverman, B.W. (19 86) Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability, Vol. 26, Chapman and Hall, London. [7] Etyngier, P., S’egon n e and Kwriven, R. (20 07) Shape Prior Using Manifold Learning Techniques. Proceedings of the IEEE International Conference on Computer Vision, 15, 132141. [8] Chandola, V., Banerjee, A. an d Kumar, V. (20 09) Anomaly Detection: A Survey. ACM Computing Surveys, 41, 158.
K.K. Chen et al. [9] Liring, J.F., Wang, P.S., Chen, H.C., S oon g, B.W. and Guo, W.Y. Differences between Spinocerebellar Ataxias and Multiple System AtrophyCerebellar Type on Proton Magnetic Resonance Spectroscopy. PLoS ONE, 7, e47925. http://dx.doi.org/10.1371/journal.pone.0047925
