Engineering, 2013, 5, 7883 doi:10.4236/eng.2013.55B016 Published Online May 2013 (http://www.scirp.org/journal/eng) MetasampleBased Robust Sparse Representation for Tumor Classification Bin Gan1, ChunHou Zheng1,2, JinXing Liu1 1College of Information technology and communication, Qufu Normal University, Rizhao, China 2College of Electrical Engineering and Automation, Anhui University, Hefei, China Email: ganbinganxinhai@126.com Received 2013 ABSTRACT In this paper, based on sparse representation classification and robust thought, we propose a new classifier, named MRSRC (Metasample Based Robust Sparse Representation Classificatier), for DNA microarray data classification. Firstly, we extract Metasample from trainning sample. Secondly, a weighted matrix W is added to solve an l1regular ized least square problem. Finally, the testing sample is classified according to the sparsity coefficient vector of it. The experimental results on the DNA microarray data classificati on pr o ve that the pr oposed algorithm is efficient. Keywords: DNA Microarray Data; Sp arse Representation Classificatio n; MRSRC; Robust 1. Introduction A tumor is a neoplasm from an abnormal growth of cells. An accurate, effective and prompt treatment of tumor is necessary for patient. But before treatment, how to clas sify the tumor is a more important mission because tu mors have many types. If you make a wrong analysis, your treatment will become another killer. DNA microarray is a biotechnology that simultane ously monitors the expression of tens of thousands genes in cells. There are many methods have been used in tu mor classification through microarray gene expression profiling like independent component analysis (ICA), nonnegative matrix factorization (NMF), i.e. Since the biological data is too large in scale and too complicated in profiling, which not only bring a great difficulty to save, search, process or analyze these data, but also bring an big challenge to data mining technology. A new effi cient data mining technology is necessary for improving the accuracy of tumor classification. Sparse representation has been successfully used in image processing applications [1], DNA microarray data classification [2], and Text classification. Intuitively, the sparsity of the coding coefficient vector for samples can be measured by the 0norm or norm minimization (norm minimization is the closest convex function to 0norm minimization) of it. The norm minimization is widely applied in sparse coding. Generally, the sparse coding problem can be formulated as l1 l 1 l 1 l l 1 min subject to 2 2 yD (1) where y is a given signal, e.g. the gene expression profile of a sample. D is the dictionary of coding atoms, is the coding vector of y over D and > 0 is a constant. The processing means of this function for gene ex pression data can be interpreted as following: By coding DNA microarray data y as a sparse linear combination of the training samples D via the 0 lnorm or norm minimization(here we used norm minimization) in above function, SRC(Sparse Representation Classifica tion) classifies y by evaluating which class of training samples could result in the minimal reconstruction error of it with the associated coding coefficients. 1 l 1 l But there are two important issues in this model. The first one is that whether 1 is well enough to represent the signal sparsely. there are many works having been done for it. For example, adding a nonnegative constraint to [3]; introducing a Laplacian term of coefficient in sparse coding [4]; Designing the sparsity regularization terms by using the Bayesian methods [5] and using the weighted 2norm in the sparsity constraint [6]. So, the first issue has been solved well. The Second one is that whether the 2 lnorm term l 2 2 yD is effective enough to represent the sign al fidelity when y is noisy or has many outliers. By now there are only in [7,8] the norm was used to keep the coding fidelity of 1 l 1 yD . Since we assume that the testing signal y can be represented by the training sample D. But in prac tice this assumption may not hold well as the noisy or outliers. So the 2 l or norm may not be robust enough in DNA microarray data classification. 1 l To improve the robustness and effectiveness of sparse Copyright © 2013 SciRes. ENG
B. GAN ET AL. 79 representation based classification. In this paper, we propose a MetasampleBased Robust Sparse Representa tion Classifier (MRSRC) for DNA microarray data clas sification using Sparse Coding and Robust theory. A weighted matrix W is added in regularized least square problem to reduce noisy or outliers. 1 l The rest of this paper is organized as follows. Section 2 presents the proposed MRSRC model and algorithm. The metasample model of gene expression data are first presented, then MRSRC is given. Section 3 presents the numerical experiments. Section 4 concludes the paper. 2. MRSRC 2.1. Classification Based On Sparse Representation Sparse representation (SR) is a new effective method based on norm minimization of data processing. By using SR technique, we can represent a new sample as a linear combination of training sample set with sparse coefficient, and classify new sample based on the sparse coefficient vector. Sparse Representation Classification (SRC) has been used in face recognition, tumor classifi cation [9] and performed well. The core idea of the SRC is that a test sample can be well expressed by only using the same class training sample. And in this condition, there are only a few nonzero elements in the coefficient vector of SR. In SR, we use norm sparsitycon strained least square estimation to get coefficient vector. SRC is different from traditional supervised learning algorithm in that SRC hasn’t strict training and test proc ess. So it has no over learning problem. 1 l 1 l However, using original training sample as dictionary directly can not express a new test sample well enough sometimes. At the same time, if there are too much training samples, the speed of algorithm will slow down. To solve this problem, we first extract metasamples from each class training samples, then using them to construct the dictionary. The detailed processes are listed in the following Section. 2.2. Maint aining the In tegrity of the Specific ation A typical characteristic of DNA microarray data is that gene amount is much more than the number of samples. Generally, the number of samples is about hundreds, but there are thousands of genes in each sample. Which makes that many classic classification methods can’t be used in DNA microarray data analysis. Fortunately, methods of choosing related features or extract new fea tures can solve well this issue. So far, there are many documents have studied how to use genes selection to classify the tumor samples, like [1012], and how to ex tract new features, e.g., metasamples [16]. Alter et al. [13] used SVD to transform the gene expression data from the “genes samples” space to diagonalized “eigengenes eigenarrays” space, where the eigengenes are unique orthonormal superpositio n of the genes. Brunet et al. [14] used NMF to describe the gene expression data through a fe w of metasamples. (Figure 1) Metasamples of gene expression data is defined as a linear combination of several samples. We factorized the gene expression data set matrix A into two matrices ~ WH , (2) where matrix A is of size . In matrix A, the col umn represents the expression level of genes in a sample. Each row means the expression level of one gene through all samples. Matrix H is of size. Matrix W is of size nm km nk . From the above analysis, it can be seen that there are many methods to extract the metasamples, such as SVD, PCA [13], ICA [15], NMF [14], etc. In consideration of algorithm’s simple and fast features, in this paper we use SVD to extract the metasamples. We extract metasamples from the samples in class i and denoted them as i, then put them together, which constructed the new dictionary. W D = 12 ,,, k WW W , (3) 2.3. The RSRC (Robust Sparse Representation Coding) model The sparse coding model in Eq. (1) is equivalent to the LASSO problem [22]: 2 2 min yD subject to 1 , (4) where > 0 is a constant, y = 12 ;;; n yy y n R is the signal to be coded, D = 12 dd,,, m d nm R is the dictionary with column vector d being the atom, and th j is the coding coefficient vector. Here the resid ual e =yD follows Gaussian distribution. We can see that a sparsityconstrained least square es timation problem is necessary to the sparse coding prob lem in Eq. (4). If e follows Laplacian distribution, this solution will be 1 min yD subject to 1 , (5) Figure 1. The metasample model of gene expression data. Copyright © 2013 SciRes. ENG
B. GAN ET AL. 80 However, the residual e may be far from Gaussian or Laplacian distribution because of noisy or outliers. So the conventional sparse coding models in Eq. (4) and Eq. (5) may not be robust and effective enough for DNA mi croarray data classification. In order to build a more ro bust model for DNA microarray data classification, we rewrite D as D= 12 ;;; n rr r , where row vector is the row of D, and set e = i r th iyD = 12 ;;; n ee e . Then we get iii eyr . And i e are distributed according to some probability density function () i e With consideration of the sparsity constraint of . , the RSRC can be formulated as [23]: 1 min( ) n ii iyr subject to 1 , (6) where () i e = ln() i e and it has the following properties: (0) is the global minimal of () i e ; () i e = () i e ; () i e <() j e if i e>j e, and we let (0) = 0. From Eq. (6), we can see that the propose RSRC model is a more general sparse coding model. Eq. (4) and Eq. (5) are special cases of it when it follows Gaussian and Laplacian distributions. Now by solving Eq. (6), we can get the coding coeffi cient vector . But one key problem is how to determine the distribution of . From above analysis we can see that taking as Gaussian or Laplacian distribution directly is not effective or robust enough. So we do not determine directly to solve Eq. (6). Insteadly, we transform Eq.(6) into an iteratively reweighted sparse coding problem[23]: 2 1/2 2 min ()WyD subject to 1 , (7) where W is a diagonal matrix: ' ,0, 0, () ()/ ijii i Wwe ee 0, , (8) Eq. (7) is a weighted LASSO problem. Because W needs to be estimated by using Eq. (8). Eq. (7) is a local approximation of the RSRC in Eq. (6) at , and W should be updated using the residuals in previous itera tion via Eq. (8). Using Eq. (7) the determination of dis tribution 0 e is transformed into the determination of W. As the logistic function has properties similar to the hinge loss function in SVM, the weight function can take 22 () exp()/(1exp()) ii weu ueu ue i , (9) where u and are positive scalars. u controls the de creasing rate from 1 to 0, and controls the location of demarcation point. Through Eq. (9), Eq. (8) and (0) = 0, we get 2 1 ()(ln(1 exp()) ln(1exp())) 2 ii euue u , (10) The sparse coding models in Eqs. (4) and (5) can be interpreted by Eq. (7). When = 2, we will get the model in Eq. (4). If we let i = () i we ()we 1/ i e, we can get the model in Eq. (5). Eq. (7) has the following advantage: outliers will be assig ned with low weights to reduce th eir affects. The weighted function of Eq. (9) is bounded in [0,1]. According to Eq. (7, 8, 9, 10), the detailed algorithm to get is following as: Start from t = 1: 1) Compute resisual . ( is first ini tialed in Eq.(11)) () () rec y t eyt()t rec y 2) Estimate weights as () ()() () ( ( tt t tt () 2 () ())() 2 exp() ) () 1exp(() ) t ti it t i e we e , (14) where ()t and ()t are parameters estimated in the iteration. th t 3) Sparse coding: 2 *()1/2 2 min ()() t WyD subject to 1   where is the estimated diagonal weight matrix with . ()t W()tt() ()Wwe ,ii i 4) Update the sparse coding coefficients: If t = 1, () *t ; If t >1, ; ()(1tt)()*(1) () t t ()t where 0 < < 1 is the step size. ()t can be searched from 1 to 0 by the standard linesearch process. 5) Compute the reconstructed test sample: ()t rec y = D()t , and let t = t + 1. 6) Return step 1 until convergencing, or reached the maximal number of iterations. Iterations finished and output sparsity coefficient vec tor . 2.4. Algorithm of MRSRC When we get a testing sample y, in order to initialize the weight, we set e as ini eyy y . In this paper, we com pute as ini y ini D m (11) where m is the mean of all training samples. At this algorithm, W will change as e in Eq. (8) at every iteration. We stop the iteration if the following condition satisfying : ()( 1)( 1) 22 / tt t WW W < , (12) where is a small positive scalar, t is amount of itera tion. After the iteration, we get the coefficient , then clas sify y using the following function: 2 min( )() i iryyDi (13) where = ˆi y() i D is a rebuilt testing sample by the class metasamples. Then we can classify y according to th i Copyright © 2013 SciRes. ENG
B. GAN ET AL. 81 the difference between y and . For example, if the difference between y and is smallest, y is classified to the class. ˆi y 3 ˆ y 3th The classification algorithm is summarized as follows: Input: matrix of training samples A = [12 ,,, k AA n] for k classes; testing sample y mn R and initialized as. (1) rec y ini y Step 1: mormalize the columns of A to have unit norm. 2 l Step 2: Extract the metasamples of every class using SVD and get D. Step 3: calculate the sparse coefficient using Eq. (7). Step 4: Compute the difference: 2 () ( ii D)ry y ( i r y rgm Output: i, i.e., The informa tion of which class y belongs to. () ain)identity y From the algorithm it can be seen that MRSRC is the combination of RSC and metasample based cluster. In MRSRC, the complexity of algorithm depends on the number of iterations t, which depends on the percentage of outliers in the DNA microarray data set. Generally, the number of iteration takes 2, unless the percentage of out liers is too big. At that instance, t should be taken about 10 to ensure the algorithm to reach convergence. 3. Experimental Results In this section, experiments were performed to show the efficiency of the proposed method. 3.1. Parameter Selection In the weight function Eq. (9), there are two parameters, i.e., and . is the parameter of demarcation point. We compute the value of as follows. Denote by22 ) ,2 12 [(),(,]ee e( ) n . We sort ele ments in an ascending order, then we can get a new array . Letkn , where scalar 0,1 , and n outputs the largest integer smaller than n . We set as k (15) controls the decreasing rate of weight value from 1 to 0. We set/c , where c is a constant. According to lots of experiments, we usually set c as 8, and set as 0.8 in our experiments. 3.2. TwoClass Classification In this subsection, we use three microarray data set to study the tumor classification, they are Colon cacer data set, Prostate cancer data set and DLBCL data set. The data set datails are listed in Table 1. All these three data sets have two class samlples. Colon data set has 2000 genes, Prostate data set has 12600 genes and DLBCL data set has 54 69 genes. We used the proposed to classify these dataset. Fro comparison, we also used other three methods, i.e., SVM, LASSO and SRC, to classify these experimental datasets. The classification results are listed in Table 2. In our method, SVD is used to extract the metasamples of gene expression data. Here, we choose 3 dimensions' meta samples when we extract these two class samples. And as the samples are not big enough, we use the nested strati fied 10fold cross validation to get a more accurate re sult. From Table 2 we can see that, MRSRC have a good classification performance in Colon cancer datasets and DLBCL datasets. But in Prostate cancer datasets, even MRSRC is not better than SRC, but it has an advantage over SVM and LASSO. SRC is best in Prostate datasets, but as not well as MRSRC in the other two datasets. In all, in this TwoClass Classification experiment, MRSRC has well results. To better illustrate results, we show the accuracies of our methods MRSRC in Figure 2 when t = 10. The nested stratified 10fold cross validation is also be used. Table 1. Three type of tumors for DNA classification ex periments. samples Datasets Class 1 Class 2 Genes Colon cancer 40 22 2000 Prostate cancer 77 59 12600 DLBCL 58 19 5469 Table 2. The classification accuracies by different methods. Datasets SVM LASSO SRC MRSRCSVD Colon cancer 85.48 85.48 85.48 90.32 Prostate cancer 91.18 91.91 94.85 93.38 DLBCL 96.10 96.10 97.40 98.70 Figure 2. The classification accuracy with t = 10. Copyright © 2013 SciRes. ENG
B. GAN ET AL. 82 The xaxis represents the kdimension. The yaxis repre sents the accuracy of the classification. From Figure 2, we can see that when the dimension of metasample is 3, the best classification accuracy can be reached. This re sult fully shows the ad vantage of metasample in redu cing calculation complexity. 3.3. Multiclass Classification To further investigate the performance of our method, we also used five multiclass data sets to do experiment. All the five data sets were produced by oligonucleotide mi croarrays. They are the Lung cancer data set [17], which contains 4 lung cancer types, includes 203 samples with 12600 genes. The Leukemia data set [18], which contains 3 kinds of samples, includes 72 samples with 11225 genes. The SRBCT(Small round blue cell tumors of childhood) data set [19], which contains 4 types of tu mors, includes 83 samples with 2308 genes. The 11_Tumors data set [20], which contains 11 various hu man tumor types, includes 174 samples with 12533 genes. The 9_Tumors data set [21], which contains 9 various human tumor types, includes 60 samples with 5726 genes. The results are listed in Table 3. From the experiments we can see that, for multiclass classification the proposed MRSRC does not have clear advantages over SVM and SRC. The reason is that the training samples are very few so that the extracted metasamples cannot fully represent the information of these classes. For example, 9_Tumors data set only have 60 samples but 9 classes. One class only has 7 samples so that the training samples can not fully represent the testing sample. 4. Conclusions In this paper, using Sparse Coding and Robust theory, we proposed a MetasampleBased Robust Sparse Represen tation Classifier (MRSRC). Comparing MRSRC with SRC and SVM on various types of DNA microarray data, the experimental results valida ted that MRSRC is effective and efficient in tumor classification. One important advan tage of MRSRC is that MRSRC show robustness to various types of outliers or noisy because the reweighted func tion can reduce the outlier’s affection in each iteration. Table 3. The Multiclass Classification Accuracies by Dif ferent Methods. Dataset SVM SRC MRSRCSVD Lung cancer 96.05 95.07 95.07 Leukemia 96.60 95.83 97.22 SRBCT 100 100 100 11_Tumors 94.68 94.83 95.40 9_Tumors 65.10 66.67 60.00 One should be noted is that our method is based the as sumption that any testing sample can be well represented as a linear combination of the training samples from the same class. This means that the training samples should be many enough. Otherwise, the experimental results may be not famous. In future, we will use gen e selection and SVM or NMF to further reduce training samples’ dimension, speed up calculation and improve accuracy of classification. 5. Acknowledgements This work was supported by the National Science Foun dation of China under Grant No. 61272339, the Science Foundation of Anhui Province under Grant No. 1308085MF85, the China Postdoctoral Science Founda tion Funded Project under Grant No. 2012M510091 , and the Key Project of Anhui Educational Committee under Grant No. KJ2012A005. REFERENCES [1] J. Mairal, M. Elad and G. Sapirol, “Sparse representation for color image restoration,” IEEE Transactions on image processing, Vol. 17, No. 1, pp. 5369, 2008, 625. [2] C. H. Zheng, L. Zhang, T. Y. Ng, Simon C. K. Shiu and D. S. Huang, “MetasampleBased Sparse Representation for Tumor Classification,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 8, No. 5, 2011, pp. 12731282. doi:10.1109/TCBB.2011.20 [3] Y. N. Liu, F. Wu, Z. H. Zhang, Y. T. Zhuang and S. C. Yan, “Sparse face recognition under nonnegative curds and whey,” IEEE Conference on Computer Vision and Pattern Recognition, 2010,p. 626. [4] S. H. Gao, I. W. H. Tsang, L. T. Chia and P. L. Zhao. “Local Features Are not LonelyLaplacian Sparse Coding for Image Classification,” IEEE Conference on computer vision and pattern recognition, 2010, p. 626. [5] S. H. Ji, Y. Xue and L. Carin. “Bayesian Compressive Sensing,” IEEE Transactions on Singal Processing, Vol. 56, No. 6, 2008, pp. 23462356. [6] J. J. Wang, J. C. Yang, K. Yu, F. J. Lv, T. Huang and Y. H. Gong, “LocalityConstrained Linear Coding for Image Classification,” IEEE Conference on Computer Vision and Pattern Recognition, 2010, p. 626. [7] J. Wright and Y. Ma, “Dense Error Correction Via l1 Minimization,” IEEE Transactions on Information The ory, Vol. 56, No. 7, 2010, pp. 35403560. [8] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry and Y. Ma, “Robust Face Recognition Via Sparse Representation,” IEEE Transactions on pattern analysis and machine in telligence, Vol. 31, No. 2, 2009, pp. 210227. [9] X. Huang and F. X. Wu, “Sparse Representation for Classification of Tumors Using Gene Expression Data,” Journal of Biomedicine and Biotechnology, Vol. 2009, No. 1, 2009, pp. 16. doi:10.1155/2009/403689 [10] T. R. Golub, D. K. Slonim et al., “Molecular Classifica Copyright © 2013 SciRes. ENG
B. GAN ET AL. Copyright © 2013 SciRes. ENG 83 tion of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring,” Science, No. 286, 1999, pp. 531537. doi:10.1126/science.286.5439.531 [11] J. G. Liao and K. V. Chin, “Logistic Regression for Dis ease Classification Using Microarray Data: Model Selec tion in a Large p and Small n Case,” Bioinformatics, Vol. 23, No. 15, 2007, pp. 19451951. doi:10.1093/bioinformatics/btm287 [12] H. H. Zhang, J. Ahn, X. L in an d C. Par k, “Gen e Sele ction Using Support Vector Machines with NonConvex Pen alty,” Bioinformatics, Vol. 22, 2006, pp. 8895. doi:10.1093/bioinformatics/bti736 [13] O. Alter, P. O. Brown and D. Botstein, “Singular Value Decomposition for GenomeWide Expression Data Proc essing and Modeling,” Proceeding of the National Academy of Sciences of the United State of America, Vol. 97, 2000, pp. 1010110106. doi:10.1073/pnas.97.18.10101 [14] J. P. Brunet, P. Tamayo, T. R. Golun and J. P. Mesirov, “Metagenes and Molecular Pattern Discovery Using Ma trix Factorization,” Proceeding of the National Academy of Sciences of the United State of America, Vol. 101, No. 12, 2004, pp. 41644169.doi:10.1073/pnas.0308531101 [15] D. S. Huang and C. H. Zheng, “Independent Component AnalysisBased Penalized Discriminant Method for Tu mor Classification Using Gene Expression Data,” Bioin formatics, Vol. 22, No. 15, 2006, pp. 18551862. doi:10.1093/bioinformatics/btl190 [16] W. Liebermeister, “Linear Modes of Gene Expression Determinded by Independent Component Analysis,” Bio informatics, Vol. 18, 2002, pp. 5160. doi:10.1093/bioinformatics/18.1.51 [17] A. Bhattacharjee et al., “Classification of Human Lung Carcinomas by mRNA Expression Profiling Reveals Dis tinct Adenocarcinoma Subclasses,” Proceeding of the Na tional Academy of Sciences of the United State of Amer ica, Vol. 98, 2001, pp. 1379013795. doi:10.1073/pnas.191502998 [18] S. A. Armstrong, J. E. Staunton, L. B. Silverman, R. Pieters, M. L. den Boer, M. D. Minden, S. E. Sallan, E. S. lander, T. R. Golub and S. J. Korsmeyer, “MLL Translo cations Specify a Distinct Gene Expression Profile that Distinguishes a Unique Leukemia,” Nature Genetics, Vol. 30, 2002, pp. 4147. doi:10.1038/ng765 [19] J. Khan, J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson and P. S. Meltzer, “Classification and Diag nostic Prediction of Cancers Using Gene Expression Pro filing and Artificial Neural Networks,” Nature Medicine, Vol. 7, No. 6, 2001, pp. 673679. doi:10.1038/89044 [20] A. I. Su et al., “Molecular Classification of Human Car cinomas by Use of Gene Expression Signatures,” Cancer Research, Vol. 61, No. 20, 2001, pp. 73887393. [21] J. E. Staunton et al., “Chemosensitivity Prediction by Transcriptional Profiling,” Proceeding of the National Academy of Sciences of the United State of America, Vol. 98, No. 19, 2001, pp. 1078710792. doi:10.1073/pnas.191368598 [22] R. Tibshirani, “Regression Shrinkage and Selection Via the Lasso,” Journal of the Royal Statistical Society B, Vol. 58, No. 1, 1996, pp. 267288. [23] M. Yang, L. Zhang and J. Yang et al. “Regularized Ro bust Coding for Face Recognition,” 2012.
