Journal of Intelligent Learning Systems and Applications, 2010, 2, 200-211
doi:10.4236/jilsa.2010.24023 Published Online November 2010 (http://www.scirp.org/journal/jilsa)
Copyright © 2010 SciRes. JILSA

Fast Variable Selection by Block Addition and Block Deletion

Takashi Nagatani, Seiichi Ozawa*, Shigeo Abe*
Graduate School of Engineering, Kobe University, Rokkodai, Nada, Kobe, Japan
E-mail: {*ozawasei, abe}@kobe-u.ac.jp
Received August 20th, 2010; revised September 25th, 2010; accepted September 27th, 2010.

ABSTRACT

We propose the threshold updating method for terminating variable selection and two variable selection methods. In the threshold updating method, we update the threshold value when an approximation error smaller than the current threshold value is obtained. The first variable selection method is the combination of forward selection by block addition and backward selection by block deletion. In this method, starting from the empty set of input variables, we add several input variables at a time until the approximation error is below the threshold value. Then we search for deletable variables by block deletion. The second method is the combination of the first method and variable selection by Linear Programming Support Vector Regressors (LPSVRs). By training an LPSVR with linear kernels, we evaluate the weights of the decision function and delete the input variables whose associated absolute weights are zero. Then we carry out block addition and block deletion. By computer experiments using benchmark data sets, we show that the proposed methods can perform faster variable selection than the method using only block deletion, and that by the threshold updating method the approximation error is lower than that by the fixed threshold method. We also compare our method with an embedded method, which determines the optimal variables during training, and show that our method gives comparable or better variable selection performance.
Keywords: Backward Selection, Forward Selection, Least Squares Support Vector Machines, Linear Programming Support Vector Machines, Support Vector Machines, Variable Selection

1. Introduction

Function approximation estimates a continuous value for the given inputs based on the relationship acquired from a set of input-output pairs. As a tool to perform function approximation, Support Vector Machines (SVMs) [1,2] proposed by Vapnik attract much attention. Although SVMs were developed for pattern recognition, they have been extended to solving function approximation problems in the form of Support Vector Regressors (SVRs) [3], Least Squares Support Vector Regressors (LSSVRs) [4], and Linear Programming Support Vector Regressors (LPSVRs) [5].

In developing a regressor, we may encounter problems such as the high computational cost caused by a large number of input variables and deterioration of the generalization ability by redundant input variables. Variable selection is one of the effective ways of reducing computational complexity and improving the generalization ability of the regressor.

According to the selection criterion used, variable selection methods are classified into wrapper methods and filter methods. Wrapper methods use the recognition rate as the selection criterion, and filter methods use a selection criterion other than the recognition rate. Wrapper methods provide good generalization ability but usually at a much larger computational cost. Although the computational cost of filter methods may be small, they run the risk of selecting a subset of input variables that deteriorates the generalization ability of the regressor.

Wrapper and filter methods are applied before training regressors. But because SVRs are trained by solving optimization problems, embedded methods, which select variables during training, have been developed [6].

Either by wrapper or filter methods, it is hard to test the performance of all the subsets of input variables.
Thus, we generally perform backward selection or forward selection. There is also a combination of forward selection with backward selection [7,8]. In [7], sequential forward selection is applied l times, followed by r times of sequential backward selection, where l and r are fixed integer values.

To speed up wrapper methods, some variable selection methods using LPSVRs with linear kernels [9] have been proposed. For simplicity, hereafter we call LPSVRs with linear kernels linear LPSVRs. Linear LPSVRs are usually used to preselect input variables. After training a linear LPSVR, input variables are ranked according to the absolute values of the weights, and the input variables with small absolute values are deleted. But because nonlinear function approximation problems are approximated by linear LPSVRs, nonlinear relations may be overlooked. To overcome this problem, in [9], the data set is divided into 20 subsets, and for each subset the weight vector is obtained by training a linear LPSVR. Then the variables with large absolute weights that often appear among the 20 subsets are selected for nonlinear function approximation.

Backward variable selection by block deletion (BD) [10] can select useful input variables faster than conventional backward variable selection. This method uses as a selection criterion the generalization ability estimated by cross-validation. To speed up variable selection, it deletes multiple candidate variables at a time. However, even by this method it is difficult to perform variable selection for a high-dimensional data set. Furthermore, this method sets, as the threshold value, the approximation error of the validation set using all the input variables. Therefore, if the initial input variables include variables irrelevant for function approximation, the threshold value may not be appropriate and high generalization ability may not be obtained by variable selection.
To overcome the above problems, in this paper we propose a threshold updating method and two variable selection methods based on block deletion and block addition of variables. In the threshold updating method, first we set the threshold value to the approximation error evaluated using all the variables. Then, during the variable selection process, if an approximation error better than the current threshold value is obtained, we replace the threshold value with that approximation error. This prevents deleting useful variables and thus leads to finding a variable set that shows approximation performance better than the initial set of variables.

To realize efficient variable selection for a high-dimensional data set, we combine forward selection by block addition and backward selection by block deletion. We set the threshold value of selection by evaluating the approximation error using all the variables. Starting from the empty set of variables, we estimate the approximation error when one input variable is temporarily added to the selected set of variables and rank the variables in ascending order of the approximation errors. Then we add several high-ranked variables at a time and rank the remaining variables. We iterate adding variables and ranking the remaining variables, and terminate addition when the approximation error is below the threshold value.

After forward selection is finished, we delete, by block deletion, variables that were redundantly added. Namely, we rank variables in ascending order of the approximation errors obtained by temporarily deleting one input variable at a time, and delete high-ranked variables simultaneously.

To further speed up variable selection for a large number of input variables, we combine variable selection by linear LPSVRs with block addition and block deletion. Namely, we train a linear LPSVR using all the input variables and delete the input variables whose associated absolute weights are zero.
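The zero-weight preselection step can be illustrated with any L1-penalized linear model, since it is the L1 term of the linear LPSVR that drives irrelevant weights exactly to zero. The sketch below is a stand-in, not the paper's LPSVR: it fits an L1-penalized least-squares model by coordinate descent (the function names and the toy data are ours) and keeps only the variables with non-zero weights.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used in L1 coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def l1_linear_fit(X, y, alpha=0.05, n_iter=100):
    """Fit an L1-penalized linear model by coordinate descent.
    The L1 penalty sets the weights of irrelevant variables exactly
    to zero, mimicking the sparsity of a linear LPSVR solution."""
    n, m = len(X), len(X[0])
    w = [0.0] * m
    col = [[row[j] for row in X] for j in range(m)]
    for _ in range(n_iter):
        for j in range(m):
            # partial residual excluding feature j's current contribution
            r = [y[i] - sum(X[i][k] * w[k] for k in range(m) if k != j)
                 for i in range(n)]
            rho = sum(col[j][i] * r[i] for i in range(n)) / n
            norm = sum(v * v for v in col[j]) / n
            w[j] = soft_threshold(rho, alpha) / norm
    return w

# toy data (ours): only variables 0 and 2 actually influence the output
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(6)] for _ in range(200)]
y = [3.0 * row[0] - 2.0 * row[2] + 0.01 * random.gauss(0, 1) for row in X]

w = l1_linear_fit(X, y, alpha=0.05)
kept = [j for j, wj in enumerate(w) if abs(wj) > 1e-8]
print(kept)  # variables with non-zero weight survive preselection
```

On this toy problem the four noise variables receive exactly zero weight and are dropped, which is the behavior the preselection step relies on.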
After deletion, we compare the approximation error for the selected variable set with the threshold value. If the approximation error is above the threshold value, we add variables by block addition and then delete redundant variables by block deletion. If the approximation error is below the threshold value, we delete variables by block deletion.

In Section 2, we discuss the selection criteria and the stopping conditions of the proposed methods. Then in Sections 3 and 4 we discuss the proposed methods, and in Section 5 we show the results of computer experiments using benchmark data sets. Finally, in Section 6, we conclude our work.

2. Selection Criteria and Stopping Conditions

How and when to stop variable selection is one of the important problems in variable selection. To obtain a set of variables whose generalization ability is comparable with or better than that of the initial set of variables, as the selection criterion we use the approximation error of the validation data set in cross-validation.

One way to obtain the smallest set of variables with high generalization ability is by a fixed threshold value. To do this, before variable selection, we set the threshold value for the selection criterion by evaluating the approximation error using all the variables. Let the threshold be T. Then T is determined by

T = E_m, (1)

where m is the number of initial input variables and E_m is the approximation error of the validation set by cross-validation. We fix T to E_m throughout variable selection, and delete variables so long as the approximation error of the current set of variables is below the threshold, or add variables so long as the approximation error is above the threshold value. This method is called the fixed threshold method.
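As a concrete illustration of the fixed threshold rule, the sketch below runs simple sequential backward deletion against a threshold T fixed to the full-set error as in Equation (1). It is our own minimal stand-in: the paper's BD deletes ranked blocks of candidates, and its error is the LSSVR cross-validation error, which we replace here with a hypothetical toy error model.

```python
def fixed_threshold_deletion(variables, cv_error):
    """Backward deletion under a fixed threshold T = E_m (Equation (1)).
    `cv_error` is a caller-supplied function returning the cross-validation
    error of a variable subset (a stand-in for the LSSVR CV error)."""
    T = cv_error(variables)          # threshold fixed to the full-set error
    selected = list(variables)
    changed = True
    while changed:
        changed = False
        for v in list(selected):
            trial = [u for u in selected if u != v]
            # delete v only if the error stays at or below the threshold
            if trial and cv_error(trial) <= T:
                selected = trial
                changed = True
    return selected

# hypothetical error model: variables 1 and 2 are informative,
# every extra variable adds a small amount of noise
def toy_cv_error(subset):
    informative = {1, 2}
    missing = len(informative - set(subset))
    redundant = len(set(subset) - informative)
    return 1.0 * missing + 0.01 * redundant

print(fixed_threshold_deletion([1, 2, 3, 4, 5], toy_cv_error))  # → [1, 2]
```

Dropping a redundant variable lowers the toy error, so it stays below the fixed threshold and the deletion is kept; dropping an informative variable raises the error above T and is rejected.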
By the fixed threshold method, if the initial set of variables includes irrelevant or redundant variables, the threshold value calculated using all the variables might be conservative. Namely, if these variables are deleted, we may obtain an approximation error smaller than that with the initial set of variables. To obtain a set of variables whose approximation error is smaller than that of the initial set of variables, we update the threshold value whenever we obtain an approximation error smaller than the threshold value, as follows. Let the current approximation error with j input variables be E_j. Then if

E_j < T, (2)

we take E_j as the new threshold value, that is, we set

T = E_j (3)

and continue variable selection. We call this method the threshold updating method. By the threshold updating method, the obtained set of variables gives a smaller approximation error for the validation data set than that by the fixed threshold method. But the number of selected variables may increase.

3. Variable Selection by Block Addition and Block Deletion

If, for a given approximation problem, the number of input variables is large, backward selection will be inefficient. To speed up variable selection in such a situation, in the following we discuss the first proposed method, called variable selection by block addition and block deletion (BABD), which adds and deletes multiple variables according to the variable ranking.

3.1. Idea

Suppose that for a function approximation problem with m variables we select k variables. We examine the computational complexity of selecting the set of variables either by sequential forward or backward selection with some appropriate selection criterion. First we consider selecting k variables by sequential forward selection. Then, starting with the empty set, we temporarily add one variable at a time and evaluate the selection criterion, and permanently add the one variable with the best selection criterion.
In this method we need to evaluate the selection criterion m + (m − 1) + … + (m − k + 1) = k(2m − k + 1)/2 times. In this case the number of variables used for evaluating the criterion changes from 1 to k.

By sequential backward selection, starting from m variables, we temporarily delete one variable at a time and evaluate the selection criterion, and permanently delete the one variable with the best selection criterion. By this method we need to delete (m − k) variables. The number of evaluations of the selection criterion is (m − k)(m + k + 1)/2, which equals that of forward selection for k = m/2. However, the number of variables used for evaluating the selection criterion changes from m to k. Therefore, for k < m/2, forward selection will be faster than backward selection. This tendency becomes prominent in the case where k ≪ m/2.

From the standpoint of the quality of the set of selected variables, backward selection, which deletes irrelevant or redundant variables from the variable set while considering the relations among the remaining variables, is more stable than forward selection, which selects variables that are important only with respect to the already selected variables, without considering the relations with the unselected variables. For instance, the variable that is selected first is the best if only one variable is used. But as variables are added, it may become redundant. Therefore, to speed up backward selection, we use forward selection as a preselector and afterwards perform backward selection on the set of selected variables. To speed up the forward and backward selection processes, we add or delete multiple variables at a time and repeat addition or deletion until the stopping condition is satisfied.

3.2. Method

We explain BABD for the fixed threshold method. First, we calculate the approximation error E_m for the initial set of input variables I_m = {1, …, m} and set the threshold value of the stopping condition T = E_m. We start from the empty set of selected variables.
Assume that we have selected j variables; thus the set of selected input variables is I_j. Then we temporarily add the ith input variable in the set I_m − I_j to I_j and calculate E_j^{i,add}, where the superscript i,add indicates that the ith input variable is added to the variable set. Then we rank the variables in I_m − I_j in ascending order of the approximation errors. We call this ranking the variable ranking V_j.

We temporarily add k (k ∈ {1, 2^1, …, 2^A}) input variables from the top of V_j to the variable set, where 2^A ≤ m and A is a user-defined parameter that determines the number of addition candidates. We compare the error E_{j+k} with the threshold T. If

E_{j+k} ≤ T, (4)

we add the variables to the variable set permanently. If (4) is not satisfied for k = 1, 2^1, …, 2^A, we check whether for some k the approximation error is decreased by adding k variables to I_j:

E_{j+k} < E_j. (5)

Here we assume that E_0 = ∞. If (5) is satisfied, let

k = arg min_{i ∈ {1, 2^1, …, 2^A}} E_{j+i} (6)

and we add to I_j the first k variables in the variable ranking. Then the current set of selected variables is I_{j+k}. We iterate the above procedure for I_{j+k} until (4) is satisfied. If (5) is not satisfied, we consider that block addition of variables has failed and perform block deletion from the initial set of variables. This is because, if additions of variables that do not improve the approximation accuracy continue, the efficiency of variable selection is impaired. Thus, we finish variable addition at the first failure.

Let the set of variables obtained after block addition be I_j. If block addition has failed, we set I_j = I_m. Now, by block deletion, we delete redundant variables from I_j. The reason for block deletion is as follows. In block addition, we evaluate the variable ranking by temporarily adding one input variable, and we add multiple high-ranked variables. Namely, in block addition we do not check the correlation of variables added at the same time. Thus, redundant variables may be added by block addition.

We temporarily delete the ith variable in I_j from I_j and calculate E_j^{i,del}, where E_j^{i,del} is the approximation error when the ith variable is deleted from I_j. Then we consider the input variables that satisfy

E_j^{i,del} ≤ T (7)

as candidates for deletion and generate the set of candidate variables for deletion:

S_j = {i | E_j^{i,del} ≤ T, i ∈ I_j}. (8)

We rank the candidates in ascending order of E_j^{i,del} and temporarily delete all the candidates from I_j. We compare the error E_{j'} with the threshold T, where j' is the number of input variables after the deletion. If

E_{j'} ≤ T, (9)

block deletion has succeeded and we delete the candidate variables permanently from I_j. If block deletion has failed, we backtrack and delete half of the variables previously deleted. We iterate this procedure until block deletion succeeds. When the accuracy is not improved by deleting any input variable, we finish variable selection.

In the threshold updating method, we update the threshold value by (3) if (2) is satisfied for I_j.

In the following, we show the algorithm of BABD. The difference between the fixed threshold method and the threshold updating method appears in Steps 3 and 6.

Block Addition
Step 1 Calculate E_m for I_m. Set T = E_m, j = 0, and E_0 = ∞, and go to Step 2.
Step 2 Temporarily add the ith input variable in I_m − I_j to I_j, calculate E_j^{i,add}, and generate V_j. Set k = 1 and go to Step 3.
Step 3 According to the fixed threshold or threshold updating method, do the following.
For the fixed threshold method: Add the k input variables from the top of V_j to I_j, calculate E_{j+k}, and compare it with the threshold T. If (4) is satisfied, set j ← j + k and go to Step 4. Otherwise, if k < 2^A, set k ← 2k and repeat Step 3. If k = 2^A and (5) is satisfied, add the k variables to I_j, where k is given by (6), and go to Step 2. If (5) is not satisfied, set j ← m and go to Step 4.
For the threshold updating method: Calculate E_{j+k} (k = 1, 2^1, …, 2^A) and, if (5) is satisfied, set j ← j + k with k given by (6), set T ← E_j, and go to Step 2. Otherwise, if T = E_j, go to Step 4; and if T < E_j, set j ← m and go to Step 4.

Block Deletion
Step 4 Temporarily delete the ith input variable in I_j and calculate E_j^{i,del}.
Step 5 Calculate S_j. If S_j is empty, stop variable selection. If only one input variable is included in S_j, set I_{j−1} = I_j − S_j, j ← j − 1, and go to Step 4. If S_j has more than one input variable, generate V_j and go to Step 6.
Step 6 Delete all the variables in V_j from I_j: I_{j'} = I_j − V_j, where j' = j − |V_j| and |V_j| denotes the number of elements in V_j. Then calculate E_{j'} and, if (9) is not satisfied, go to Step 7. Otherwise, do the following.
For the fixed threshold method: Update j with j' and go to Step 4.
For the threshold updating method: Update j with j', set T ← E_{j'}, and go to Step 4.
Step 7 Let V'_j include the upper half of the elements of V_j. Set I_{j'} = I_j − V'_j, where j' = j − |V'_j|. Then, if (9) is satisfied, delete the input variables in V'_j and go to Step 4, updating j with j'. Otherwise, update V_j with V'_j and iterate Step 7 so long as (9) is not satisfied.

3.3. Complexity and Suitability of the Method

To keep the discussion simple, we consider the fixed threshold method. First consider the complexity of block addition in contrast to sequential addition. For m variables, assume that only one variable satisfies the stopping condition. By sequential addition we need to evaluate the selection criterion m times, and by selecting one variable the algorithm stops. By block addition, the selection criterion is evaluated m times and, at the first
block addition, the one variable that satisfies the stopping condition is added and block addition terminates. Thus, the complexity of calculation is the same for both methods.

For the case where 2^A (> 1) variables are successfully added, by sequential addition the selection criterion is evaluated 2^A (m − 2^{A−1} + 1/2) times. Suppose that block addition of 2^A variables succeeds after A failures of block addition. Then the selection criterion needs to be calculated m + A times. Therefore, for A ≥ 1, block addition is faster. Here we assume that the same sets of variables are added by both methods. But because block addition does not consider the correlation between variables, more than m + A calculations may be necessary. In this case, the added redundant variables will be deleted by the subsequent block deletion.

Now consider backward selection. For m variables, assume that we can delete only one variable, but 2^i variables are candidates for deletion. By sequential deletion, we need to evaluate the selection criterion 2m − 1 times, but by block deletion, 2m + i − 1 times because of i failures. Thus, by block deletion we need to calculate the selection criterion i more times. This is the worst case. Now consider the best case, where 2^i variables are successfully deleted by block deletion and no further deletion is possible. By block deletion, we need to calculate the criterion m + 1 + m − 2^i = 2m − 2^i + 1 times, but by sequential deletion, m + (m − 1) + … + (m − 2^i + 1) + (m − 2^i) times. Thus, for i > 1, block deletion is faster.

Therefore, the effect of block addition and block deletion becomes prominent when many variables are successfully added or deleted.

4. Variable Selection by Block Addition and Block Deletion with LPSVRs

If the number of input variables is very large, even block addition or block deletion may be inefficient, because during variable selection the approximation error needs to be estimated by deleting or adding one input variable at a time. To overcome this problem, preselection of input variables is often used [11]. We preselect variables by LPSVRs with linear kernels before block addition and deletion. We call this method variable selection by block addition and block deletion with LPSVRs (BABD-LP).

If the input-output relations of the given problem are nonlinear, we may not be able to obtain a good initial set of variables by preselection. This problem is solved by properly combining preselection by the LPSVR with BABD. Namely, if the approximation error obtained by preselection is not satisfactory, we add variables by block addition to the set of variables obtained by preselection. And if it is satisfactory, we delete variables by block deletion from the set.

The primal problem of the linear LPSVR is given by

minimize Q(w, b, ξ, ξ*) = Σ_{i=1}^{m} |w_i| + C Σ_{i=1}^{M} (ξ_i + ξ_i*)
subject to y_i − w^T x_i − b ≤ ε + ξ_i for i = 1, …, M,
w^T x_i + b − y_i ≤ ε + ξ_i* for i = 1, …, M,

where x_i and y_i are the ith (i = 1, …, M) training input and output, respectively, w_i is the ith element of the weight vector w, associated with the ith input variable of x, b is the bias term, ε is the user-defined error threshold, C is the margin parameter, and ξ_i and ξ_i* are the slack variables associated with x_i. As will be explained in Section 5.2, the values of C and ε are determined by cross-validation.

In the decision function of the linear LPSVR, each input variable is multiplied by an associated weight element w_i. Therefore, if a weight element is zero, the associated variable does not contribute to function approximation and we can delete such variables [11]. In addition, because the number of equality constraints is 2M, at most 2M weight elements take non-zero values. Thus, if the number of input variables is larger than 2M, by training the linear LPSVR we can delete at least m − 2M variables whose weight elements are zero. But if we use a linear SVR, the number of non-zero weight elements is not restricted to 2M.

At first, we set the threshold of the stopping condition T = E_m from the initial set of variables I_m. By training a linear LPSVR, we calculate the weight elements w_i (i = 1, …, m) and delete the input variables with zero weights. We denote by j_LP the number of input variables after preselection. Then we compare the current approximation error E_{j_LP} with the threshold T. If

E_{j_LP} ≤ T, (10)

we search for more deletable variables by block deletion. If (10) is not satisfied, we add the variables that improve the approximation accuracy to the current variable set by block addition. After block addition is finished, we delete variables by block deletion. In the threshold updating method, if (10) is satisfied, we update T by E_{j_LP}.

In the following we show the algorithm of BABD-LP. After preselection, block addition and block deletion are performed, but since these procedures are the same as those in Section 3.2, we do not repeat them here.

Preselection
Step 1 Calculate E_m for I_m. Set T = E_m and go to Step 2.
Step 2 Calculate w_i (i = 1, …, m) by training the linear LPSVR. From I_m, delete the variables whose associated weight elements are zero. Let the resulting set of variables obtained by preselection be I_{j_LP}.
Step 3 Calculate E_{j_LP} for I_{j_LP}. If (10) is not satisfied, set j ← j_LP, E_j ← E_{j_LP}, and go to Step 2 in Section 3.2. Otherwise, do the following.
For the fixed threshold method: Set j ← j_LP and go to Step 4 in Section 3.2.
For the threshold updating method: Set j ← j_LP and T ← E_{j_LP}, and go to Step 4 in Section 3.2.

5. Performance Evaluation

In this section, we compare BABD and BABD-LP with BD using benchmark data sets. We also compare the proposed methods with the embedded method [9]. To distinguish the threshold updating method from the fixed threshold method, we add the subscript u to the abbreviation of the proposed method, e.g., BABDu.

5.1. Use of Least Squares Support Vector Regressors as Regressors

By setting some appropriate selection criterion, BABD can be used as a filter method, but to obtain a reliable set of variables, in the performance evaluation we use the approximation error for the validation data set evaluated by cross-validation with Least Squares Support Vector Regressors (LSSVRs). According to [10], the use of LSSVRs or SVRs in variable selection or in evaluating the approximation errors does not make much difference. Therefore, in variable selection we use LSSVRs because their training is faster than that of SVRs for small and medium size regression problems.

The primal problem of the LSSVR is given by

minimize (1/2) w^T w + (C/2) Σ_{i=1}^{N} ξ_i^2 (11)
subject to y_i = w^T g(x_i) + b + ξ_i for i = 1, …, N, (12)

where g(x) is the mapping function that maps x into the feature space. In training the LSSVR, we solve the set of linear equations derived by transforming the primal problem into the dual problem. As a kernel function, we use RBF kernels: H(x, x') = exp(−γ ||x − x'||^2), where H(x, x') = g(x)^T g(x') and γ is a parameter that determines the spread of the radius.

For LSSVRs, we need to optimize the values of the margin parameter and the kernel parameter. But for SVRs, in addition to these two parameter values, we need to optimize the value of the tube parameter ε. Thus, using LSSVRs we can perform faster variable selection than using SVRs.

We determine the initial approximation error by fivefold cross-validation, changing the values of the kernel and margin parameters. Namely, we train the LSSVR for all pairs of parameter values and select the values that realize the minimum approximation error for the validation data set. Then, to reduce the computational cost of training the LSSVR during variable selection, we fix the kernel parameter value and optimize the margin parameter by cross-validation. Thus, there is a possibility of improving variable selection by determining both the kernel and margin parameter values, but at a considerable computation cost.

5.2. Benchmark Data Sets and Evaluation Conditions

We use the benchmark data sets shown in Table 1, which lists the numbers of input variables, training data, and test data, and the type of each data set. We also list the minimum number of selected features found in the literature. The bottom three data sets deal with pattern classification and are used for comparing our methods with the NLPSVM (Newton method for Linear Programming SVM) [9].

To examine whether our proposed methods can delete redundant variables, we modify the Mackey-Glass data set [12], which is generated by a differential equation without noise. We add 18 artificial redundant input variables generated by a uniform random variable in [0, 1] to the set. The Boston 5 and Boston 14 data sets [13] use the 5th and 14th input variables of the Boston data set as outputs, respectively. Excluding the Mackey-Glass data set, these data sets are real data sets.

Table 1. Specifications of data sets and selected features.

Data Set | Inputs | Train. | Test | Type | Selected
Mackey-Glass [12] | 22 | 500 | 500 | regres. | —
Boston 5 [13,14] | 13 | 506 | — | regres. | —
Boston 14 [13,14] | 13 | 506 | — | regres. | 4 [15]
Water Purification [12] | 10 | 241 | 237 | regres. | —
Pyrimidines [16] | 27 | 74 | — | regres. | 5 [17]
Triazines [16] | 60 | 186 | — | regres. | 2 [17]
Phenetylamines [18] | 628 | 22 | — | regres. | 30 [19]
Orange Juice [20] | 700 | 150 | 68 | regres. | 7 [21]
Ionosphere [16] | 34 | 351 | — | class. | 11.2 [9]
BUPA Liver [16] | 6 | 345 | — | class. | 4.9 [9]
Pima Indians [16] | 8 | 768 | — | class. | 4.9 [9]

In our experiments, except for the Mackey-Glass, ionosphere, BUPA liver, and Pima Indians data sets, we combine the training data set with the test data set and randomly divide the set into two.
In doing so, we make 20 pairs of training and test data sets from the original pair of training and test data sets.

We determine the values of the kernel parameter γ, the error-threshold parameter ε, and the margin parameter C by fivefold cross-validation. In the experiments on the regression data sets, we change γ = {0.1, 0.5, 1.0, 5.0, 10, 15, 20, 50, 100} and C = {1, 10, 100, 1000, 5000, 10^4, 10^5} for LSSVRs. In the experiments on the classification data sets, we change C = {100, 1000, 5000, 10^4, 10^5, 10^6, 10^7, 10^8}. Since we do not want to waste time in preselection, and since variable selection is improved by block addition and block deletion, we set the parameter ranges for LPSVRs as follows: C = {10, 10000} and ε = {0.01, 0.2}.

We set the user-defined parameter A according to the dimension of the data sets: A = 3 for the data sets with fewer than 100 input variables and A = 5 otherwise. The approximation error is measured by the mean absolute error. We use an Athlon 64 X2 4800+ personal computer (2 GB memory, Linux operating system) in measuring variable selection time.

5.3. Effect of Parameters

In applying BABD or BABDu, we need to determine the value of A. In this section, we investigate the effect of A on the number of added variables and the approximation error using the phenetylamines data set. Because the number of input variables is 628, the maximum value of A is 10. Table 2 shows the approximation error of the validation data set after variable selection, the selected variables, and the numbers of variables selected by block addition and block deletion. Since the results by BABDu for A = 7 to 10 are the same as that for A = 6, and those by BABD for A = 2 to 10 are the same, we list only the results for A = 1 to 6.

By BABD, two variables are selected by block addition and no variables are deleted by block deletion. But by BABDu, the number of selected variables varies as the value of A changes, except for A = 2 and 3.
For values of A larger than 3, some of the variables added by block addition are deleted by block deletion. Variable selection time is around five seconds by BABD and 30 to 60 seconds by BABDu. By the fixed threshold method, the change of A does not make much difference, but by the threshold updating method the results change according to the change of A. Thus, to obtain the best results we need to determine the optimum value of A by cross-validation. But because that would require much computation time, in the following we fix A = 5 for data sets with more than 100 input variables and A = 3 otherwise, as stated before.

Table 2. Effect of the value of A on the performance for the phenetylamines data set. The initial approximation error for the validation data set is 0.156.

A | BABD Vali. | BABD Selected | BA | BD | BABDu Vali. | BABDu Selected | BA | BD
1 | 0.120 | 384, 404 | 2 | 2 | 0.057 | 124, 404, 604, … | 13 | 13
2 | 0.156 | 123, 404 | 2 | 2 | 0.048 | 124, 330, 341, 404, … | 11 | 11
3 | 0.156 | 123, 404 | 2 | 2 | 0.048 | 124, 330, 341, 404, … | 11 | 11
4 | 0.156 | 123, 404 | 2 | 2 | 0.033 | 8, 60, 155, 328, 358, 364, 404, … | 25 | 16
5 | 0.156 | 123, 404 | 2 | 2 | 0.025 | 8, 60, 157, 328, 330, 358, 404, … | 45 | 22
6 | 0.156 | 123, 404 | 2 | 2 | 0.039 | 8, 155, 157, 341, 364, 604, … | 74 | 28

5.4. Results for the Mackey-Glass Data Set

Table 3 shows the results for the Mackey-Glass 22 data set. In the table, the columns "Before" and "After" list the approximation errors before and after variable selection; the column "Vali." lists the approximation errors evaluated by cross-validation; "Test" lists the approximation errors for the test data sets; the column "Selected" lists the input variables selected by variable selection. The results in the column "Before" are the same for the six variable selection methods, so we list them only in the first of the six rows. The columns "LP," "BA," and "BD" denote the numbers of variables after selection by the LPSVR, block addition, and block deletion, respectively.

Table 3. Performance for the Mackey-Glass data set.

Method | Before Vali. | Before Test | After Vali. | After Test | Selected | LP | BA | BD | Time [s]
BD | 0.042 | 0.038 | 0.029 | 0.029 | 2, 4 | － | － | 2 | 94
BABD | | | 0.036 | 0.035 | 1, 2 | － | 2 | 2 | 87
BABD-LP | | | 0.029 | 0.029 | 2, 4 | 21 | 21 | 2 | 201
BDu | | | 0.019 | 0.018 | 1, 2, 3, 4 | － | － | 4 | 87
BABDu | | | 0.018 | 0.018 | 1, 2, 3, 4 | － | 4 | 4 | 139
BABD-LPu | | | 0.021 | 0.021 | 1, 2, 4 | 21 | 21 | 3 | 248

Thus, for the row "BABD-LP," 21 in the column "LP" means that by the LPSVR, 21 variables are selected from the 22 variables, and 21 in the column "BA" means that, because selection by the LPSVR does not worsen the generalization ability evaluated by cross-validation, no variables are added by block addition. The first four variables are the original input variables. Thus, from the table, no irrelevant variables are selected by the proposed methods, and preselection by the LPSVR and block addition does not fail in selecting variables. By the fixed threshold method, two variables are selected to realize the approximation error of the validation data set obtained with the initial 22 variables. By the threshold updating method, three or four of the original variables are selected and the approximation errors are decreased further. Thus the threshold updating method is useful when redundant input variables are included. Since the number of variables is only 22, BD or BDu is preferable from the standpoint of computation time.

5.5. Results for Function Approximation Problems

We evaluate the regression problems listed in Table 1 by the average approximation errors, the average number of selected variables, and the average training time over the multiple data sets. Table 4 shows the results. The column "Num." lists the number of input variables selected by variable selection; the column "Time" lists the computation time.
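The threshold updating rule used by BDu, BABDu, and BABD-LPu above can be sketched in a few lines: whenever a candidate variable set attains a validation error at or below the current threshold, the threshold is tightened to that error, so later steps must not undo the improvement. The following minimal Python sketch is illustrative only; the candidate sets and errors are made up (loosely echoing the Mackey-Glass numbers), and the real methods generate candidates by block addition and block deletion rather than from a fixed list.

```python
# Minimal sketch of the threshold updating rule: whenever a candidate
# variable set achieves a validation error below the current threshold,
# the threshold is tightened to that error.  Values here are illustrative.

def select_with_updating_threshold(candidates, initial_error):
    """candidates: list of (variable_set, validation_error) pairs,
    examined in the order the selection procedure produces them."""
    threshold = initial_error
    best_set = None
    for variables, error in candidates:
        if error <= threshold:   # acceptable: keep this set and tighten the threshold
            threshold = error
            best_set = variables
    return best_set, threshold

# Toy run: the updating method keeps shrinking the threshold,
# ending at the lowest validation error seen so far.
cands = [({1, 2}, 0.029), ({1, 2, 3, 4}, 0.018), ({1, 2, 4}, 0.021)]
best, thr = select_with_updating_threshold(cands, initial_error=0.042)
print(best, thr)  # {1, 2, 3, 4} 0.018
```

With a fixed threshold, by contrast, the last candidate ({1, 2, 4}, 0.021) would also have been accepted, since 0.021 is below the initial error 0.042.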
We analyze the statistical difference of the approximation errors and the computation time by the Welch t-test with a 5% significance level. We test the statistical differences of the approximation errors between the initial set of variables and the selected set of variables for the validation data sets and the test data sets. If there is a significant difference, we attach an asterisk to the better result. If there are significant differences among the results by BD, BABD, and BABD-LP or by BDu, BABDu, and BABD-LPu, we set the best result in bold. For example, for the results of the triazines data set, the asterisks in the "After" "Vali." column mean that the approximation errors of the validation sets after variable selection are significantly smaller than those before variable selection. And the asterisks for the BABD, BABD-LP, and BABD-LPu results in the "Before" "Test" column mean that the approximation errors before variable selection are significantly smaller than those after variable selection. Furthermore, the bold font in the BABD-LP (BABD-LPu) row means that the computation time of BABD-LP (BABD-LPu) is significantly shorter than those of the other two methods.

From the table, the computation time and the approximation errors of BD, BDu, BABD, and BABDu are almost the same for the Boston 14, Boston 5, and water purification data sets. This means that for low-dimensional data sets, there is not much difference between forward selection and backward selection. But for BABD-LP and BABD-LPu, it takes much time to train LPSVRs and we cannot delete many variables by the preselection. Therefore, BABD-LP and BABD-LPu are ineffective for these data sets. For the pyrimidines, triazines, and phenetylamines problems, one or two input variables are enough to obtain accuracy comparable to the initial threshold. In addition, many input variables are deleted by preselection using LPSVRs.
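The Welch t-test used above compares the mean errors of two conditions over the 20 resampled data sets without assuming equal variances; in Python it is selected by passing equal_var=False to scipy.stats.ttest_ind. The error samples below are synthetic, for illustration only.

```python
# Welch t-test at the 5% significance level, as used in the analysis above.
# The two samples stand for the 20 per-data-set approximation errors before
# and after variable selection; the values are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
errors_before = rng.normal(loc=0.030, scale=0.003, size=20)
errors_after = rng.normal(loc=0.020, scale=0.003, size=20)

# equal_var=False selects Welch's t-test rather than Student's t-test
t_stat, p_value = stats.ttest_ind(errors_before, errors_after, equal_var=False)
if p_value < 0.05:  # 5% significance level
    print("significant difference; mark the smaller mean with an asterisk")
```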
Accordingly, BABD, BABDu, BABD-LP, and BABD-LPu can perform high-speed variable selection.

Figure 1. Performance comparison for the orange juice data set by the fixed threshold method: (a) The approximation error; (b) The number of selected variables. [Both panels plot against computation time, with curves for BD, BABD, and BABD-LP.]

Table 4. Performance comparison of the proposed methods for regression data sets.

Data | Method | Before Vali. | Before Test | After Vali. | After Test | Num. | Time [s]
Boston 14 | BD | 2.34±0.23 | 2.36±0.16* | 2.08±0.73 | 2.55±0.19 | 7.5±2.6 | 21±2.6
 | BABD | | | 2.28±0.21 | 2.51±0.18 | 7.3±2.3 | 29±9.5
 | BABD-LP | | | 1.96±0.85 | 2.56±0.18 | 7.5±2.6 | 48±10
 | BDu | 2.34±0.23 | 2.36±0.16 | 1.98±0.68* | 2.43±0.15 | 10.2±1.49 | 18.8±4.8
 | BABDu | | | 2.20±0.19* | 2.40±0.17 | 10.1±1.50 | 31.5±5.4
 | BABD-LPu | | | 1.64±0.96* | 2.40±0.17 | 10.2±1.56 | 42.9±7.9
Boston 5 | BD | 0.030±0.002 | 0.029±0.002 | 0.026±0.002* | 0.026±0.002* | 3.1±0.22 | 19±3.4
 | BABD | | | 0.026±0.002* | 0.027±0.003* | 3.1±0.21 | 23.6±5.5
 | BABD-LP | | | 0.026±0.002* | 0.026±0.002* | 3.1±0.22 | 41.6±6.5
 | BDu | 0.030±0.002 | 0.029±0.002 | 0.024±0.002* | 0.024±0.003* | 6.5±1.93 | 20.2±5.7
 | BABDu | | | 0.024±0.002* | 0.025±0.003* | 6.4±1.96 | 33.7±7.6
 | BABD-LPu | | | 0.023±0.006* | 0.026±0.003* | 6.5±2.29 | 41.2±9.3
Water Purification | BD | 0.975±0.064 | 0.971±0.039 | 0.958±0.066 | 0.992±0.040 | 4.7±1.6 | 13.5±4.4
 | BABD | | | 0.953±0.067* | 0.982±0.034 | 4.5±1.6 | 19.6±5.3
 | BABD-LP | | | 0.949±0.064* | 0.993±0.043 | 5.0±1.8 | 28.9±5.5
 | BDu | 0.975±0.064 | 0.971±0.039 | 0.948±0.066 | 0.984±0.043 | 6.0±1.64 | 14.5±4.0
 | BABDu | | | 0.973±0.075 | 1.002±0.044 | 5.8±1.95 | 20.9±3.4
 | BABD-LPu | | | 0.765±0.388* | 1.007±0.050 | 5.6±1.83 | 28.2±5.2
Pyrimidines | BD | 0.029±0.014 | 0.030±0.007 | 0.013±0.012* | 0.020±0.009* | 1.1±0.22 | 0.75±0.54
 | BABD | | | 0.011±0.009* | 0.018±0.010* | 1.0±0 | 0.50±0.59
 | BABD-LP | | | 0.013±0.011* | 0.020±0.009* | 1.1±0.22 | 0.35±0.48
 | BDu | 0.029±0.014 | 0.030±0.007 | 0.009±0.008* | 0.021±0.009* | 1.8±0.85 | 0.85±0.48
 | BABDu | | | 0.006±0.006* | 0.036±0.058 | 2.8±1.66 | 1.35±0.85
 | BABD-LPu | | | 0.007±0.009* | 0.023±0.012* | 1.6±0.91 | 0.35±0.47
Triazines | BD | 0.006±0.003 | 0.004±0.003 | 0.004±0.002* | 0.004±0.002 | 2.0±0.77 | 14.4±4.39
 | BABD | | * | 0.003±0.002* | 0.008±0.008 | 1.8±0.53 | 9.75±3.36
 | BABD-LP | | * | 0.004±0.002* | 0.009±0.016 | 2.0±0.84 | 3.65±1.53
 | BDu | 0.006±0.003 | 0.004±0.003 | 0.002±0.001* | 0.003±0.003 | 6.5±4.56 | 16.5±5.43
 | BABDu | | | 0.001±0.001* | 0.003±0.003 | 6.4±5.13 | 27.2±9.11
 | BABD-LPu | | * | 0.001±0.002* | 0.010±0.018 | 2.3±0.71 | 3.6±1.28
Phenetylamines | BD | 0.186±0.051 | 0.237±0.139 | 0.138±0.067* | 0.263±0.080 | 3.0±1.07 | 12.6±3.61
 | BABD | | | 0.147±0.033* | 0.312±0.210 | 2.0±0.57 | 0.75±0.76
 | BABD-LP | | | 0.139±0.057* | 0.235±0.115 | 2.3±0.90 | 0.30±0.64
 | BDu | 0.186±0.051 | 0.237±0.139 | 0.031±0.016* | 0.257±0.064 | 18±11.5 | 13.0±4.5
 | BABDu | | | 0.039±0.025* | 0.246±0.147 | 14±10.4 | 7.90±5.71
 | BABD-LPu | | | 0.016±0.028* | 0.239±0.080 | 6.5±2.6 | 1.00±0.63
Orange Juice | BD | 4.45±0.52 | 6.96±2.07 | 4.14±0.47 | 7.05±1.87 | 6.0±2.0 | 2337±2631
 | BABD | | | 4.29±0.49 | 7.78±3.47 | 6.0±2.0 | 931±1263
 | BABD-LP | | | 4.19±0.45 | 6.84±1.63 | 5.0±1.6 | 131±54.3
 | BDu | 4.45±0.52 | 6.96±2.07 | 3.21±0.32* | 6.18±1.61* | 77±91.9 | 5686±8937
 | BABDu | | | 3.40±0.44* | 6.26±1.63* | 46±57.7 | 2195±2901
 | BABD-LPu | | | 3.83±0.43* | 6.86±1.68 | 11±5.77 | 122±64

For the Orange juice problem, the standard deviations of the computation time for BD, BDu, BABD, and BABDu are considerably large. In the case of BD or BDu, for some data sets only a few input variables are deleted at a time. This slows down variable selection by more than a factor of ten compared with the other data sets. In the case of BABD or BABDu, although the number of useful variables is small, the approximation error is not reduced very much by block addition for some data sets. Therefore, block addition fails and much time is needed in BABD or BABDu. Figures 1(a) and (b) show, respectively, the approximation error and the number of selected variables as the computation proceeds, for the data set with the median computation time among the 20 data sets. From these figures, while keeping the approximation error lower than the initial error, the proposed methods can select useful input variables faster than BD.
According to our experiments, only for the Orange juice data set does the computation time of BABD and BABDu change considerably with the value of A. But for BABD-LP and BABD-LPu there is not much change in computation time.

Comparing the threshold updating method with the associated fixed threshold method, although the number of selected input variables increases, the approximation error of the validation set decreases for every data set and the test errors are comparable or better. From the standpoint of the approximation errors, the three variable selection methods are comparable.

Comparing the numbers of selected variables shown in Table 4 with those in the "Selected" column of Table 1, we find that except for the Boston 14 problem, the fixed threshold methods give an equal or smaller number of selected variables. Especially for the phenetylamines problem, the difference is significant.

5.6. Comparison with Other Methods

Now we compare our methods with the NLPSVM [9]. The NLPSVM is a classifier based on an imbedded method, in which feature selection is performed during training. The results of the NLPSVM are obtained from [9]; the training time was measured using a 400 MHz Pentium II machine and tenfold cross-validation was used. To compare our method with the NLPSVM, we do not use BABD-LP and BABD-LPu, because the numbers of features of the three problems that we use are relatively small and preselection by the LPSVR would incur additional training time. Since the NLPSVM is a classifier, we modify BD, BDu, BABD, and BABDu so that they handle pattern classification problems. We use fivefold cross-validation, and the training time is measured on a workstation (3.6 GHz, 2GB memory, Linux operating system).

Table 5 shows the results. In all cases we use linear kernels. The "Before" and "After" columns list the recognition rates of the validation data set before and after feature selection, respectively. The "Num." column shows the number of features selected.

Table 5. Performance comparison with the NLPSVM for classification data sets.

Data | Method | Before [%] | After [%] | Num. | Time [s]
Ionosphere | NLPSVM | － | 88.0 | 11.2 | 2.4
 | BD | 87.5 | 88.6 | 2 | 72
 | BABD | 87.5 | 88.6 | 2 | 164
 | BDu | 87.5 | 89.2 | 11 | 79
 | BABDu | 87.5 | 88.3 | 9 | 165
BUPA Liver | NLPSVM | － | 68.8 | 4.8 | 1.13
 | BD | 71.0 | 71.0 | 6 | 4
 | BABD | 71.0 | 71.0 | 6 | 9
 | BDu | 71.0 | 71.0 | 6 | 4
 | BABDu | 71.0 | 71.0 | 6 | 9
Pima Indians | NLPSVM | 77.2 | 77.1 | 4.9 | 1.07
 | BD | 77.2 | 77.3 | 3 | 120
 | BABD | 77.2 | 77.3 | 3 | 200
 | BDu | 77.2 | 77.9 | 6 | 111
 | BABDu | 77.2 | 77.5 | 4 | 150

For the ionosphere problem, the numbers of features selected by BD and BABD are only two, whereas by the NLPSVM it is 11.2, and the recognition rates by BD and BABD are higher. By BDu, the number of features increases to 11, which is comparable to that by the NLPSVM, but the recognition rate is higher by 1.2%. By BABDu, the recognition rate is lower than those by BD, BDu, and BABD, but still higher than that by the NLPSVM.

For the Bupa liver problem, none of BD, BDu, BABD, and BABDu deletes any features, because doing so would lower the recognition rate. If we delete one more feature (feature 6), violating the threshold, the recognition rate is 68.7%, which is close to that by the NLPSVM.

For the Pima Indians problem, the numbers of features selected by BD and BABD are three and the recognition rate is slightly higher than that by the NLPSVM. By BDu the number of selected features is six, about one feature more than by the NLPSVM, but the recognition rate is 0.8% higher. And by BABDu, the number of selected features is four and the recognition rate is higher than that by the NLPSVM. Therefore, for the Bupa liver problem, the feature selection performance of BD, BDu, BABD, and BABDu is comparable with that of the NLPSVM, and for the ionosphere and Pima Indians problems it is better.

The weak point of the proposed methods compared to the NLPSVM is training time. Although the measuring conditions and computers used are different, if the computation time for the NLPSVM includes all the time for obtaining the results, the proposed methods are much slower for these problems.

5.7. Discussions

From the experiments with the fixed threshold methods, BD, BABD, and BABD-LP are shown to have almost the same variable selection ability in terms of the number of deleted variables. As for the computation time, BD and BABD are efficient for the low-dimensional data sets. On the other hand, BABD-LP can perform high-speed variable selection for the high-dimensional data sets. From these results, we can conclude that compared to BD, BABD and BABD-LP can delete almost the same number of input variables with shorter computation time.

From the experiments with the threshold updating method, which focuses on improving the approximation accuracy, BABD and BABD-LP can delete redundant input variables for the Mackey-Glass data set with added noisy variables. Furthermore, for the other data sets, the approximation errors of the validation set and the test data set are improved by variable selection and are better than those of the fixed threshold methods. The approximation accuracies of BABD and BABD-LP with the threshold updating method are nearly equal, but BABD-LP completes variable selection faster than BABD for the high-dimensional data sets.

For the proposed methods, we introduce the user parameter A, which determines the maximum number of variables for addition. We set A = 3 for the low-dimensional data sets, which have less than a hundred input variables, and A = 5 otherwise.
From the experiments, if 2^A does not exceed the number of input variables, the computation time does not change very much for the data sets with less than a hundred input variables. But for the orange juice data set, the computation time changes significantly with the value of A. Therefore, in some cases, we need to set A appropriately.

For BABD-LP with the fixed threshold and threshold updating methods, deleting variables with zero weights works very well for the benchmark data sets with a large number of inputs. If we want to delete more variables, we may delete the variables whose weights satisfy

|w_j| ≤ θ max_{i = 1, …, m} |w_i|,  j = 1, …, m,  (13)

where θ is a user-defined parameter that determines the rate of deletion by the LPSVR.

6. Conclusions

In this paper, we proposed two variable selection methods and the threshold updating method. The first method selects variables by block addition and block deletion. First, we temporarily add each input variable to the empty set and generate a variable ranking in ascending order of the approximation errors. We add several variables at a time until the approximation error is smaller than the threshold value. After the addition is finished, we delete redundant variables from the variable set, several at a time, so long as the approximation error is below the threshold value. The second method preselects variables by an LPSVR before block addition and block deletion. After training the LPSVR, we delete the input variables with zero weights. Then we compare the approximation error of the selected set of variables with the threshold. If the error is smaller than the threshold value, we delete variables by block deletion. Otherwise, we add variables by block addition and then delete variables by block deletion. In the above methods, the threshold is either fixed to the approximation error evaluated using the initial set of variables or updated during variable selection.
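The two building blocks just summarized, weight-based preselection in the spirit of (13) and deletion of variables whose removal keeps the error below the threshold, can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the weight vector and the toy error function are made up, and deletion here is tested one variable at a time rather than in blocks.

```python
# Sketch of LPSVR-weight preselection per Eq. (13) followed by a simplified
# deletion pass.  `weights` would come from a trained linear-kernel LPSVR;
# `validation_error` stands in for retraining and evaluating on validation data.
import numpy as np

def preselect_by_weights(weights, theta):
    """Keep variable j only if |w_j| > theta * max_i |w_i|; Eq. (13) deletes the rest."""
    w = np.abs(np.asarray(weights, dtype=float))
    return [j for j in range(len(w)) if w[j] > theta * w.max()]

def backward_deletion(variables, validation_error, threshold):
    """Rank variables by the error after temporarily deleting each one, then
    delete those whose removal keeps the error at or below the threshold.
    (The paper deletes several variables at a time; here one at a time.)"""
    variables = list(variables)
    ranked = sorted(variables,
                    key=lambda j: validation_error([v for v in variables if v != j]))
    for j in ranked:
        trial = [v for v in variables if v != j]
        if trial and validation_error(trial) <= threshold:
            variables = trial
    return variables

weights = [0.9, 0.0, 0.02, 0.4, 0.0]
kept = preselect_by_weights(weights, theta=0.05)
print(kept)  # [0, 3]: |w| = 0.02 falls below 0.05 * 0.9 and is deleted

def toy_error(vs):  # low error only while both relevant variables 0 and 3 remain
    return 0.01 if {0, 3} <= set(vs) else 0.2

print(backward_deletion(kept, toy_error, threshold=0.05))  # [0, 3]
```

Here neither surviving variable can be deleted, since removing either one pushes the toy error above the threshold.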
The computer experiments using benchmark data sets showed that the proposed methods could finish variable selection faster than the block deletion method while the approximation accuracy of the selected set of variables remains comparable with that of the initial set of variables. In addition, the threshold updating method could improve the approximation accuracy over the fixed threshold method.

REFERENCES

[1] V. N. Vapnik, "Statistical Learning Theory," John Wiley & Sons, New York, 1998.
[2] S. Abe, "Support Vector Machines for Pattern Classification," 2nd Edition, Springer-Verlag, New York, 2010.
[3] K.-R. Müller, A. J. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen and V. Vapnik, "Predicting Time Series with Support Vector Machines," In: W. Gerstner, A. Germond, M. Hasler and J. D. Nicoud, Eds., Proceedings of the 7th International Conference on Artificial Neural Networks (ICANN '97), Springer-Verlag, Berlin, 1997, pp. 999-1004.
[4] J. A. K. Suykens, "Least Squares Support Vector Machines for Classification and Nonlinear Modeling," Neural Network World, Vol. 10, No. 1-2, 2000, pp. 29-47.
[5] V. Kecman, T. Arthanari and I. Hadzic, "LP and QP Based Learning from Empirical Data," Proceedings of International Joint Conference on Neural Networks (IJCNN 2001), Washington, DC, Vol. 4, 2001, pp. 2451-2455.
[6] G. M. Fung and O. L. Mangasarian, "A Feature Selection Newton Method for Support Vector Machine Classification," Computational Optimization and Applications, Vol. 28, No. 2, 2004, pp. 185-202.
[7] S. D. Stearns, "On Selecting Features for Pattern Classifiers," Proceedings of International Conference on Pattern Recognition, Coronado, 1976, pp. 71-75.
[8] P. Pudil, J. Novovičová and J. Kittler, "Floating Search Methods in Feature Selection," Pattern Recognition Letters, Vol. 15, No. 11, 1994, pp. 1119-1125.
[9] J. Bi, K. P. Bennett, M. Embrechts, C. Breneman and M. Song, "Dimensionality Reduction via Sparse Support Vector Machines," Journal of Machine Learning Research, Vol. 3, No. 7-8, 2003, pp. 1229-1243.
[10] T. Nagatani and S. Abe, "Backward Variable Selection of Support Vector Regressors by Block Deletion," Proceedings of International Joint Conference on Neural Networks (IJCNN 2007), Orlando, FL, 2007, pp. 1540-1545.
[11] I. Guyon, J. Weston, S. Barnhill and V. Vapnik, "Gene Selection for Cancer Classification Using Support Vector Machines," Machine Learning, Vol. 46, No. 1-3, 2002, pp. 389-422.
[12] S. Abe, "Neural Networks and Fuzzy Systems: Theory and Applications," Kluwer, 1997.
[13] D. Harrison and D. L. Rubinfeld, "Hedonic Prices and the Demand for Clean Air," Journal of Environmental Economics and Management, Vol. 5, 1978, pp. 81-102.
[14] Delve Datasets, http://www.cs.toronto.edu/~delve/data/datasets.html
[15] D. François, F. Rossi, V. Wertz and M. Verleysen, "Resampling Methods for Parameter-Free and Robust Feature Selection with Mutual Information," Neurocomputing, Vol. 70, No. 7-9, 2007, pp. 1276-1288.
[16] A. Asuncion and D. J. Newman, "UCI Machine Learning Repository," 2007. http://www.ics.uci.edu/~mlearn/MLRepository.html
[17] L. Song, A. Smola, A. Gretton and K. M. Borgwardt, "Supervised Feature Selection via Dependence Estimation," NIPS 2006 Workshop on Causality and Feature Selection, Vol. 227, 2007.
[18] "Milano Chemometrics and QSAR Research Group," http://michem.disat.unimib.it/chm/download/download.htm
[19] A. Rakotomamonjy, "Analysis of SVM Regression Bounds for Variable Ranking," Neurocomputing, Vol. 70, No. 7-9, 2007, pp. 1489-1501.
[20] "UCL Machine Learning Group," http://www.ucl.ac.be/mlg/index.php?page=home
[21] F. Rossi, A. Lendasse, D. François, V. Wertz and M. Verleysen, "Mutual Information for the Selection of Relevant Variables in Spectrometric Nonlinear Modeling," Chemometrics and Intelligent Laboratory Systems, Vol. 80, No. 2, 2006, pp. 215-226.