Considering the problem of feature selection in the linear regression model, a new method called LqCP is proposed to simultaneously select variables and favor a grouping effect, where strongly correlated predictors tend to be in or out of the model together. LqCP is based on penalized least squares with a penalty function that combines the $L_q$ ($0 < q < 1$) norm and a correlation-based penalty (CP) norm. It can shrink some coefficients to exactly zero. Additionally, the CP term links the strength of penalization to the correlation among predictors. Simulation studies show the advantages of LqCP as the number of noise variables increases and in the case of $p > n$. In addition, a simulation on grouped variable selection is performed. Finally, the model is applied to two real data sets: US Crime Data and Gasoline Data. In terms of prediction error and estimation error, empirical studies show the efficiency of LqCP.
The usual linear regression model considered in this paper is given by:
$$y = X\beta + \epsilon, \tag{1}$$
where $y_{n \times 1}$ is the vector of observations, $\beta_{p \times 1}$ is a vector of unknown parameters to be estimated, $X_{n \times p}$ is an $n \times p$ matrix of $p$ predictor vectors with $n$ observations, and $\epsilon$ is a random error vector with $E(\epsilon) = 0$, whose components are often assumed to be independent and normally distributed. Ordinary least squares (OLS) estimates, obtained by minimizing the residual sum of squares, are very common. In general, the quality of a model is evaluated from two aspects. One is prediction accuracy on test data and the other is the simplicity of the selected model. In other words, fewer variables should be selected when the prediction performance is the same. Variable selection is necessary especially when the number of predictors is large. There are many applications using variable selection to solve problems like knowledge discovery with high-dimensional data source [
In recent years, regularization methods have attracted great attention. They are used in applications such as machine learning, denoising, inpainting, deblurring, compressed sensing, source separation and more. Generally, the penalized least squares objective with squared loss is:
$$L(\beta; \lambda) = \|y - X\beta\|_2^2 + \lambda \|\beta\|_q^q, \tag{2}$$
where $\lambda > 0$ is a penalty parameter, $q > 0$ is considered in this paper, and $\|\beta\|_q^q = \sum_{j=1}^{p} |\beta_j|^q$. When $q = 1$, it becomes the Lasso procedure. Minimizing the objective function is called ridge regression when $q = 2$. As a continuous shrinkage method, ridge regression achieves its better prediction performance through a bias-variance trade-off. However, ridge regression cannot produce a parsimonious model: all the predictors are kept in the model. Lasso is proposed by [
Although Lasso is a popular method for variable selection, it still has several drawbacks. The first is the lack of the oracle property. The oracle property means that the probability of selecting the right set of nonzero coefficients converges to one, and the estimators of the nonzero coefficients are asymptotically normal with the same means and covariances as if the zero coefficients were known in advance. Fan and Li (2001) [
In the same spirit, other penalty-based methods exist for handling grouping effects. Penalized least squares combining the $L_1$ and $L_\infty$ norms, named OSCAR, is presented by Bondell and Reich (2008) [
In this paper, motivated by sparsity and the grouping effect, especially when the pairwise correlations are very high, a new regularization procedure called LqCP is proposed in the linear regression setting. It combines the $L_q$ ($0 < q < 1$) norm and the CP penalty. Similar to the L1CP method, LqCP also performs automatic variable selection and tends to select or remove highly correlated variables together. Section 2 first introduces the properties of the elastic net and the adaptive elastic net, and then presents LqCP, its algorithm and the method of choosing regularization parameters. In Section 3, simulation studies report the estimation mean squared errors under different circumstances to compare the parameter estimation performance of LqCP, elastic net (ENET), adaptive elastic net (AENET) and L1CP. Section 4 presents application examples that show the prediction accuracy of the models. The conclusion of this paper is given in Section 5.
The form of the elastic net mentioned above is given first. The naive elastic net estimator $\hat{\beta}(\text{naive})$ is the minimizer of:
$$\hat{\beta}(\text{naive}) = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2. \tag{3}$$
This method is called the naive elastic net, which overcomes the limitations of Lasso in the case of $p > n$. The penalty combining the $L_1$ norm and the $L_2$ norm has been proved to have a grouping effect in Zou and Hastie (2005) [
In a similar way to Lasso, the elastic net does not enjoy the oracle property. Combining the property of the adaptive Lasso, Zou and Zhang [
$$\hat{\beta}(\text{Adaenet}) = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_1 \sum_{j=1}^{p} \hat{\omega}_j |\beta_j| + \lambda_2 \|\beta\|_2^2, \tag{4}$$
where $\hat{\omega}_j = (|\hat{\beta}_j|)^{-\gamma}$, $j = 1, 2, \cdots, p$, and $\gamma$ is a positive constant. Choosing the initial weight is crucial in the adaptive elastic net. Zou and Zhang [
In the context of linear regression problems, the following penalty added to the residual sum of squares is considered. The $L_q$ penalty on $\beta$ is defined as
$$\|\beta\|_q^q = \sum_{j=1}^{p} |\beta_j|^q. \tag{5}$$
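As a small illustration of (5), the following sketch evaluates the penalty for several values of $q$. The function name `lq_penalty` and the example vector are illustrative choices, not part of the original method description.

```python
def lq_penalty(beta, q):
    """Return sum_j |beta_j|**q, the L_q penalty of (5), for q > 0."""
    if q <= 0:
        raise ValueError("q must be positive")
    return sum(abs(b) ** q for b in beta)


# Illustrative coefficient vector (hypothetical values).
beta = [3.0, -2.0, 1.5, 0.0, 0.0]

l1 = lq_penalty(beta, 1.0)      # q = 1: sum of absolute values (Lasso penalty)
l2 = lq_penalty(beta, 2.0)      # q = 2: sum of squares (ridge penalty)
l_half = lq_penalty(beta, 0.5)  # 0 < q < 1: the non-convex case used by LqCP
```

Zero coefficients contribute nothing to the penalty for any $q > 0$, while for $0 < q < 1$ small nonzero coefficients are penalized relatively more heavily than under $q = 1$.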
When $q = 0$, the corresponding penalty is discontinuous at the origin and consequently is not easy to compute. Thus $q > 0$ is used in this paper. Least squares subject to the $L_q$ penalty with $0 < q < 1$ was first studied by Frank and Friedman (1993) [
So, given the better sparsity property of $L_q$ ($0 < q < 1$) and the case of highly correlated variables, the model is defined by
$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_q^q + \lambda_2 P_c(\beta), \tag{6}$$
where
$$P_c(\beta) = \sum_{j=1}^{p} \sum_{i > j} \left\{ \frac{(\beta_i - \beta_j)^2}{1 - \rho_{ij}} + \frac{(\beta_i + \beta_j)^2}{1 + \rho_{ij}} \right\}. \tag{7}$$
Here $0 < q \le 1$, $\lambda_1, \lambda_2$ are positive constants and $\rho_{ij}$ denotes the (empirical) correlation between the $i$th and $j$th predictors. The penalty $P_c(\beta)$ was introduced by Tutz and Ulbricht (2009) [
As introduced above, the $L_q$ penalty added to the residual sum of squares yields better sparsity, while the correlation-based penalty $P_c(\beta)$ encourages a grouping effect for highly correlated variables. In fact, it is easy to see that for strong positive correlation ($\rho_{ij} \approx 1$) in (7), the first term becomes dominant, so that the estimates for $\beta_i$ and $\beta_j$ are similar ($\hat{\beta}_i \approx \hat{\beta}_j$). For strong negative correlation ($\rho_{ij} \approx -1$), the second term becomes dominant and $\hat{\beta}_i$ will be close to $-\hat{\beta}_j$. The effect is grouping: highly correlated predictors show comparable absolute values of estimates ($|\hat{\beta}_i| \approx |\hat{\beta}_j|$), with the sign being determined by positive or negative correlation.
Moreover, assuming that $\rho_{ij}^2 \neq 1$ for $i \neq j$, the penalty (7) can be written as a simple quadratic form $P_c(\beta) = \beta^T W \beta$, where $W = (w_{ij})$, $1 \le i, j \le p$, is a positive definite matrix with general term
$$w_{ij} = \begin{cases} 2 \sum_{s \neq i} \dfrac{1}{1 - \rho_{is}^2} & \text{if } i = j \\[2ex] -\dfrac{2 \rho_{ij}}{1 - \rho_{ij}^2} & \text{if } i \neq j \end{cases} \tag{8}$$
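The equivalence between the pairwise form (7) and the quadratic form with $W$ from (8) can be checked numerically. The sketch below assumes NumPy; the helper names `cp_penalty` and `cp_matrix` and the random test data are hypothetical and only serve the check.

```python
import numpy as np


def cp_penalty(beta, R):
    """Evaluate P_c(beta) in (7) by summing over all pairs i > j."""
    p = len(beta)
    total = 0.0
    for j in range(p):
        for i in range(j + 1, p):
            total += (beta[i] - beta[j]) ** 2 / (1.0 - R[i, j])
            total += (beta[i] + beta[j]) ** 2 / (1.0 + R[i, j])
    return total


def cp_matrix(R):
    """Build W of (8) from a correlation matrix R with |rho_ij| < 1 for i != j."""
    p = R.shape[0]
    off = ~np.eye(p, dtype=bool)
    inv = np.zeros((p, p))
    inv[off] = 1.0 / (1.0 - R[off] ** 2)
    W = -2.0 * R * inv                           # off-diagonal entries of (8)
    np.fill_diagonal(W, 2.0 * inv.sum(axis=1))   # diagonal entries of (8)
    return W


# Illustrative data: empirical correlations of 4 random predictors.
rng = np.random.default_rng(0)
R = np.corrcoef(rng.standard_normal((50, 4)), rowvar=False)
beta = np.array([1.0, -2.0, 0.5, 3.0])

W = cp_matrix(R)
quad = float(beta @ W @ beta)   # quadratic form beta^T W beta
direct = cp_penalty(beta, R)    # pairwise sum (7)
```

In this sketch `quad` and `direct` agree up to floating-point error, and the eigenvalues of `W` are positive, in line with the positive definiteness stated above.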
Putting $\gamma = \lambda_2 / (\lambda_1 + \lambda_2)$, the optimization problem (6) is equivalent to
$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2, \quad \text{s.t. } (1 - \gamma) \|\beta\|_q^q + \gamma P_c(\beta) \le \nu \ \text{ for } \nu \ge 0. \tag{9}$$
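Proof of (9): In fact, the penalty $Q(\beta) = \lambda_1 \|\beta\|_q^q + \lambda_2 P_c(\beta)$ can be written as follows:
$$Q(\beta) = \lambda_1 \|\beta\|_q^q + \lambda_2 P_c(\beta) = (\lambda_1 + \lambda_2) \left\{ \frac{\lambda_1}{\lambda_1 + \lambda_2} \|\beta\|_q^q + \frac{\lambda_2}{\lambda_1 + \lambda_2} P_c(\beta) \right\} = \lambda \left\{ (1 - \gamma) \|\beta\|_q^q + \gamma P_c(\beta) \right\}, \tag{10}$$
where $\lambda = \lambda_1 + \lambda_2$. So, the problem (6) is equivalent to finding
$$\hat{\beta}(\lambda, \gamma) = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \left\{ (1 - \gamma) \|\beta\|_q^q + \gamma P_c(\beta) \right\}, \tag{11}$$
which is equivalent to the optimization problem (9).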
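The reparametrization from $(\lambda_1, \lambda_2)$ to $(\lambda, \gamma)$ used in (10) and (11) can be sketched as follows. The function names and the numerical values are illustrative only.

```python
def to_lambda_gamma(lam1, lam2):
    """Map (lambda_1, lambda_2) to (lambda, gamma) as in (10)."""
    lam = lam1 + lam2
    return lam, lam2 / lam


def penalty_two_forms(lq_term, cp_term, lam1, lam2):
    """Return Q(beta) in its original and rescaled forms from (10)."""
    lam, gamma = to_lambda_gamma(lam1, lam2)
    q_original = lam1 * lq_term + lam2 * cp_term
    q_rescaled = lam * ((1.0 - gamma) * lq_term + gamma * cp_term)
    return q_original, q_rescaled


# Hypothetical penalty values ||beta||_q^q = 4.0 and P_c(beta) = 2.5.
lam, gamma = to_lambda_gamma(0.3, 0.9)
q_original, q_rescaled = penalty_two_forms(4.0, 2.5, 0.3, 0.9)
```

Both forms give the same penalty value, and $\gamma$ lies in $(0, 1)$ whenever $\lambda_1$ and $\lambda_2$ are positive.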
The estimators of $\beta$ can be computed via the cyclic descent algorithm for the $l_q$ sparsity penalized linear regression problem [
Indeed, the optimization problem (6) can be written as
$$\hat{\beta}(\lambda_1, \lambda_2) = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_q^q + \lambda_2 \beta^T W \beta, \tag{12}$$
where $W$, defined by (8), is a real symmetric positive-definite square matrix, assuming that $\rho_{ij}^2 \neq 1$, with Cholesky decomposition $W = L L^T$ and $L = W^{1/2}$, which always exists. Now, let
$$X^*_{(n+p) \times p} = \begin{pmatrix} X \\ \sqrt{\lambda_2}\, L^T \end{pmatrix}, \qquad y^*_{(n+p)} = \begin{pmatrix} y \\ 0 \end{pmatrix}. \tag{13}$$
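The LqCP estimator is defined as
$$\hat{\beta} = \arg\min_{\beta} \|y^* - X^* \beta\|_2^2 + \lambda_1 \|\beta\|_q^q. \tag{14}$$
Let
$$\begin{aligned} J(\beta) &= \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_q^q + \lambda_2 \beta^T W \beta \\ &= \left[ y^T y + \beta^T X^T X \beta - 2 y^T X \beta \right] + \lambda_2 \beta^T W \beta + \lambda_1 \|\beta\|_q^q \\ &= \left[ y^T y + \beta^T (X^T X + \lambda_2 L L^T) \beta - 2 y^T X \beta \right] + \lambda_1 \|\beta\|_q^q \\ &= \|y^* - X^* \beta\|_2^2 + \lambda_1 \|\beta\|_q^q. \end{aligned} \tag{15}$$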
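The identity in (15) can be verified numerically with a short sketch. It assumes NumPy; the function name `augment`, the random data and the stand-in positive-definite matrix `W` are hypothetical, and any symmetric positive-definite $W$ (for instance the one built from (8)) could be used instead.

```python
import numpy as np


def augment(X, y, W, lam2):
    """Return (X*, y*) of (13): sqrt(lam2) * L^T stacked under X, zeros under y."""
    L = np.linalg.cholesky(W)                      # lower triangular, W = L @ L.T
    X_aug = np.vstack([X, np.sqrt(lam2) * L.T])
    y_aug = np.concatenate([y, np.zeros(X.shape[1])])
    return X_aug, y_aug


rng = np.random.default_rng(1)
n, p, lam2 = 20, 3, 0.7
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)

# Stand-in symmetric positive-definite matrix, used only for this check.
A = rng.standard_normal((p, p))
W = A @ A.T + p * np.eye(p)

X_aug, y_aug = augment(X, y, W, lam2)
lhs = np.sum((y_aug - X_aug @ beta) ** 2)                    # ||y* - X* beta||^2
rhs = np.sum((y - X @ beta) ** 2) + lam2 * beta @ W @ beta   # loss plus CP term
```

The two quantities `lhs` and `rhs` coincide up to rounding, so the correlation-based penalty is absorbed into an ordinary least squares loss on the augmented data.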
Note that the sample size in the augmented problem is $n + p$ and $X^*$ has rank $p$. As described in the paper of Marjanovic and Solo [
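$$\begin{aligned} J(\beta) &= \|y^* - X^* \beta\|_2^2 + \lambda_1 \|\beta\|_q^q \\ &= \left[ \sum_{i=1}^{n+p} \Big( y_i^* - \sum_{j \neq k} x_{ij}^* \beta_j - x_{ik}^* \beta_k \Big)^2 \right] + \lambda_1 \sum_{j \neq k} |\beta_j|^q + \lambda_1 |\beta_k|^q \\ &= \left[ \sum_{i=1}^{n+p} \Big( y_i^* - \sum_{j \neq k} x_{ij}^* \beta_j \Big)^2 + \sum_{i=1}^{n+p} x_{ik}^{*2} \beta_k^2 - 2 \sum_{i=1}^{n+p} \Big( y_i^* - \sum_{j \neq k} x_{ij}^* \beta_j \Big) x_{ik}^* \beta_k \right] + \lambda_1 \sum_{j \neq k} |\beta_j|^q + \lambda_1 |\beta_k|^q \\ &= \sum_{i=1}^{n+p} x_{ik}^{*2} \left[ \beta_k^2 + \frac{\sum_{i=1}^{n+p} \big( y_i^* - \sum_{j \neq k} x_{ij}^* \beta_j \big)^2}{\sum_{i=1}^{n+p} x_{ik}^{*2}} - 2 \frac{\sum_{i=1}^{n+p} \big( y_i^* - \sum_{j \neq k} x_{ij}^* \beta_j \big) x_{ik}^* \beta_k}{\sum_{i=1}^{n+p} x_{ik}^{*2}} \right] + \lambda_1 \sum_{j \neq k} |\beta_j|^q + \lambda_1 |\beta_k|^q \\ &= \sum_{i=1}^{n+p} x_{ik}^{*2} \left[ \beta_k^2 - 2 z_k \beta_k + z_k^2 \right] + \lambda_1 |\beta_k|^q + C \end{aligned} \tag{16}$$
Here, the related quantities are defined as
$$z_k = \frac{\sum_{i=1}^{n+p} \big( y_i^* - \sum_{j \neq k} x_{ij}^* \beta_j \big) x_{ik}^*}{\sum_{i=1}^{n+p} x_{ik}^{*2}}, \quad C = \sum_{i=1}^{n+p} \Big( y_i^* - \sum_{j \neq k} x_{ij}^* \beta_j \Big)^2 - \frac{\big[ \sum_{i=1}^{n+p} \big( y_i^* - \sum_{j \neq k} x_{ij}^* \beta_j \big) x_{ik}^* \big]^2}{\sum_{i=1}^{n+p} x_{ik}^{*2}} + \lambda_1 \sum_{j \neq k} |\beta_j|^q, \quad \lambda^* = \frac{\lambda_1}{\sum_{i=1}^{n+p} x_{ik}^{*2}},$$
where all sums over $i$ run over the $n + p$ rows of the augmented data. Then, since $C$ does not depend on $\beta_k$,
$$\min_{\beta_k} J(\beta) = \min_{\beta_k} \sum_{i=1}^{n+p} x_{ik}^{*2} \left[ (z_k - \beta_k)^2 + \lambda^* |\beta_k|^q \right] + C. \tag{17}$$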
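The coordinate-wise decomposition in (16) and (17) can be checked numerically. The following sketch assumes NumPy; `coordinate_terms` is a hypothetical helper, and the random arrays merely stand in for the augmented data $(X^*, y^*)$.

```python
import numpy as np


def coordinate_terms(X_aug, y_aug, beta, k, lam1, q):
    """Return (s_k, z_k, lam_star, C) for coordinate k, as defined after (16)."""
    x_k = X_aug[:, k]
    # Residual with the contribution of coordinate k removed.
    partial = y_aug - X_aug @ beta + x_k * beta[k]
    s_k = x_k @ x_k                                  # sum_i x_ik^2
    z_k = (partial @ x_k) / s_k
    others = np.delete(np.abs(beta), k)
    C = partial @ partial - (partial @ x_k) ** 2 / s_k + lam1 * np.sum(others ** q)
    return s_k, z_k, lam1 / s_k, C


# Random stand-ins for the augmented data (illustrative only).
rng = np.random.default_rng(2)
X_aug = rng.standard_normal((12, 4))
y_aug = rng.standard_normal(12)
beta = np.array([0.8, 0.0, -1.3, 0.4])
lam1, q, k = 0.6, 0.5, 2

s_k, z_k, lam_star, C = coordinate_terms(X_aug, y_aug, beta, k, lam1, q)
J_full = np.sum((y_aug - X_aug @ beta) ** 2) + lam1 * np.sum(np.abs(beta) ** q)
J_coord = s_k * ((z_k - beta[k]) ** 2 + lam_star * abs(beta[k]) ** q) + C
```

Here `J_full` evaluates the objective directly and `J_coord` evaluates the last line of (16); the two agree, which confirms that only the scalar problem in $\beta_k$ has to be solved at each step.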
Through this procedure, solving (12) has been transformed into solving (15). The algorithm is stated below:
Initialize with $\beta^1$ and calculate the initial residual $r^1 := y^* - X^* \beta^1$. Denote the iteration counter by $m$ and the coefficient index of an iterate by $k$. Set $m = k = 1$, and:
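(a) Calculate the adjusted gradient $z_k^m := z(\beta_{-k}^m)$ by:
$$z_k^m = \frac{x_{(k)}^{*T} (y^* - X^* \beta_{-k}^m)}{x_{(k)}^{*T} x_{(k)}^*} = \frac{x_{(k)}^{*T} (y^* - X^* \beta^m + x_{(k)}^* \beta_k^m)}{x_{(k)}^{*T} x_{(k)}^*} = \frac{x_{(k)}^{*T} r^m}{x_{(k)}^{*T} x_{(k)}^*} + \beta_k^m \tag{18}$$
where $x_{(k)}^*$ denotes the $k$-th column of $X^*$ and $\beta_{-k}^m$ is $\beta^m$ with its $k$-th entry set to zero.
(b) Use (a) to compute the map:
$$\tau(z_k^m, \beta_k^m) := \begin{cases} 0 & \text{if } |z_k^m| < h_{\lambda^*, q} \\ \operatorname{sign}(z_k^m)\, b_{\lambda^*, q}\, I(\beta_k^m \neq 0) & \text{if } |z_k^m| = h_{\lambda^*, q} \\ \operatorname{sign}(z_k^m)\, \bar{\beta} & \text{if } |z_k^m| > h_{\lambda^*, q} \end{cases} \tag{19}$$
where $b_{\lambda^*, q} = [2 \lambda^* (1 - q)]^{\frac{1}{2 - q}}$, $h_{\lambda^*, q} = b_{\lambda^*, q} + \lambda^* q\, b_{\lambda^*, q}^{q - 1}$ and $\bar{\beta} > 0$ satisfies $\bar{\beta} + \lambda^* q \bar{\beta}^{q - 1} = |z_k^m|$. There are two solutions to this equation and $\bar{\beta} \in (b_{\lambda^*, q}, |z_k^m|)$ is the larger one. It can be computed from the following iteration, started at any $b_{\lambda^*, q} \le \bar{\beta}^{(0)} \le |z_k^m|$:
$$\bar{\beta}^{(t+1)} = \rho(\bar{\beta}^{(t)}) \quad \text{where} \quad \rho(\bar{\beta}) = |z_k^m| - \lambda^* q \bar{\beta}^{q - 1} \tag{20}$$
(c) Update $\beta^m$ by:
$$\beta^{m+1} = \beta_{-k}^m + \tau(z_k^m, \beta_k^m)\, e_k \tag{21}$$
where $e_k$ has a 1 in the $k$-th position and 0's in the rest.
(d) Update the residual $r^m$ with:
$$\begin{aligned} r^{m+1} &= y^* - X^* \beta^{m+1} = y^* - X^* (\beta_{-k}^m + \beta_k^{m+1} e_k) \\ &= (y^* - X^* \beta_{-k}^m - \beta_k^m x_{(k)}^*) + \beta_k^m x_{(k)}^* - \beta_k^{m+1} X^* e_k \\ &= r^m - (\beta_k^{m+1} - \beta_k^m)\, x_{(k)}^* \end{aligned} \tag{22}$$
(e) Update the iteration counter $m$ by $m = m + 1$.
(f) Update the coefficient index $k$ by:
$$k = \begin{cases} p & \text{if } m \equiv 0 \pmod{p} \\ m \bmod p & \text{otherwise} \end{cases} \tag{23}$$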
(g) Go to (a)
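Steps (a) to (g) can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than a reference implementation: it assumes NumPy, assumes the objective is scaled as $\frac{1}{2}\|y^* - X^*\beta\|_2^2 + \lambda_1 \|\beta\|_q^q$ (the scaling under which the constants in (19) are the exact thresholds for $\lambda^* = \lambda_1 / \sum_i x_{ik}^{*2}$), starts from $\beta = 0$ unless `beta0` is given, and checks convergence once per full sweep over the coordinates. The names `lq_threshold` and `lq_cyclic_descent` are hypothetical.

```python
import numpy as np


def lq_threshold(z, beta_k, lam, q, n_iter=100):
    """Thresholding map tau of (19) for 0 < q < 1.

    Assumption: the scalar objective is 0.5 * (z - b)**2 + lam * |b|**q.
    """
    b = (2.0 * lam * (1.0 - q)) ** (1.0 / (2.0 - q))
    h = b + lam * q * b ** (q - 1.0)
    if abs(z) < h:
        return 0.0
    if abs(z) == h:
        return float(np.sign(z)) * b if beta_k != 0 else 0.0
    bar = abs(z)                       # starting point inside [b, |z|]
    for _ in range(n_iter):            # fixed-point iteration (20)
        bar = abs(z) - lam * q * bar ** (q - 1.0)
    return float(np.sign(z)) * bar


def lq_cyclic_descent(X_aug, y_aug, lam1, q, beta0=None, tol=1e-4, max_sweeps=500):
    """Cyclic descent over the coordinates k = 1, ..., p, following (a)-(g)."""
    p = X_aug.shape[1]
    beta = np.zeros(p) if beta0 is None else np.asarray(beta0, dtype=float).copy()
    r = y_aug - X_aug @ beta                      # initial residual
    col_sq = np.sum(X_aug ** 2, axis=0)           # x_(k)^T x_(k) for each column
    for _ in range(max_sweeps):
        beta_old = beta.copy()
        for k in range(p):                        # cyclic index update, cf. (23)
            x_k = X_aug[:, k]
            z = x_k @ r / col_sq[k] + beta[k]     # step (a), eq. (18)
            new = lq_threshold(z, beta[k], lam1 / col_sq[k], q)   # step (b)
            r -= (new - beta[k]) * x_k            # step (d), eq. (22)
            beta[k] = new                         # step (c), eq. (21)
        if np.linalg.norm(beta - beta_old) <= tol:
            break
    return beta


# Toy check with an identity design, so that z_k equals y_k exactly.
y_toy = np.array([3.0, -2.0, 0.1, 0.0])
beta_hat = lq_cyclic_descent(np.eye(4), y_toy, lam1=0.5, q=0.5)
```

In the toy check the two small responses are thresholded to exactly zero, while the two large ones are shrunk toward zero and satisfy the stationarity equation $\bar{\beta} + \lambda^* q \bar{\beta}^{q-1} = |z|$. Because the objective is non-convex for $0 < q < 1$, the result may depend on the starting value in less trivial designs.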
In practice, it is important to select appropriate tuning parameters in order to obtain good prediction or estimation precision. There are three parameters ($\lambda_1$, $\lambda_2$, $q$) which need to be chosen. As mentioned in the previous section, the choice of a proper $q$ is important and depends on the nature of the data. If the sparsity of the model is the main focus, a smaller $q$ tends to be appropriate. The aim of this paper is not only to study the sparsity of variables but also to study the grouping effect among strongly correlated variables. Therefore, the best $q$ ($0 < q < 1$) should be chosen from the data by cross-validation.
Firstly, $q$ ($0 < q < 1$) is given a grid of values to be compared. The choice of ($\lambda_1$, $\lambda_2$) will be different for different $q$. For each fixed $q$, cross-validation is used to choose the best ($\lambda_1$, $\lambda_2$). Typically, given a grid of values for $\lambda_2$, 5-fold cross-validation is applied to find, for each fixed $\lambda_2$, the $\lambda_1$ that minimizes the mean squared error of the model or the prediction error. In simulation studies, the mean squared error is defined by
$$\text{MSE} = (\hat{\beta} - \beta)^T E(X^T X) (\hat{\beta} - \beta), \tag{24}$$
where $E(X^T X)$ is approximated by the sample covariance matrix of $X$ computed on out-of-sample data. The formula of MSE is mentioned by Tibshirani [
$$\text{TestError} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{25}$$
measures the prediction error of the model. Finally, the chosen values ($q$, $\lambda_1$, $\lambda_2$) are used for comparison with the other models.
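The two criteria (24) and (25) can be computed as in the following sketch. It assumes NumPy; the function names are hypothetical, and the small arrays are made-up values used only to exercise the formulas.

```python
import numpy as np


def estimation_mse(beta_hat, beta_true, X_out):
    """(24), with E(X^T X) replaced by the sample covariance of out-of-sample X."""
    d = np.asarray(beta_hat, dtype=float) - np.asarray(beta_true, dtype=float)
    V = np.cov(X_out, rowvar=False)
    return float(d @ V @ d)


def mean_test_error(y, y_hat):
    """(25): average squared prediction error on the test observations."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))


# Made-up out-of-sample design with uncorrelated columns of variance 2/3.
X_out = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
mse = estimation_mse([2.0, 1.0], [1.0, 0.0], X_out)
err = mean_test_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

In a tuning loop, one of these two quantities would be evaluated for every candidate ($q$, $\lambda_1$, $\lambda_2$) on held-out folds, and the combination with the smallest value would be retained.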
In this section, simulation studies are presented to examine the finite sample performance of LqCP. The results are analyzed in terms of variable selection ability, estimation error and grouping effect. Data is generated from the true model:
$$y = X\beta + \sigma \epsilon, \quad \epsilon \sim N(0, 1). \tag{26}$$
Here $y$ is the response variable and $X$ is an $n \times p$ matrix with $p$ predictor vectors and $n$ observations. $\epsilon$ is a random error vector with $E(\epsilon) = 0$. $\beta$ is a $p$-dimensional parameter vector and $\sigma$ controls the volatility of $y$. Three methods are compared with LqCP in the simulation study: the elastic net (ENET), the adaptive elastic net (AENET) and L1CP, because these methods have the ability of grouped variable selection. Data is divided into two sets: training data and testing data. Training data is used for model fitting and cross-validation. Testing data is used to evaluate the error of the models. For each estimator $\hat{\beta}$, its estimation accuracy is measured by the mean squared error (MSE) on the testing data. The variable selection performance is gauged by C and IC, where C is the number of zero coefficients that are correctly estimated as zero and IC is the number of non-zero coefficients that are incorrectly estimated as zero. In addition, the algorithm's stopping criterion is $\|\beta^{m+1} - \beta^m\|_2 \le 10^{-4}$. In these simulation studies, $q$ is given five different values, $q = (0.1, \frac{1}{3}, \frac{1}{2}, \frac{2}{3}, 0.9)$. The results in the following tables are chosen by comparing MSE among the different $q$. ENET and AENET are computed by the “glmnet” package and L1CP is computed by the “lars” package in R.
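The selection measures C and IC defined above can be computed as in this short sketch. The function name `selection_counts`, the optional tolerance `eps` and the example estimate are illustrative assumptions.

```python
def selection_counts(beta_hat, beta_true, eps=0.0):
    """Return (C, IC) for an estimate against the true coefficient vector.

    C:  true zeros estimated as zero.
    IC: true non-zeros estimated as zero.
    A coefficient counts as zero when |beta_hat_j| <= eps.
    """
    pairs = list(zip(beta_hat, beta_true))
    c = sum(1 for bh, bt in pairs if bt == 0 and abs(bh) <= eps)
    ic = sum(1 for bh, bt in pairs if bt != 0 and abs(bh) <= eps)
    return c, ic


# Hypothetical estimate for a model with three relevant and three noise predictors.
beta_true = [3.0, -2.0, 1.5, 0.0, 0.0, 0.0]
beta_hat = [2.8, 0.0, 1.2, 0.0, 0.3, 0.0]
c, ic = selection_counts(beta_hat, beta_true)
```

A method with good selection performance has C close to the number of noise variables and IC close to zero.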
Example 1. 100 data sets are generated, each with 100 observations, from the linear regression model $y = X\beta + \sigma \epsilon$, $\epsilon \sim N(0, 1)$. The true regression coefficient vector is $\beta = (3, -2, 1.5, 0, \cdots, 0)$. Different dimension levels are considered as below. A smaller value $\sigma = 2$ is used, which makes all results stable throughout the simulation experiment. The correlation matrix is given by $\rho(x_i, x_j) = 0.5^{|i - j|}$. The training data has 70 observations and the testing data has 30. Furthermore, to study sparsity, the number of noise variables is increased by increasing the number of zeros in $\beta$, which generates data of four different dimensions. The dimensions of the data are respectively 10 (30%), 30 (10%), 60 (5%) and 100 (3%), where the percentage is the proportion of nonzero coefficients. The numbers of noise variables are respectively 7, 27, 57 and 97. That is to say, this part concerns how the results change as the number of noise variables increases. The simulation results are in
Example 2. This example is about the case of $p > n$. Similar to the example of Wu, Shen and Geyer (2009), corresponding to high-dimensional simulation scenarios with correlated groups, large $p$ and small $n$, the $X_i$ are simulated from $N(0, \Sigma)$, where the $(j, k)$-th element of $\Sigma$ is $0.5^{|j - k|}$ and
$$\beta = (3, 3, 3, 3, 3, -2, -2, -2, -2, -2, 1.5, 1.5, 1.5, 1.5, 1.5, 1, 1, 1, 1, 1, \mathbf{0}_{180}).$$
So there are 20 grouped relevant predictors and 180 noise predictors, and $n = 100$. Again, 100 data sets are generated, and each is split 70/30 into training data and testing data.
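One way to generate a data set of the kind described in Example 2 is sketched below. It assumes NumPy; the function names are hypothetical, and since the noise level for this example is not stated in the text, `sigma` is left as a required argument and the value used in the call is an arbitrary placeholder.

```python
import numpy as np


def ar1_covariance(p, rho=0.5):
    """Covariance matrix whose (j, k) entry is rho ** |j - k|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def simulate_example2(sigma, n=100, n_train=70, seed=0):
    """Draw one data set with 20 relevant and 180 noise predictors."""
    rng = np.random.default_rng(seed)
    beta = np.concatenate([np.full(5, 3.0), np.full(5, -2.0),
                           np.full(5, 1.5), np.full(5, 1.0), np.zeros(180)])
    L = np.linalg.cholesky(ar1_covariance(beta.size))
    X = rng.standard_normal((n, beta.size)) @ L.T    # rows follow N(0, Sigma)
    y = X @ beta + sigma * rng.standard_normal(n)
    train = (X[:n_train], y[:n_train])
    test = (X[n_train:], y[n_train:])
    return train, test, beta


# sigma = 1.0 is a placeholder; the text does not give the value for this example.
(train_X, train_y), (test_X, test_y), beta = simulate_example2(sigma=1.0)
```

Repeating the call with 100 different seeds would give the 100 replicated data sets, each already split 70/30.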
Methods | 30% level | 10% level | 5% level | 3% level |
---|---|---|---|---|
ENET | 0.521 (0.0314) | 0.957 (0.0722) | 1.501 (0.1342) | 1.998 (0.1744) |
AENET | 0.166 (0.0280) | 0.242 (0.0288) | 0.535 (0.0680) | 0.618 (0.1467) |
L1CP | 0.495 (0.0462) | 0.824 (0.0543) | 1.007 (0.0463) | 1.336 (0.0724) |
LqCP | 0.200 (0.0309) | 0.248 (0.0323) | 0.372 (0.0573) | 0.382 (0.1472) |

Methods | 30% level: C | 30% level: IC | 10% level: C | 10% level: IC | 5% level: C | 5% level: IC | 3% level: C | 3% level: IC |
---|---|---|---|---|---|---|---|---|
ENET | 2 | 0 | 18 | 0 | 44 | 0 | 79.5 | 0 |
AENET | 7 | 0 | 26 | 0 | 57 | 0 | 97 | 0 |
L1CP | 3 | 0 | 17 | 0 | 46 | 0 | 82 | 0 |
LqCP | 7 | 0 | 27 | 0 | 57 | 0 | 97 | 0 |

Example 3. To study the grouping effect, a sample of size 100 described by 40 predictors is considered. The true parameters are
$$\beta = (2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, \mathbf{0}_{25})$$
and $\sigma = 6$, which is selected to examine the performance of the models under larger volatility. The predictors are generated as:
$$x_i = Z_1 + 0.2 \epsilon_i, \quad Z_1 \sim N(0, 1), \quad i = 1, \cdots, 5$$
$$x_i = Z_2 + 0.2 \epsilon_i, \quad Z_2 \sim N(0, 1), \quad i = 6, \cdots, 10$$
$$x_i = Z_3 + 0.2 \epsilon_i, \quad Z_3 \sim N(0, 1), \quad i = 11, \cdots, 15$$
and $x_i$ are independent and identically distributed $N(0, 1)$ for $i = 16, \cdots, 40$, where $\epsilon_i$ are independently distributed $N(0, 1)$ for $i = 1, \cdots, 15$. In this data, three equally important groups have within-group pairwise correlations $\rho \approx 0.96$, and there are 25 pure noise features. Also, the data is split into 70 observations for training and 30 observations for testing.
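The design of Example 3 can be simulated as in the sketch below, which also makes the stated within-group correlation explicit: each grouped predictor has variance $1 + 0.2^2 = 1.04$ and shares covariance 1 with the other members of its group, so $\rho = 1 / 1.04 \approx 0.96$. The sketch assumes NumPy, and the function name and seeds are hypothetical.

```python
import numpy as np


def simulate_example3(n=100, sigma=6.0, seed=0):
    """Draw one data set with three groups of five predictors and 25 noise predictors."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, 3))                     # latent factors Z_1, Z_2, Z_3
    groups = [Z[:, [g]] + 0.2 * rng.standard_normal((n, 5)) for g in range(3)]
    noise = rng.standard_normal((n, 25))                # x_16, ..., x_40
    X = np.hstack(groups + [noise])
    beta = np.concatenate([np.full(5, 2.0), np.full(5, 3.0),
                           np.full(5, 4.0), np.zeros(25)])
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta


X, y, beta = simulate_example3()
X_train, y_train = X[:70], y[:70]   # 70 observations for training
X_test, y_test = X[70:], y[70:]     # 30 observations for testing

# Theoretical within-group correlation implied by the construction.
rho_within = 1.0 / (1.0 + 0.2 ** 2)
```

With a large simulated sample, the empirical correlation between two predictors of the same group is close to `rho_within`, while predictors from different groups are nearly uncorrelated.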
From
In the case of $p > n$ shown in
Methods | Median of MSE | C | IC |
---|---|---|---|
ENET | 3.080 (0.2201) | 151 | 0 |
AENET | 1.795 (0.1956) | 177 | 0 |
L1CP | 2.172 (0.1124) | 153 | 0 |
LqCP | 1.456 (0.0767) | 179 | 0 |

Methods | Median of MSE | C | IC |
---|---|---|---|
ENET | 8.351 (0.5367) | 17 | 1 |
AENET | 7.870 (0.4619) | 19 | 2 |
L1CP | 9.937 (0.5615) | 20 | 4 |
LqCP | 6.535 (0.4605) | 23 | 0 |

This part is about the performance of LqCP on two real-world data sets: US Crime and Gasoline, described by $p = 15$ and $p = 401$ explanatory predictors respectively. The dimension $p$ of the US Crime data set is smaller than the sample size ($n = 47$), while the number of variables of Gasoline largely exceeds the sample size $n = 69$. Because the true parameters in applications are unknown and the concern is the prediction accuracy of the response variable, the test error mentioned above is used as the criterion for comparing models. The selection of $q$ is also based on the minimized test error for $q = (0.1, \frac{1}{3}, \frac{1}{2}, \frac{2}{3}, 0.9)$.
This data set is taken from the R package “MASS” and contains 47 observations and 15 variables as well as one response variable. Criminologists are interested in the effect of punishment regimes on crime rates, which has been studied using aggregate data on 47 states of the USA for 1960. The data set contains the following 16 columns: percentage of males aged 14 - 24 (M), indicator variable for a Southern state (So), mean years of schooling (Ed), police expenditure in 1960 (Po1), police expenditure in 1959 (Po2), labor force participation rate (LF), number of males per 1000 females (M.F), state population (Pop), number of non-whites per 1000 people (NW), unemployment rate of urban males 14 - 24 (U1), unemployment rate of urban males 35 - 39 (U2), gross domestic product per head (GDP), income inequality (Ineq), probability of imprisonment (Prob), average time served in state prisons (Time), and the outcome, the rate of crimes in a particular category per head of population (y). The data is split 100 times into a training set of 24 observations and a test set of 23 observations. The results are listed in
Clearly,
The “Gasoline” data set comes from the R package “pls”. It consists of infrared spectra and contains 69 observations. The spectra are diffuse reflectance measurements taken at 2 nm intervals from 900 nm to 1700 nm. The Gasoline data has 401 predictor variables whose pairwise correlations are very high, almost 0.99. Similarly, the data set is split 100 times into a training set of 40 observations and a test set of 29. The prediction results are reported in
In the setting of high dimension ($p = 401$) and small sample size ($n = 69$), LqCP is the winner in terms of test error, and it also selects the fewest variables. Notably, in this application, the correlation among variables is very high and approaches 1. The result shows that AENET, L1CP and LqCP have similar variable selection behavior, while ENET is the worst. Therefore, this application demonstrates the efficiency of LqCP for $p > n$ and highly correlated variables.
Methods | Median of Test Error | Median no. of selected variables |
---|---|---|
ENET | 0.530 (0.0163) | 13 |
AENET | 0.534 (0.0143) | 10 |
L1CP | 0.537 (0.0180) | 9 |
LqCP | 0.518 (0.0145) | 7 |

Methods | Median of Test Error | Median no. of selected variables |
---|---|---|
ENET | 0.0213 (0.0006) | 32 |
AENET | 0.0217 (0.0009) | 10 |
L1CP | 0.0225 (0.0010) | 11 |
LqCP | 0.0191 (0.0007) | 8 |

In this paper, motivated by variable selection and the grouped selection property in linear regression problems, a new method called LqCP is proposed, which is a regularization procedure based on penalized least squares with a mix of the $L_q$ ($0 < q < 1$) norm and a correlation-based penalty. Firstly, this paper discusses the current models that have a grouping effect, including the elastic net, the adaptive elastic net and L1CP. Similar to them, LqCP also encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. In the simulation studies above, LqCP performs better in terms of variable selection ability with large numbers of noise variables, in high dimension $p > n$, and in automatic grouped variable selection for highly correlated variables. Additionally, two real data sets demonstrate LqCP's efficiency for both $p < n$ and $p > n$ by comparing its prediction error with that of other models. Moreover, the oracle property is important in statistics. One direction of future work is to prove the oracle property of LqCP. LqCP can also be extended to general regression models such as logistic regression and quantile regression, to solve regression problems or multi-class classification problems, especially for data with highly correlated variables.
Mao, N. and Ye, W.Z. (2017) Group Variable Selection via a Combination of Lq Norm and Correlation-Based Penalty. Advances in Pure Mathematics, 7, 51-65. http://dx.doi.org/10.4236/apm.2017.71005