Journal of Computer and Communications, 2014, 2, 1-14
Published Online September 2014 in SciRes. http://www.scirp.org/journal/jcc
http://dx.doi.org/10.4236/jcc.2014.211001
A Graduated Nonconvex Regularization for
Sparse High Dimensional Model Estimation
Thomas F. Coleman¹, Yuying Li²
¹Department of Combinatorics and Optimization, University of Waterloo, Waterloo, Canada
²Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
Email: tfcoleman@uwaterloo.ca, yuying@uwaterloo.ca
Received May 2014
Abstract
Many high dimensional data mining problems can be formulated as minimizing an empirical loss function with a penalty proportional to the number of variables required to describe a model. We propose a graduated non-convexification method to facilitate tracking of a global minimizer of this problem. We prove that, under some conditions, the proposed regularization problem using a continuous piecewise linear approximation is equivalent to the original $\ell_0$ regularization problem. In addition, a family of graduated nonconvex approximations is proposed to approximate this continuous $\ell_1$-type piecewise linear approximation. Computational results are presented to illustrate the performance.
Keywords
Sparse Model Estimation, Variable Selection, $\ell_0$ Regularization
1. Introduction
Sparsity is a desired property in model estimation since it often leads to better interpretability and out-of-sample
predictability. In [1], a risk bound is established for the model selection by minimizing the empirical loss func-
tion penalized by the number of variables needed to describe the model. In this context, model dimension is the
number of unknown variables; sparsity refers to a small number of variables selected to define the model. Thus
sparse model estimation is also sometimes referred to as variable selection.
Selecting a model with a small number of variables can be formulated as minimizing an empirical loss function with a penalization for the number of nonzero variables; this is referred to as $\ell_0$-regularization. Unfortunately, this is an NP-hard global optimization problem, see, e.g., [2] [3]. Relaxation of cardinality regularization has long been used as a way to approach the sparse model selection problem, see, e.g., [4] [5]. Due to its computational simplicity, regularization based on the $\ell_2$ norm is popular in practice; this is referred to as ridge regression.
It has also long been recognized that $\ell_1$ regularization often leads to sparsity in solutions, see, e.g., [6]-[11] [29]. Recent compressive sensing theory, e.g., [12]-[14], formally establishes that, under a certain restricted isometry property (RIP) on an $m \times n$ sensing matrix $\Phi$, $m \ll n$, a sparse vector $x$ can be reconstructed exactly from $y = \Phi x$. In addition, computational methods have been developed to obtain solutions efficiently, see, e.g., [15]-[17].
Despite its success, the $\ell_1$ regularization approach can potentially lead to model bias in model estimation, see, e.g., [18]-[21]. In [18], a smoothly clipped absolute deviation (SCAD) penalty is proposed, using a nonconvex regularization to avoid the potential bias. Iterative methods are proposed in [23] [24] to approximate a local minimizer of the SCAD penalized loss. It is proposed in [18] that a proper regularization function for model estimation should be chosen with three objectives: avoid bias in the resulting estimator, achieve sparsity in the estimated model, and achieve continuity or stability in model prediction. The $\ell_1$ regularization function is able to achieve sparsity and smoothness but can lead to model bias. In addition, using $\ell_1$ regularization does not always lead to the sparsest model which fits the data to a specified accuracy. Recently, computational methods have also been proposed to iteratively solve $\ell_0$ regularization problems, see, e.g., [25] [26].
The main objective of this paper is to devise a computational method to minimize the empirical loss function with a penalization for the number of nonzero variables. We approximate the counting indicator function by a continuous piecewise linear function. We show mathematically that this continuous piecewise linear approximation has appealing theoretical properties. To facilitate tracking a global minimizer, the continuous piecewise linear function is further approximated by a family of continuously differentiable piecewise quadratic functions, indexed by a parameter controlling the degree of nonconvexity in the approximation. Starting from an initial convex function, a sequence of increasingly nonconvex approximate problems is solved, using the solution to the previous problem as the starting point for the next. In addition, each approximation can be regarded as a regularization function for model estimation; each approximation behaves like the $\ell_1$ function near the origin (ensuring sparsity), like the $\ell_0$ function asymptotically (avoiding bias), and is continuously differentiable (for smoothness). We illustrate the efficacy of the proposed method in determining a sparse model for data fitting, the computational efficiency of the approach, and the effect of the parameters on the graduated nonconvex approximations.
2. Continuous Approximation to $\ell_0$
Assume that $f(x)$ measures the empirical error based on the given data. We want to solve the following $\ell_0$ regularization problem:
$$\min_{x \in R^n} \; f(x) + \mu \sum_{i=1}^{n} \Lambda(x_i) \qquad (2.1)$$
where the counting indicator function $\Lambda(\cdot)$ is
$$\Lambda(x_i) = \begin{cases} 1 & \text{if } x_i \neq 0 \\ 0 & \text{otherwise} \end{cases}$$
and $\mu > 0$ is a penalty parameter, balancing the objectives of minimization of $f(x)$ and sparsity in the solution.
Unfortunately, standard optimization methods cannot be applied to the data fitting problem (2.1), since the cardinality function $\Lambda(z)$ is discontinuous and nonconvex. Convex relaxations have been proposed for (2.1), see, e.g., [5] [27]; these relaxations can be sub-optimal since (2.1) is nonconvex. In [28], the counting indicator $\Lambda(z)$ is approximated, in the context of image processing, by the piecewise quadratic $\hat h_\lambda(z)$, and the following problem is solved:
$$\min_{x \in R^n} \; f(x) + \mu \sum_{i=1}^{n} \hat h_\lambda(x_i) \qquad (2.2)$$
where
$$\hat h_\lambda(z) = \begin{cases} \lambda z^2 & \text{if } |z| \le \frac{1}{\sqrt{\lambda}} \\ 1 & \text{otherwise} \end{cases}$$
and $\lambda > 0$ is a resolution parameter. We propose instead to approximate the discontinuous counting indicator function $\Lambda(z)$ by the following continuous piecewise linear ($\ell_1$-type) penalty function $h_\lambda(z)$:
$$h_\lambda(z) = \begin{cases} \lambda |z| & \text{if } |z| \le \frac{1}{\lambda} \\ 1 & \text{otherwise} \end{cases}$$
Figure 1 illustrates $h_\lambda(z)$ and $\hat h_\lambda(z)$.
Figure 1. Quadratic (left subplot) and piecewise linear (right subplot) approximations to the counting indicator function $\Lambda(z)$, with $\lambda = 100$.
Using $h_\lambda(z)$, the $\ell_0$ regularization problem (2.1) is approximated by
$$\min_{x \in R^n} \; p(x;\lambda) \;\overset{\mathrm{def}}{=}\; f(x) + \mu \sum_{i=1}^{n} h_\lambda(x_i). \qquad (2.3)$$
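To make the penalty functions concrete, the short Python sketch below (our illustration; the function names and the example loss are ours, not the paper's) evaluates the counting indicator $\Lambda$, the piecewise quadratic $\hat h_\lambda$ of (2.2), the piecewise linear $h_\lambda$ of (2.3), and the penalized objective $p(x;\lambda)$ for a user-supplied loss $f$:

```python
import numpy as np

def indicator(z):
    """Counting indicator Lambda(z): 1 where z != 0, else 0."""
    return (np.abs(z) > 0).astype(float)

def h_hat(z, lam):
    """Piecewise quadratic approximation of (2.2): lambda*z^2 capped at 1."""
    return np.minimum(lam * z ** 2, 1.0)

def h(z, lam):
    """Piecewise linear approximation of (2.3): lambda*|z| capped at 1."""
    return np.minimum(lam * np.abs(z), 1.0)

def p(x, lam, mu, f):
    """Penalized objective p(x; lambda) = f(x) + mu * sum_i h_lambda(x_i)."""
    return f(x) + mu * np.sum(h(x, lam))

# Example: least-squares loss with a small random design (illustration only).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
print(p(np.zeros(5), lam=20.0, mu=1e-4, f=f))
```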
Next we establish mathematically that there exists a finite threshold parameter $\bar\lambda$ such that a solution to (2.3) is a solution to the original $\ell_0$ penalty problem (2.1) for all $\lambda \ge \bar\lambda > 0$.
Assumption 2.1. Assume that $f(x)$ is twice continuously differentiable on an open set $D \subseteq R^n$ and there exists $K > 0$ such that $|\nabla_i f(x)| \le K$ for all $x \in D$, $i = 1, \ldots, n$.
Lemma 2.1. Suppose Assumption 2.1 holds. Let $x^* \in D$ be a local minimizer of $p(x;\lambda)$ with $\lambda \ge \frac{2K}{\mu}$. Then, for each index $i$, either $x_i^* = 0$ or $|x_i^*| \ge \frac{1}{\lambda}$. Hence either $h_\lambda(x_i^*) = 0$ or $h_\lambda(x_i^*) = 1$ for $i = 1, \ldots, n$.
Proof. For notational simplicity, in this proof we denote $p(x;\lambda)$ simply as $p(x)$. Suppose the contrary, i.e., for some index $i$, $0 < |x_i^*| < \frac{1}{\lambda}$. We note that, in this situation, the derivative of $p(x)$ with respect to $x_i$ exists at $x^*$, because $f(x)$ is everywhere differentiable on $D$ and, since by assumption $x_i^*$ is not a cusp point, $h_\lambda(x_i)$ is also differentiable at $x_i = x_i^*$. Then, by local optimality, $\nabla_i p(x^*) = 0$, i.e.,
$$\nabla_i f(x^*) + \mu\lambda = 0 \quad \text{or} \quad \nabla_i f(x^*) - \mu\lambda = 0.$$
But $\mu\lambda > K$ and $|\nabla_i f(x^*)| \le K$, which is contradictory. $\square$
Lemma 2.2. Let $x^*$ be a local minimizer of $p(x;\lambda)$ for some $\lambda > \frac{2K}{\mu}$. Then $x^*$ is a local minimizer of $p(x;\lambda')$ for all $\lambda' \ge \lambda$.
Proof. Suppose that $x^*$ is not a local minimizer of $p(x;\lambda^+)$ for some $\lambda^+ > \lambda$. Lemma 2.1 states that the components of $x^*$ are partitioned into two sets: for each index $i$, either $x_i^* = 0$ or $|x_i^*| \ge \frac{1}{\lambda}$. Since $\lambda^+ > \lambda$, the same partition applies with respect to $\lambda^+$. Therefore
$$\sum_{i=1}^{n} h_\lambda(x_i^*) = \sum_{i=1}^{n} h_{\lambda^+}(x_i^*).$$
It follows from the definition of $p(x;\lambda)$ that $x^*$ cannot be a local minimizer of $p(x;\lambda)$, a contradiction. $\square$
Theorem 2.1. If $x^* \in D$ is a strong local minimizer of $p(x;\lambda)$ for some $\lambda > \frac{2K}{\mu}$, then $x^*$ is a local minimizer of $p(x) = f(x) + \mu \sum_{i=1}^{n} \Lambda(x_i)$.
Proof. By Lemma 2.2, $x^*$ is a local minimizer of $p(x;\lambda)$ for all $\lambda$ sufficiently large. We show that $x^*$ is a local minimizer of $p(x)$ by contradiction. Suppose that $x^*$ is not a local minimizer of $p(x)$. Then there exists a sequence $x^k$ converging to $x^*$, with $x^k \ne x^*$ and $p(x^k) < p(x^*)$. This implies that $f(x^k) < f(x^*)$ for sufficiently large $k$. Since $p(x^k;\lambda) \le p(x^k)$ and $p(x^*;\lambda) = p(x^*)$, we have, for sufficiently large $k$,
$$p(x^k;\lambda) \le p(x^k) < p(x^*) = p(x^*;\lambda).$$
This contradicts the assumption that $x^*$ is a strong local minimizer of $p(x;\lambda)$ for some $\lambda > \frac{2K}{\mu}$. $\square$
If one uses $\hat h_\lambda(z)$ to approximate $\Lambda(z)$, a minimizer $x^*$ of the approximation problem (2.2) will generally not be a minimizer of the $\ell_0$ regularization problem (2.1) unless, for each $i$, either $x_i^* = 0$ or $|x_i^*| \ge \frac{1}{\sqrt{\lambda}}$; this typically does not hold for any $\lambda > 0$. Theorem 2.1 indicates that our proposed approximation $h_\lambda(z)$ is superior to $\hat h_\lambda(z)$ for solving the $\ell_0$ regularization problem (2.1).
3. Graduated Non-Convexification
We now address a couple of additional challenges. Firstly, $h_\lambda(z)$ is not differentiable everywhere, which is also the case for $\hat h_\lambda(z)$. Secondly, $\sum_{i=1}^{n} h_\lambda(x_i)$ is not convex; thus problem (2.3) has many local minimizers. Assume that $\mathcal{F} \subset \{1, 2, \ldots, n\}$ is a given index set, and consider the following problem:
$$\min_{x \in R^n} \; f(x) \quad \text{subject to } x_i = 0, \; i \in \mathcal{F}. \qquad (3.1)$$
Then any local minimizer of (3.1) is a local minimizer of (2.1) for a fixed $\mu > 0$.
For a given $\mu > 0$, computing a global minimizer of the $\ell_0$ regularization problem (2.1), or of the proposed approximation (2.3), is NP-hard. However, the quality of the estimated model depends on being able to find, as much as possible, a sufficiently good approximation to the global minimizer of (2.1). Next we develop a computational method to produce a good approximation to the global minimizer of the piecewise linear minimization problem (2.3).
Assume that the empirical loss function $f(x)$ is convex. Hence the nonconvexity comes from the counting indicator function. In [28], a graduated non-convexification process is proposed, in the context of image processing, in an attempt to find the global minimizer of the piecewise quadratic approximation (2.2), as follows. The continuous function $\hat h_\lambda(z)$ is approximated using a family of continuously differentiable piecewise quadratic functions $\hat g_\lambda(z;\rho)$, where
$$\hat g_\lambda(z;\rho) = \begin{cases} \lambda z^2 & \text{if } |z| \le \kappa \\ 1 - \frac{\rho}{2}\left(|z| - \gamma\right)^2 & \text{if } \kappa \le |z| \le \gamma \\ 1 & \text{otherwise} \end{cases} \qquad (3.2)$$
$$\gamma^2 = \frac{2}{\rho} + \frac{1}{\lambda}, \qquad \kappa = \frac{1}{\lambda\gamma}. \qquad (3.3)$$
Here $\rho > 0$ is a parameter indexing the family of approximations to $\hat h_\lambda(z)$. The function $\hat g_\lambda(z;\rho)$ is piecewise quadratic, with the concave quadratic $1 - \frac{\rho}{2}(|z| - \gamma)^2$ when $|z| \in (\kappa, \gamma)$. Note that, for any $\rho > 0$, we have $\gamma > \frac{1}{\sqrt{\lambda}}$, and hence $\kappa = \frac{1}{\lambda\gamma} < \frac{1}{\sqrt{\lambda}} < \gamma$. In addition, the function $\hat g_\lambda(z;\rho)$ is continuously differentiable and is symmetric in $z$ (an even function). The left subplot in Figure 2 graphically illustrates the function $\hat g_\lambda(z;\rho)$. Substituting $\hat g_\lambda(z;\rho)$ for $\hat h_\lambda(z)$, we have
$$\min_{x \in R^n} \; f(x) + \mu \sum_{i=1}^{n} \hat g_\lambda(x_i;\rho). \qquad (3.4)$$
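As a quick numerical check of definitions (3.2)-(3.3), the short sketch below (ours; the parameter values are illustrative only, and $\gamma$, $\kappa$ follow (3.3) as reconstructed above) evaluates $\hat g_\lambda(z;\rho)$ and verifies that it increases from 0 to 1 on $[0, \gamma]$:

```python
import numpy as np

def g_hat(z, lam, rho):
    """Piecewise quadratic GNC approximation (3.2), with gamma and kappa from (3.3)."""
    gamma = np.sqrt(2.0 / rho + 1.0 / lam)
    kappa = 1.0 / (lam * gamma)
    a = np.abs(z)
    return np.where(a <= kappa, lam * a ** 2,
           np.where(a <= gamma, 1.0 - 0.5 * rho * (a - gamma) ** 2, 1.0))

lam, rho = 20.0, 1.0                                  # illustrative values only
gamma = np.sqrt(2.0 / rho + 1.0 / lam)
zs = np.linspace(0.0, gamma, 1001)
assert np.all(np.diff(g_hat(zs, lam, rho)) >= 0.0)    # increases from 0 to 1 on [0, gamma]
print(g_hat(np.array([0.0, 0.01, 0.1, 1.0, 2.0]), lam, rho))
```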
Let $\{\rho_k\}$ be a given monotonically increasing sequence which converges to $+\infty$. In [28], the global minimizer of the piecewise quadratic minimization (2.2) is tracked by solving a sequence of problems (3.4) indexed by the monotonically increasing sequence $\{\rho_k\}$, using the solution of the $(k-1)$th problem as the starting point for the $k$th problem. Initially, the minimizer of the empirical error function, which corresponds to (3.4) with $\mu = 0$, is computed. As $\rho_k \to 0$, $\kappa_k \to 0$ and $\gamma_k \to +\infty$; thus problem (3.4) approaches the minimization of the empirical error function, $\min f(x)$, as $\rho_k \to 0$. As $\rho_k$ increases, the second order derivative of the quadratic piece on $[\kappa, \gamma]$, which equals $-\rho_k$, becomes increasingly more negative, gradually introducing non-convexity. In addition, as $\rho_k \to +\infty$, $\kappa_k$ and $\gamma_k$ converge to $\frac{1}{\sqrt{\lambda}}$, and the function $\hat g_\lambda(z;\rho_k)$ approaches $\hat h_\lambda(z)$. Geometrically, the convex empirical error function $f(x)$ is gradually deformed into the nonconvex penalized function, with the computed solution sequence following a path from the minimizer of $f(x)$ to, ideally, the global minimizer of $f(x) + \mu\sum_{i=1}^{n} \hat h_\lambda(x_i)$.
Similarly, we can design a family of approximations to track the global minimizer of $f(x) + \mu\sum_{i=1}^{n} h_\lambda(x_i)$ in (2.3). For sparsity, we want to retain the $\ell_1$ segment $\lambda|z|$ in a neighborhood of $z = 0$. Outside this neighborhood, we require the approximation to be continuously differentiable. Finally, similarly to $\hat g_\lambda(z;\rho_k)$, we ensure that this approximation gradually approaches the piecewise linear $h_\lambda(z)$.
Based on $\hat g_\lambda(z;\rho)$ in (3.2), we construct a family of approximations to $h_\lambda(z)$ with the desired properties as follows. Let $\gamma$ and $\kappa$ be defined in (3.3). First we select two breakpoints $\xi$ and $\eta$ so that $0 < \xi < \eta < \kappa$ and $\eta - \xi = \kappa - \eta$. In addition, $\xi$ and $\eta$ monotonically increase to $\kappa$ as $\rho$ converges to $+\infty$; this monotonicity property also holds for $\kappa$ itself.
Figure 2. Graduated nonconvex approximation $\hat g_\lambda(z;\rho)$ (left subplot) and $g_\lambda(z;\rho)$ (right subplot).
We then construct the unique quadratic spline $S(z) = \{s_1(z), s_2(z)\}$ on $[\xi, \kappa]$:
$$S(z) = \begin{cases} s_1(z) & \text{when } \xi \le z \le \eta \\ s_2(z) & \text{when } \eta \le z \le \kappa \end{cases}$$
which satisfies the following boundary conditions: the function and derivative values at $\xi$ and $\kappa$ are those of $\frac{1}{\gamma}z$ and $1 - \frac{\rho}{2}(z - \gamma)^2$, respectively, i.e.,
$$S(\xi) = \frac{\xi}{\gamma}, \quad S'(\xi) = \frac{1}{\gamma}, \quad S(\kappa) = 1 - \frac{\rho}{2}(\kappa - \gamma)^2, \quad S'(\kappa) = -\rho(\kappa - \gamma). \qquad (3.5)$$
We now approximate the nondifferentiable piecewise linear function $h_\lambda(z)$, for $z \in [0, +\infty)$, by the following continuously differentiable function $g_\lambda(z;\rho)$:
$$g_\lambda(z;\rho) = \begin{cases} \frac{1}{\gamma}\, z & \text{if } 0 \le z \le \xi \\ s_1(z) & \text{if } \xi \le z \le \eta \\ s_2(z) & \text{if } \eta \le z \le \kappa \\ 1 - \frac{\rho}{2}(z - \gamma)^2 & \text{if } \kappa \le z \le \gamma \\ 1 & \text{otherwise} \end{cases} \qquad (3.6)$$
where $\gamma$ and $\kappa$ are defined in (3.3). The right subplot in Figure 2 graphically illustrates the function $g_\lambda(z;\rho)$. By construction, $g_\lambda(z;\rho)$ is an even function; in addition, it is continuously differentiable on $(0, +\infty)$. We now establish the monotonicity property of $g_\lambda(z;\rho)$ on $[0, +\infty)$.
Lemma 3.1. Let $S(z)$ be the quadratic spline on $[\xi, \kappa]$ with the breakpoints $0 < \xi < \eta < \kappa$ and the boundary conditions (3.5), restated below:
$$S(\xi) = \frac{\xi}{\gamma}, \quad S'(\xi) = \frac{1}{\gamma}, \quad S(\kappa) = 1 - \frac{\rho}{2}(\kappa - \gamma)^2, \quad S'(\kappa) = -\rho(\kappa - \gamma),$$
where $\gamma$ and $\kappa$ are defined in (3.3). Assume $\rho > 0$ and $\eta - \xi = \kappa - \eta$. Then $S(z)$ is strictly monotonically increasing on $[\xi, \kappa]$, and $g_\lambda(z;\rho)$ is strictly monotonically increasing on $[0, \gamma]$.
Proof. Assume that $s_1(z) = a_1 + b_1(z - \xi) + c_1(z - \xi)^2$ and $s_2(z) = a_2 + b_2(z - \kappa) + c_2(z - \kappa)^2$ are the two quadratics. For the spline $S(z)$ on $[\xi, \kappa]$ with the boundary conditions (3.5), we have $s_1'(z) = b_1 + 2c_1(z - \xi)$ and $s_2'(z) = b_2 + 2c_2(z - \kappa)$. The boundary conditions (3.5) imply that
$$a_1 = s_1(\xi) = \frac{\xi}{\gamma}, \quad b_1 = s_1'(\xi) = \frac{1}{\gamma}, \quad a_2 = s_2(\kappa) = 1 - \frac{\rho}{2}(\kappa - \gamma)^2, \quad b_2 = s_2'(\kappa) = \rho(\gamma - \kappa). \qquad (3.7)$$
The derivative and function continuity of the spline at $z = \eta$ yield
$$c_1(\eta - \xi) - c_2(\eta - \kappa) = \frac{b_2 - b_1}{2},$$
$$c_1(\eta - \xi)^2 - c_2(\eta - \kappa)^2 = (a_2 - a_1) + b_2(\eta - \kappa) - b_1(\eta - \xi).$$
This leads to
$$c_1 = \frac{(a_2 - a_1) + b_2(\eta - \kappa) - b_1(\eta - \xi) + \frac{1}{2}(b_2 - b_1)(\kappa - \eta)}{(\eta - \xi)\left[(\kappa - \eta) + (\eta - \xi)\right]}.$$
Since $\hat g_\lambda(z;\rho)$ is continuous at $z = \kappa$, $a_2 = s_2(\kappa) = \hat g_\lambda(\kappa;\rho) = \lambda\kappa^2$. Using (3.3),
$$a_2 = \lambda\kappa^2 = \lambda\kappa \cdot \kappa = \frac{\kappa}{\gamma}.$$
Let $\Delta = \kappa - \eta = \eta - \xi$. From $a_1 = \frac{\xi}{\gamma}$, using (3.7) and $\kappa - \xi = 2\Delta$,
$$a_2 - a_1 = \frac{\kappa - \xi}{\gamma} = 2 b_1 \Delta.$$
Since $\hat g_\lambda(z;\rho)$ is continuously differentiable at $z = \kappa$, using (3.7), we have $b_2 = s_2'(\kappa) = 2\lambda\kappa = \frac{2}{\gamma} = 2 b_1$. Hence
$$b_2(\eta - \kappa) - b_1(\eta - \xi) = -3 b_1 \Delta.$$
Using the above, $a_2 - a_1 = 2 b_1 \Delta$, and $\kappa - \eta = \Delta$, we obtain
$$(a_2 - a_1) + b_2(\eta - \kappa) - b_1(\eta - \xi) + \frac{1}{2}(b_2 - b_1)(\kappa - \eta) = -\frac{1}{2} b_1 \Delta.$$
From the value of $c_1$ and the above, we have
$$c_1(\eta - \xi) = -\frac{1}{4} b_1.$$
Hence
$$s_1'(\eta) = b_1 + 2 c_1(\eta - \xi) = \frac{1}{2} b_1 > 0.$$
Since $s_1'(z)$ and $s_2'(z)$ are linear on $[\xi, \eta]$ and $[\eta, \kappa]$ respectively, from $s_1'(\xi) > 0$, $s_1'(\eta) = s_2'(\eta) > 0$ and $s_2'(\kappa) > 0$, we conclude that $s_1(z)$ and $s_2(z)$ are strictly monotonically increasing on $[\xi, \eta]$ and $[\eta, \kappa]$. Since $\frac{1}{\gamma}z$ and $1 - \frac{\rho}{2}(z - \gamma)^2$ are strictly monotonically increasing on $[0, \xi]$ and $[\kappa, \gamma]$ respectively, we conclude that $g_\lambda(z;\rho)$ is monotonically increasing on $[0, \gamma]$. $\square$
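The spline construction can be checked numerically. The sketch below (ours, with illustrative parameter values) solves the boundary conditions (3.5) together with value and derivative continuity at $\eta$ for the coefficients of $s_1$ and $s_2$, assembles $g_\lambda(z;\rho)$ of (3.6) using the relations of (3.3) as reconstructed above, and confirms the monotonicity asserted by Lemma 3.1:

```python
import numpy as np

lam, rho = 20.0, 1.0                                  # assumed illustrative values
gamma = np.sqrt(2.0 / rho + 1.0 / lam)                # (3.3)
kappa = 1.0 / (lam * gamma)
xi = kappa * (1.0 - 0.5 * min(1.0 / rho, 0.5))        # breakpoints as chosen in Section 4
eta = kappa * (1.0 - 0.25 * min(1.0 / rho, 0.5))

# Coefficients of s1(z)=a1+b1(z-xi)+c1(z-xi)^2 and s2(z)=a2+b2(z-kappa)+c2(z-kappa)^2.
a1, b1 = xi / gamma, 1.0 / gamma                      # S(xi), S'(xi)
a2 = 1.0 - 0.5 * rho * (kappa - gamma) ** 2           # S(kappa)
b2 = rho * (gamma - kappa)                            # S'(kappa)
# Value and derivative continuity at eta give a 2x2 linear system for (c1, c2).
u, v = eta - xi, eta - kappa
M = np.array([[u, -v], [u ** 2, -v ** 2]])
rhs = np.array([(b2 - b1) / 2.0, (a2 - a1) + b2 * v - b1 * u])
c1, c2 = np.linalg.solve(M, rhs)

def g(z):
    """Piecewise definition (3.6) of g_lambda(z; rho), for z >= 0 (even in z)."""
    z = np.abs(z)
    return np.where(z <= xi, z / gamma,
           np.where(z <= eta, a1 + b1 * (z - xi) + c1 * (z - xi) ** 2,
           np.where(z <= kappa, a2 + b2 * (z - kappa) + c2 * (z - kappa) ** 2,
           np.where(z <= gamma, 1.0 - 0.5 * rho * (z - gamma) ** 2, 1.0))))

zs = np.linspace(0.0, gamma, 2001)
assert np.all(np.diff(g(zs)) >= -1e-12)   # monotone on [0, gamma], as in Lemma 3.1
print(c1 * (eta - xi), -b1 / 4.0)         # matches the value derived in the proof
```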
Figure 2 compares the approximation $\hat g_\lambda(z;\rho)$ with the approximation $g_\lambda(z;\rho)$ for a few values of $\rho$. The left and right subplots look very similar; the main difference can be seen near $z = 0$.
Replacing $h_\lambda(z)$ by $g_\lambda(z;\rho)$, we obtain
$$\min_{x \in R^n} \; f(x) + \mu \sum_{i=1}^{n} g_\lambda(x_i;\rho). \qquad (3.8)$$
Each approximation $\sum_{i=1}^{n} g_\lambda(x_i;\rho)$ can be considered as a regularization for the empirical function minimization. This penalty function corresponds to the $\ell_1$ penalty around $x_i = 0$ for $1 \le i \le n$; it behaves like the counting indicator function when $|x_i|$ is very large. The size of each region depends on the parameter $\rho$. When $\rho \to +\infty$, the $\ell_1$ penalty is used in $\left[0, \frac{1}{\lambda}\right]$ and the counting indicator function is used in $\left[\frac{1}{\lambda}, +\infty\right)$. For any given $\rho > 0$, the penalty function is continuously differentiable everywhere except at the origin. The function $g_\lambda(z;\rho)$ is piecewise quadratic, with a concave quadratic piece for $|z| \in [\kappa, \gamma]$.
Based on (3.8), a graduated nonconvexification process can similarly be applied to track the global minimizer of (2.3), and of (2.1) if $\lambda$ is sufficiently large. Assume that the empirical function $f(x)$ is convex. For notational simplicity, let
$$P_k(x) = f(x) + \mu \sum_{i=1}^{n} g_\lambda(x_i;\rho_k). \qquad (3.9)$$
Starting from the minimizer of $f(x)$, a sequence of approximations $\{P_k\}$, $k = 1, 2, \ldots$, to the penalized empirical error minimization problem (3.8) is solved, each obtained by approximating the indicator function. As $\rho_k$ converges to zero, $\xi$, $\eta$ and $\kappa$ all converge to zero and $\gamma$ converges to $+\infty$; thus $g_\lambda(z;\rho)$ approaches the quadratic segment $1 - \frac{\rho}{2}(z - \gamma)^2$, whose (negative) curvature converges to zero. Hence the optimization problem (3.8) approaches the empirical error function minimization as $\rho_k$ approaches zero. As $\rho_k$ increases, the curvature of the quadratic piece defining $g_\lambda(z;\rho_k)$ for $z \in [\kappa_k, \gamma_k]$ becomes more negative, introducing a graduated nonconvexity. The negative curvature interacts with the positive curvature of the empirical error function in an attempt to reach the optimal subset solutions via minimizing $P_k(x)$. In addition, as $\rho_k \to +\infty$, $\kappa_k$ and $\gamma_k$ converge to a common limit, and the functions $g_\lambda(z;\rho_k)$ approach $h_\lambda(z)$.
Figure 3 graphically illustrates how this graduated nonconvexification process tracks the global minimizer of the minimization problem (2.3) with the one-dimensional function $f(z) = \frac{1}{2}(z - 2)^2$. In the top-left subplot of Figure 3, we see the original nonconvex function $f(z) + \mu h_\lambda(z)$, a convex approximation (corresponding to $\rho_k = 0.001$), and its minimizer. Increasing $\rho$ to 1, we see the next approximation to the original function in the top-right subplot. With the minimizer of the first approximation as a starting point, the minimizer of the new approximation, which is very close to the global minimizer, is computed. In the bottom two subplots ($\rho_k = 10$ and $1000$, respectively), we see how the approximating functions $P_k(z)$ approach the original function as $\rho_k$ increases. From this illustration we see that the proposed process first considers large scale features of the function $f(z) + \mu h_\lambda(z)$ and gradually focuses on features of a smaller scale. This graduated nonconvexification process can be terminated when the approximation $g_\lambda(z;\rho_k)$ to $h_\lambda(z)$ is sufficiently accurate at the computed solution.
Figure 3. Tracking the global minimizer of $f(z) + \mu h_\lambda(z)$: graduated nonconvex approximations.
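The tracking behaviour of Figure 3 is easy to mimic. The sketch below (ours) follows the minimizer of $f(z) + \mu\,\hat g_\lambda(z;\rho_k)$ for $f(z) = \frac{1}{2}(z - 2)^2$ as $\rho_k$ grows, warm-starting each stage at the previous minimizer. For brevity it uses the closed-form piecewise quadratic $\hat g$ of (3.2) rather than the spline-based $g$ of (3.6); the value $\mu = 1$ is our choice (the paper does not state the $\mu$ used in Figure 3), and Nelder-Mead is a simple local-search stand-in:

```python
import numpy as np
from scipy.optimize import minimize

def g_hat(z, lam, rho):
    """Piecewise quadratic GNC penalty (3.2), with gamma and kappa from (3.3)."""
    gamma = np.sqrt(2.0 / rho + 1.0 / lam)
    kappa = 1.0 / (lam * gamma)
    a = abs(z)
    if a <= kappa:
        return lam * a ** 2
    if a <= gamma:
        return 1.0 - 0.5 * rho * (a - gamma) ** 2
    return 1.0

f = lambda z: 0.5 * (z - 2.0) ** 2
mu, lam = 1.0, 20.0                         # mu is our illustrative choice

z = 2.0                                     # minimizer of f alone (the mu = 0 problem)
for rho in [1e-3, 1.0, 10.0, 1000.0]:       # the values used in Figure 3
    P = lambda w: f(w[0]) + mu * g_hat(w[0], lam, rho)
    z = minimize(P, [z], method="Nelder-Mead").x[0]
    print(f"rho = {rho:8g}  tracked minimizer = {z:.4f}")
```

With these values the tracked point settles near $z = 2$, which for $\mu = 1$ is the global minimizer of $f(z) + \mu h_\lambda(z)$ (objective value 1 there versus about 2 near the origin).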
Theorem 3.1 below shows that when, for all $i$, either $|x_i^*(\rho_k)| \le \xi_k$ or $|x_i^*(\rho_k)| \ge \gamma_k$, the sum $\sum_{i=1}^{n} g_\lambda(x_i^*(\rho_k);\rho_k)$ accurately approximates $\sum_{i=1}^{n} h_\lambda(x_i^*(\rho_k))$ at the computed solution, and remains so for any larger $\rho$.
Theorem 3.1. Assume that $\kappa(\rho)$ and $\gamma(\rho)$ are defined in (3.3), $0 < \xi < \eta < \kappa$, and $\xi(\rho)$ and $\eta(\rho)$ are monotonically increasing functions of $\rho$. Then the following holds:
If the first order necessary condition for (3.8) is satisfied at $x^*$ with $\bar\rho > 0$ and, for each $i$, either $|x_i^*| < \xi(\bar\rho)$ or $|x_i^*| > \gamma(\bar\rho)$, then the first order optimality condition for (3.8) is satisfied at $x^*$ for any $\rho \ge \bar\rho$.
If the second order necessary condition for (3.8) is satisfied at $x^*$ with $\bar\rho > 0$ and, for each $i$, either $|x_i^*| < \xi(\bar\rho)$ or $|x_i^*| > \gamma(\bar\rho)$, then the second order optimality condition for (3.8) is satisfied at $x^*$ for any $\rho \ge \bar\rho$.
Proof. Assume that $\bar\rho > 0$ and, for $1 \le i \le n$, either $|x_i^*| < \xi(\bar\rho)$ or $|x_i^*| > \gamma(\bar\rho)$. By definition (3.3), $\gamma(\rho) < \gamma(\bar\rho)$ and $\kappa(\bar\rho) < \kappa(\rho)$ for $\rho > \bar\rho$. From the monotonicity assumption on $\xi$ and $\eta$, for $\rho \ge \bar\rho$ we have $\xi(\bar\rho) \le \xi(\rho)$ and $\eta(\bar\rho) \le \eta(\rho)$. Therefore
$$\sum_{x_i^* \ne 0} \nabla g_\lambda(x_i^*;\rho) = \sum_{x_i^* \ne 0} \nabla g_\lambda(x_i^*;\bar\rho) \quad \text{and} \quad \sum_{x_i^* \ne 0} \nabla^2 g_\lambda(x_i^*;\rho) = \sum_{x_i^* \ne 0} \nabla^2 g_\lambda(x_i^*;\bar\rho).$$
Hence the first order and the second order necessary conditions for (3.8) hold at $x^*$ with $\rho > \bar\rho$ whenever these conditions for (3.8) hold at $x^*$ with $\bar\rho > 0$. $\square$
Applying Theorem 3.1, the following can be used as a stopping condition for the graduated nonconvexification process:
$$\text{either } |x_i^*(\rho_k)| < \xi_k \text{ or } |x_i^*(\rho_k)| > \gamma_k, \quad \text{for all } i. \qquad (3.10)$$
We also note that $g_\lambda(\cdot;\rho)$ can be erroneous in approximating $h_\lambda(\cdot)$ only on $[\xi, \gamma]$. Thus, when $\gamma - \xi$ is sufficiently small, we can also regard this approximation as sufficiently accurate, and the computation can also be terminated if the region of inaccuracy, $\gamma_k - \xi_k$, becomes sufficiently small.
The proposed graduated non-convex approximation process is summarized in Figure 4.
4. Computational Results
In this section we illustrate the performance of the proposed computational method for model estimation. We assume that the variables are constrained to be nonnegative; we note that the illustrated properties of the proposed sequence of $\ell_1$-type approximations to $\ell_0$ regularization, with respect to sparsity, will be similar when the model parameters are unconstrained.
To illustrate, we generate random sparse model selection problems based on the least squares data fitting problem below:
$$\min_{x \ge 0} \; \frac{1}{2}\|Ax - b\|_2^2 + \mu \sum_{i=1}^{n} \Lambda(x_i). \qquad (4.1)$$
GNC1 Algorithm. Let $\lambda > 0$ be a large constant and $\{\rho_k\}$ be a monotonically increasing sequence which converges to $+\infty$. Let tol be a positive stopping tolerance.
1) Compute a minimizer for minimization problem (3.8) with the penalty parameter $\mu = 0$. Let $k = 1$.
2) Compute a solution to (3.8) with $\rho = \rho_k$, using the computed solution of (3.8) with $\rho = \rho_{k-1}$ as a starting point.
3) If either (3.10) or the inequality $\gamma_k - \xi_k < \mathrm{tol}$ holds, terminate. Otherwise, set $k \leftarrow k + 1$ and go to step 2).
Figure 4. A graduated nonconvexification method for sparse model selection.
Each random problem instance is generated by a random matrix $A = UDV$, where $U$ is an $m$-by-$m$ random orthogonal matrix, $V$ is an $n$-by-$n$ random orthogonal matrix, and $D$ is a random diagonal matrix with a condition number equal to a specified constant. The $m$-by-1 vector $b$ is set equal to $Ax_0 + \epsilon$, where $x_0$ is a random vector with $K^*$ nonzero components (randomly selected); the nonzero components equal $100(1 + \mathrm{rand})$, where rand is a random sample from the uniform distribution on $[0, 1]$. The vector $\epsilon$ is a vector of $m$ independent normal samples with a fixed standard deviation, unless stated otherwise.
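As a concrete illustration, the following Python sketch (ours) generates one such random instance; the condition number and the noise standard deviation passed in the example call are placeholder values, since the paper treats them as fixed constants whose exact values are not restated here:

```python
import numpy as np

def random_orthogonal(k, rng):
    """Random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((k, k)))
    return q * np.sign(np.diag(r))        # fix column signs

def make_problem(m, n, k_nonzero, cond, noise_std, rng):
    """Random instance A = U D V, b = A x0 + eps, as described in Section 4."""
    U = random_orthogonal(m, rng)
    V = random_orthogonal(n, rng)
    p = min(m, n)
    D = np.zeros((m, n))
    D[:p, :p] = np.diag(np.logspace(0.0, -np.log10(cond), p))   # condition number = cond
    A = U @ D @ V
    x0 = np.zeros(n)
    idx = rng.choice(n, size=k_nonzero, replace=False)
    x0[idx] = 100.0 * (1.0 + rng.uniform(size=k_nonzero))        # 100*(1 + rand)
    b = A @ x0 + noise_std * rng.standard_normal(m)
    return A, b, x0

rng = np.random.default_rng(0)
A, b, x0 = make_problem(m=200, n=100, k_nonzero=50, cond=100.0, noise_std=0.1, rng=rng)
```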
Using the graduated nonconvexification algorithm, we compute a solution to (4.1) by solving the sequence of approximations
$$\min_{x \ge 0} \; \frac{1}{2}\|Ax - b\|_2^2 + \mu \sum_{i=1}^{n} g_\lambda(x_i;\rho_k) \qquad (4.2)$$
where $\{\rho_k\}$ is a sequence of monotonically increasing positive numbers. Unless stated otherwise, we set $\frac{1}{\lambda} = 0.05$, $\rho_k = 10\rho_{k-1}$, and $\rho_0 = 10^{-5}$. In addition, we let the spline breakpoints be as follows:
$$\xi = \kappa\left(1 - 0.5\,\min\left(\frac{1}{\rho}, 0.5\right)\right), \qquad \eta = \kappa\left(1 - 0.25\,\min\left(\frac{1}{\rho}, 0.5\right)\right).$$
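Putting the pieces together, the sketch below (ours) outlines this loop applied to (4.2), with the relations of (3.3), the spline of (3.5)-(3.6), and the breakpoint formulas above. SciPy's L-BFGS-B with nonnegativity bounds is used as a stand-in for the bound-constrained trust-region method of [30], nnls supplies the $\mu = 0$ starting point, and the tolerance and stage cap are placeholder values:

```python
import numpy as np
from scipy.optimize import minimize, nnls

def gnc_pieces(lam, rho):
    """Breakpoints and spline coefficients of g_lambda(.; rho), following (3.3)-(3.6)."""
    gamma = np.sqrt(2.0 / rho + 1.0 / lam)
    kappa = 1.0 / (lam * gamma)
    xi = kappa * (1.0 - 0.5 * min(1.0 / rho, 0.5))
    eta = kappa * (1.0 - 0.25 * min(1.0 / rho, 0.5))
    a1, b1 = xi / gamma, 1.0 / gamma
    a2, b2 = 1.0 - 0.5 * rho * (kappa - gamma) ** 2, rho * (gamma - kappa)
    u, v = eta - xi, eta - kappa
    c1, c2 = np.linalg.solve([[u, -v], [u ** 2, -v ** 2]],
                             [(b2 - b1) / 2.0, (a2 - a1) + b2 * v - b1 * u])
    return gamma, kappa, xi, eta, (a1, b1, c1), (a2, b2, c2)

def g_and_grad(z, lam, rho):
    """Value and derivative of the penalty g_lambda(z; rho) for z >= 0."""
    gamma, kappa, xi, eta, (a1, b1, c1), (a2, b2, c2) = gnc_pieces(lam, rho)
    val = np.where(z <= xi, z / gamma,
          np.where(z <= eta, a1 + b1 * (z - xi) + c1 * (z - xi) ** 2,
          np.where(z <= kappa, a2 + b2 * (z - kappa) + c2 * (z - kappa) ** 2,
          np.where(z <= gamma, 1.0 - 0.5 * rho * (z - gamma) ** 2, 1.0))))
    der = np.where(z <= xi, 1.0 / gamma,
          np.where(z <= eta, b1 + 2 * c1 * (z - xi),
          np.where(z <= kappa, b2 + 2 * c2 * (z - kappa),
          np.where(z <= gamma, -rho * (z - gamma), 0.0))))
    return val, der

def gnc1(A, b, mu, lam=20.0, rho0=1e-5, factor=10.0, tol=1e-3, max_stages=20):
    """Sketch of the GNC1 loop of Figure 4 applied to (4.2)."""
    n = A.shape[1]
    x = nnls(A, b)[0]                      # mu = 0, nonnegative least squares start
    rho = rho0
    for _ in range(max_stages):
        def obj(x):
            r = A @ x - b
            val, der = g_and_grad(x, lam, rho)
            return 0.5 * r @ r + mu * val.sum(), A.T @ r + mu * der
        x = minimize(obj, x, jac=True, method="L-BFGS-B",
                     bounds=[(0.0, None)] * n).x
        gamma, kappa, xi, _, _, _ = gnc_pieces(lam, rho)
        if np.all((x < xi) | (x > gamma)) or gamma - xi < tol:   # (3.10) or small gap
            break
        rho *= factor
    return x

# x_hat = gnc1(A, b, mu=1e-4)   # e.g., with A, b generated as in the earlier sketch
```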
We refer to the computational algorithm in Figure 4, with the parameter settings specified above, as GNC1. We use the trust region algorithm for bound-constrained problems proposed in [30] to solve (4.2). We illustrate various properties of the computed model $\hat{x}$. Specifically, we report the empirical error, sparsity, and errors compared to the true model $x^*$ used to generate the data. In measuring sparsity, we regard a component $\hat{x}_i$ as zero if $|\hat{x}_i| \le 10^{-6}$. In addition, we assess the approximation error of using the continuous approximation $h_\lambda(\cdot)$ or $\hat{h}_\lambda(\cdot)$ to the indicator function at the computed solution $\hat{x}$. We also evaluate the computational cost of the graduated nonconvexification process and report the total number of optimization iterations required to obtain the estimate $\hat{x}$ and the average number of optimization iterations required for minimizing each $P_k$.
Specifically, we report the following attributes, averaged over 100 random problem instances:
- Empirical error: $\|A\hat{x} - b\|_2$.
- Average relative distance to the true model: $\|\hat{x} - x^*\|_2 / \|x^*\|_2$.
- Sparsity: the number of zeros in the computed solution $\hat{x}$, i.e., $\sum_{i=1}^{n} \Lambda\left(|\hat{x}_i| \le 10^{-6}\right)$.
- Accuracy in approximating the counting indicator function $\Lambda(z)$ by $h_\lambda(z)$ or $\hat{h}_\lambda(z)$: recall that the approximation is accurate at $\hat{x}$ if $|\hat{x}_i| \le 1/\lambda$ implies $\hat{x}_i = 0$. We therefore measure this accuracy by the ratio $\sum_{i=1}^{n} \Lambda\left(|\hat{x}_i| \le 10^{-6}\right) \big/ \sum_{i=1}^{n} \Lambda\left(|\hat{x}_i| \le 1/\lambda\right)$; a value of 1 means 100% accuracy.
- Average number of optimization iterations over all $\rho_k$, denoted $\mathrm{itn}_k$.
- Total number of optimization iterations for the entire graduated non-convexification process, $\sum_k \mathrm{itn}_k$.
These attributes are straightforward to compute from a computed estimate, as sketched below.
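The following small Python sketch (ours; the dictionary keys are our names, and the zero tolerance matches the $10^{-6}$ used above) computes these attributes for a computed estimate:

```python
import numpy as np

def report(A, b, x_hat, x_true, lam, zero_tol=1e-6):
    """Attributes reported in Tables 1 and 2 for a computed estimate x_hat."""
    below = np.abs(x_hat) <= 1.0 / lam            # components below the h_lambda threshold
    zeros = np.abs(x_hat) <= zero_tol             # components regarded as numerically zero
    return {
        "empirical_error": np.linalg.norm(A @ x_hat - b),
        "rel_dist_to_true": np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true),
        "sparsity": int(zeros.sum()),
        # fraction of below-threshold components that are numerically zero (1 = 100%)
        "indicator_accuracy": float(zeros.sum() / max(below.sum(), 1)),
    }
```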
Table 1 reports the performance assessment for GNC1, based on the piecewise linear approximation $h_\lambda(z)$ to the counting function $\Lambda(z)$, in the specified parameter setting. The number of observations $m$ equals 200 and the dimension $n$ of $x$ equals 100. The two subpanels correspond to $K^* = \sum_{i=1}^{n}\Lambda(x_i^*)$, the number of nonzeros in $x^*$, equal to 50 and 90 respectively. The column under $\mu = 0$ provides the properties of the least squares solution for the given data; we see that the average least squares error $\|A\hat{x} - b\|_2$ is approximately 0.1. As the penalty parameter $\mu$ increases, the average data fitting error increases, but the sparsity in the estimate $\hat{x}$ also increases, from 0 to about 70.
In addition, observe from Table 1 that there is little difference in the fitting error when $\mu \in [0, 10^{-4}]$, but the sparsity in the solution estimate $\hat{x}$ is drastically different, from no zero components to more than 35, when the true solution $x^*$ has about 50 zero components, i.e., $K^* = 50$. From Table 1, we also observe that the relative average distance to the true parameter $x^*$ is smaller than that of the least squares solution when the penalty parameter is $\mu \in (0, 10^{-4}]$, for which the estimate $\hat{x}$ is fairly sparse. This indicates that sparsity regularization is indeed important in accurate model estimation. We also note that the error in using $h_\lambda(\cdot)$ to approximate the counting indicator function at the computed estimate $\hat{x}$ is very small, with a minimum correctness of 80% for penalty parameter $\mu = 10^{-6}$; complete accuracy of 100% is achieved when $\mu \ge 10^{-2}$.
One potential concern with the proposed graduated non-convex method is its computational cost, since a potentially large sequence (corresponding to the parameter sequence $\{\rho_k\}$) of optimization problems needs to be solved. However, since for each minimization of $P_k$ the solution from $P_{k-1}$ is used as a starting point, the total number of optimization iterations required is quite reasonable. For example, when $K^* = 50$, on average a total of 77.31 iterations is required for the entire GNC1 process. Since the average number of iterations for minimizing $P_k$ is 4.83, this implies that the GNC1 process here terminates after about 16 steps on average. It can be observed that the total number of iterations required increases with the penalty parameter $\mu$.
We also investigate the performance for underdetermined data fitting problems, in which the number of observations is less than the number of unknown variables (specifically, $m = 100$ and $n = 200$). Table 2 illustrates that the relative distance to the true model $x^*$ is significantly larger in the underdetermined case than in the overdetermined case; the corresponding entries from Table 1 are included in Table 2 for comparison. This is reasonable since there is less information available to infer the true model.
Table 1. GNC1 performance statistics from 100 random problems with the number of observations m greater than the dimension n: $\rho_{k+1} = 10\rho_k$ with $\rho_0$ = 1e-005 and $1/\lambda$ = 0.05.

Number of nonzeros in the generator K* = 50, m = 200, n = 100:

| Attribute | mu = 0 | 1e-006 | 0.0001 | 0.01 | 1 |
| Empirical error | 0.098 | 0.108 | 0.111 | 0.257 | 2.380 |
| Relative distance to x* | 0.199 | 0.172 | 0.188 | 0.469 | 0.686 |
| Sparsity (number of zeros) | 0.000 | 20.610 | 35.020 | 53.480 | 68.700 |
| Indicator accuracy | 0.000 | 0.812 | 0.942 | 1.000 | 1.000 |
| itn_k (average) | 0.000 | 2.123 | 4.832 | 7.947 | 7.281 |
| Total iterations | 0.000 | 33.970 | 77.320 | 127.150 | 116.490 |

Number of nonzeros in the generator K* = 90, m = 200, n = 100:

| Attribute | mu = 0 | 1e-006 | 0.0001 | 0.01 | 1 |
| Empirical error | 0.099 | 0.102 | 0.103 | 0.309 | 2.831 |
| Relative distance to x* | 0.160 | 0.150 | 0.177 | 0.485 | 0.694 |
| Sparsity (number of zeros) | 0.000 | 5.260 | 9.820 | 31.140 | 53.660 |
| Indicator accuracy | 0.000 | 0.805 | 0.948 | 1.000 | 1.000 |
| itn_k (average) | 0.000 | 2.023 | 3.094 | 7.353 | 8.368 |
| Total iterations | 0.000 | 32.370 | 49.510 | 117.650 | 133.890 |
Table 2. Distance to the true model (GNC1): overdetermined vs. underdetermined; $\rho_{k+1} = 10\rho_k$ with $\rho_0$ = 1e-005 and $1/\lambda$ = 0.05. Entries are the relative distance to the true model, as defined above.

Number of nonzeros in the generator K* = 50:

| (m, n) | mu = 0 | 1e-006 | 0.0001 | 0.01 | 1 |
| (200, 100) | 0.199 | 0.172 | 0.188 | 0.469 | 0.686 |
| (100, 200) | 0.720 | 0.714 | 0.715 | 0.781 | 0.857 |

Number of nonzeros in the generator K* = 90:

| (m, n) | mu = 0 | 1e-006 | 0.0001 | 0.01 | 1 |
| (200, 100) | 0.160 | 0.150 | 0.177 | 0.485 | 0.694 |
| (100, 200) | 0.722 | 0.718 | 0.721 | 0.784 | 0.861 |
5. Concluding Remarks
In high dimensional data fitting problems, the objective is to minimize an empirical loss function based on data
while achieving sparsity in the model parameters at the same time. Achieving sparsity can be crucial in obtain-
ing robust out-of-sample performance and attaining meaningful understanding of the causal relationship in data.
In addition, sparsity may be an explicit goal in practical applications such as tracking market indices with a
small number of assets.
The combination of minimizing empirical loss and maximizing sparsity naturally leads to a minimization problem regularized by a penalty proportional to the number of variables describing the model to be estimated; this problem has many local minimizers.
The empirical loss function is often convex. Assuming this property, we propose a graduated non-convex al-
gorithm to minimize a convex empirical loss function regularized with a penalty proportional to the number of
nonzero variables.
The proposed algorithm is based on approximating the counting indicator function by a continuous piecewise linear function. We show mathematically that the continuous piecewise linear approximation to the counting indicator function has more attractive properties than the continuous piecewise quadratic approximation. Specifically, there exists a sufficiently large parameter $\lambda$ so that the regularized optimization problem using this approximation is equivalent to the original optimization problem regularized by the number of nonzeros in the model parameters. This property does not hold for the continuous piecewise quadratic approximation.
A graduated nonconvexification process is proposed by introducing a family of approximations to the continuous piecewise linear function. This family of approximations is indexed by a nonnegative parameter $\rho$. In addition, these approximations can be regarded as penalty functions themselves, with the following properties. Firstly, there is a region around the origin in which the penalty function is the $\ell_1$ penalty. In another region, the penalty function equals the counting indicator function. As $\rho$ increases, both regions grow: the region with the $\ell_1$ penalty converges to $\left[0, \frac{1}{\lambda}\right]$, and the region with the counting indicator function as penalty converges to $\left[\frac{1}{\lambda}, +\infty\right)$. In addition, for any $\rho > 0$, the penalty function is continuously differentiable everywhere except at the origin. Each penalty function is an even function with a monotonicity property. When the parameter $\rho$ is small, a small negative curvature is added to the convex empirical loss function to form a regularized objective function. As $\rho$ increases, more negative curvature is added, but the region in which the penalty is nonconvex shrinks.
We investigate the performance of the proposed graduated nonconvexification algorithm based on randomly generated least squares problems with different sparsity levels in the true solution $x^*$. We observe that the data fitting error increases as the penalty parameter $\mu$ increases; simultaneously, sparsity in the computed solution increases as the penalty increases. In addition, sparse solutions (with sparsity close to that of the true solutions) can be computed without much compromise in the magnitude of the data fitting error. Indeed, the solution with the smallest relative distance to the true solution is obtained by sparse solutions with sparsity close to that of the true solutions. Our results also indicate that the computational cost required by the GNC process is relatively moderate, since the computed solution at the $k$th step of the graduated nonconvexification process is often a good starting point for the $(k+1)$th step.
References
[1] Barron, A., Birgé, L. and Massart, P. (1999) Risk Bounds for Model Selection via Penalization. Probability Theory and Related Fields, 113, 301-302. http://dx.doi.org/10.1007/s004400050210
[2] Natarajan, B.K. (1995) Sparse Approximate Solutions to Linear Systems. SIAM Journal on Computing, 24, 227-234. http://dx.doi.org/10.1137/S0097539792240406
[3] Davis, G., Mallat, S. and Avellaneda, M. (1997) Greedy Adaptive Approximation. Constructive Approximation, 13, 57-98. http://dx.doi.org/10.1007/BF02678430
[4] Tropp, J.A. (2006) Just Relax: Convex Programming Methods for Identifying Sparse Signals in Noise. IEEE Transactions on Information Theory, 52, 1030-1051.
[5] Tropp, J.A. (2004) Just Relax: Convex Programming Methods for Subset Selection and Sparse Approximation. Technical Report, ICES Report 0404, The University of Texas at Austin.
[6] Taylor, H.L., Banks, S.C. and McCoy, J.F. (1979) Deconvolution with the l1 Norm. Geophysics, 44, 39-52. http://dx.doi.org/10.1190/1.1440921
[7] Levy, S. and Fullagar, P.K. (1981) Reconstruction of a Sparse Spike Train from a Portion of Its Spectrum and Application to High-Resolution Deconvolution. Geophysics, 46, 1235-1243. http://dx.doi.org/10.1190/1.1441261
[8] Wu, J.K. (1994) Two Problems of Computer Mechanics Program System. In: Proceedings of Finite Element Analysis and CAD, Peking University Press, Beijing, 9-15.
[9] Santosa, F. and Symes, W.W. (1986) Linear Inversion of Band-Limited Reflection Seismograms. SIAM Journal on Scientific and Statistical Computing, 7, 1307-1330. http://dx.doi.org/10.1137/0907087
[10] Li, Y. and Santosa, F. (1996) A Computational Algorithm for Minimizing Total Variation in Image Restoration. IEEE Transactions on Image Processing, 5, 987-995. http://dx.doi.org/10.1109/83.503914
[11] Coleman, T.F., Li, Y. and Mariano, A. (2001) Segmentation of Pulmonary Nodule Image Using 1-Norm Minimization. Computational Optimization and Applications, 19, 243-272.
[12] Donoho, D.L., Elad, M. and Temlyakov, V.N. (2006) Stable Recovery of Sparse Overcomplete Representations in the Presence of Noise. IEEE Transactions on Information Theory, 52, 6-17. http://dx.doi.org/10.1109/TIT.2005.860430
[13] Candès, E.J. (2006) Compressive Sampling. In: Proceedings of the International Congress of Mathematicians, Madrid, Spain.
[14] Candès, E.J., Romberg, J. and Tao, T. (2006) Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information. IEEE Transactions on Information Theory, 52, 489-509. http://dx.doi.org/10.1109/TIT.2005.862083
[15] Kim, S., Koh, K., Boyd, S. and Gorinevsky, D. (2007) An Interior-Point Method for Large-Scale l1-Regularized Least Squares. IEEE Journal of Selected Topics in Signal Processing, 1, 606-617. http://dx.doi.org/10.1109/JSTSP.2007.910971
[16] Wright, S.J., Nowak, R.D. and Figueiredo, M.A.T. (2009) Sparse Reconstruction by Separable Approximation. IEEE Transactions on Signal Processing, 57, 2470-2493. http://dx.doi.org/10.1109/TSP.2009.2016892
[17] Daubechies, I., DeVore, R., Fornasier, M. and Güntürk, C.S. (2010) Iteratively Reweighted Least Squares Minimization for Sparse Recovery. Communications on Pure and Applied Mathematics, 63, 1-38.
[18] Fan, J. and Li, R. (2001) Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96, 1348-1360.
[19] Fan, J. and Li, R. (2006) Statistical Challenges with High Dimensionality: Feature Selection in Knowledge Discovery. In: Sanz-Solé, M., Soria, J., Varona, J.L. and Verdera, J., Eds., Proceedings of the International Congress of Mathematicians, 595-622.
[20] Lv, J. and Fan, Y. (2009) A Unified Approach to Model Selection and Sparse Recovery Using Regularized Least Squares. The Annals of Statistics, 37, 3498-3528. http://dx.doi.org/10.1214/09-AOS683
[21] Fan, J., Lv, J. and Qi, L. (2011) Sparse High-Dimensional Models in Economics. Annual Review of Economics, 3, 291-317.
[22] Fan, J. (1997) Comments on "Wavelets in Statistics: A Review". Journal of the Italian Statistical Society, 6, 131-138. http://dx.doi.org/10.1007/BF03178906
[23] Hunter, D.R. and Li, R. (2005) Variable Selection Using MM Algorithms. The Annals of Statistics, 33, 1617-1642. http://dx.doi.org/10.1214/009053605000000200
[24] Zou, H. and Li, R. (2008) One-Step Sparse Estimates in Nonconcave Penalized Likelihood Models (with Discussion). The Annals of Statistics, 36, 1509-1533. http://dx.doi.org/10.1214/009053607000000802
[25] Candès, E.J., Wakin, M.B. and Boyd, S.P. (2008) Enhancing Sparsity by Reweighted l1 Minimization. Journal of Fourier Analysis and Applications, 14, 877-905. http://dx.doi.org/10.1007/s00041-008-9045-x
[26] Chartrand, R. and Yin, W. (2008) Iteratively Reweighted Algorithms for Compressive Sensing. In: IEEE International Conference on Acoustics, Speech and Signal Processing, 3869-3872.
[27] Zhang, T. (2010) Analysis of Multi-Stage Convex Relaxation for Sparse Regularization. Journal of Machine Learning Research, 10, 1081-1107.
[28] Blake, A. and Zisserman, A. (1987) Visual Reconstruction. MIT Press, Cambridge, MA.
[29] Coleman, T.F., Henninger, J. and Li, Y. (2006) Minimizing Tracking Error While Restricting the Number of Assets. Journal of Risk, 8, 35-56.
[30] Coleman, T.F. and Li, Y. (1996) An Interior Trust Region Approach for Nonlinear Minimization Subject to Bounds. SIAM Journal on Optimization, 6, 418-445. http://dx.doi.org/10.1137/0806023