
In a survival analysis context, we suggest a new method to estimate the piecewise constant hazard rate model. The method provides an automatic procedure to find the number and location of cut points and to estimate the hazard on each cut interval. Estimation is performed through a penalized likelihood using an adaptive ridge procedure. A bootstrap procedure is proposed in order to derive valid statistical inference taking into account both the variability of the estimate and the variability in the choice of the cut points. The new method is applied both to simulated data and to the Mayo Clinic trial on primary biliary cirrhosis. The algorithm implementation is seen to work well and to be of practical relevance.

In survival analysis, when interest lies in the estimation of the hazard rate, an attractive and popular model is the piecewise constant hazard model. This model is easy to interpret as the hazard rate is supposed to be constant on some predefined time intervals, and plotting the hazard rate gives a quick sense of the evolution of the event of interest through time. Many epidemiological studies use this model to represent the hazard rate function either because it provides an interesting way to fit the hazard function or because the data are not available on the individual level. See for instance

While this model can be used in a nonparametric setting, it is often used in combination with covariate effects. This is the case for instance for the popular Poisson regression model (see [

When modeling covariates effect through a proportional hazard model, [

and the likelihood simplifies into the Cox partial likelihood where the regression effect can be estimated separately from the baseline. While this is a very interesting aspect of the Cox model, this nice separation between baseline estimation and regression effect estimation does not hold anymore in many extensions of this model. For instance, in frailty models (see among many other authors [

In the joint modeling framework one wants to model the association between a longitudinal variable and a time to event response through a random effect (see [

Other contexts where the partial likelihood approach does not work anymore include the cure models framework (see for instance [

In Section 2, the piecewise constant hazard model is recalled and the adaptive ridge estimator is applied to this model. Section 3 proposes two different procedures to choose the penalty term involved in the estimation procedure. Section 4 proposes a bootstrap based method to obtain valid inference for survival distribution quantities such as the survival function. A simulation study is conducted in Section 5, where the efficiency of the estimation method is evaluated and the two different procedures to choose the penalty term are compared. The method is applied to the Mayo Clinic trial on primary biliary cirrhosis in Section 6 and a small discussion concludes the paper in Section 7.

Let $T^*$ represent the survival time of interest. In practice $T^*$ might be censored by a random variable $C$, so that we observe $(T = T^* \wedge C, \Delta = I(T^* \le C))$. The data consist of $n$ independent replications $(T_i, \Delta_i)_{i = 1, \cdots, n}$. We aim at estimating the hazard function defined for $t \ge 0$ by:

$$\lambda(t) = \lim_{\Delta t \to 0} \frac{\mathbb{P}[t \le T^* < t + \Delta t \mid T^* \ge t]}{\Delta t}.$$

In the following, this hazard function is assumed to be piecewise constant on $L$ cuts represented by $c_0, c_1, \cdots, c_L$, with the convention that $c_0 = 0$ and $c_L = +\infty$. Let $I_l(t) = I(c_{l-1} < t \le c_l)$. We suppose that

$$\lambda(t) = \sum_{l=1}^{L} I_l(t)\, \alpha_l,$$

for $\alpha_l \ge 0$, $l = 1, \cdots, L$. Note that the exponential baseline hazard is obtained from $L = 1$ in the piecewise constant hazard family. Let $\Lambda(t) = \int_0^t \lambda(s)\, ds$ denote the cumulative hazard function and denote by $\alpha = (\alpha_1, \cdots, \alpha_L)$ the model parameter we aim to estimate.

In order to make inference on the model parameter we will assume independent right censoring and non-informative censoring (see [

$$L_n(\alpha) = \sum_{i=1}^{n} \left\{ \log(\lambda(T_i))\, \Delta_i - \int_0^{T_i} \lambda(t)\, dt \right\},$$

where the equality holds true up to a constant that does not depend on the model parameter $\alpha$. For computational purposes, it is interesting to note that the log-likelihood can be written in a Poisson regression form. Introduce $R_{i,l} = I(T_i \ge c_{l-1})(c_l \wedge T_i - c_{l-1})$, the total time individual $i$ is at risk in the $l$th interval $(c_{l-1}, c_l]$, and $O_{i,l} = I_l(T_i)\, \Delta_i$, the number of events for individual $i$ in the $l$th subinterval. Also, $R_l = \sum_{i=1}^{n} R_{i,l}$ and $O_l = \sum_{i=1}^{n} O_{i,l}$ are sufficient statistics, and estimation can be carried out using only these two statistics. The log-likelihood can then be written again as (see [

$$L_n(\alpha) = \sum_{l=1}^{L} \left\{ O_l \log(\alpha_l) - \alpha_l R_l \right\}. \tag{1}$$
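The passage from the individual-level log-likelihood to the Poisson form (1) can be checked term by term. The short derivation below is a sketch that uses only the definitions of $I_l$, $R_{i,l}$ and $O_{i,l}$ given above, together with the fact that each $T_i$ falls in exactly one interval:

$$
\begin{aligned}
\log(\lambda(T_i))\, \Delta_i &= \sum_{l=1}^{L} I_l(T_i)\, \Delta_i \log(\alpha_l) = \sum_{l=1}^{L} O_{i,l} \log(\alpha_l), \\
\int_0^{T_i} \lambda(t)\, dt &= \sum_{l=1}^{L} \alpha_l \int_0^{T_i} I_l(t)\, dt = \sum_{l=1}^{L} \alpha_l R_{i,l}.
\end{aligned}
$$

Summing both lines over $i = 1, \cdots, n$ and exchanging the order of summation gives $\sum_{l=1}^{L} \{ O_l \log(\alpha_l) - \alpha_l R_l \}$.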

Since $L_n$ is concave, the maximum likelihood estimator has an explicit solution, obtained by differentiating the log-likelihood: for $l = 1, \cdots, L$,

$$\hat{\alpha}_l = \frac{O_l}{R_l}. \tag{2}$$
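As an illustration, the sufficient statistics and the estimator in Equation (2) can be computed in a few lines. This is a minimal sketch in plain Python; the function names `sufficient_stats` and `hazard_mle` and the toy data are illustrative assumptions, not part of the original implementation.

```python
import math

def sufficient_stats(times, events, cuts):
    """Return (O, R): number of events and total time at risk per interval.

    cuts = [c_0, c_1, ..., c_L] with c_0 = 0 and c_L = math.inf;
    interval l (0-based here) is (cuts[l], cuts[l + 1]].
    """
    L = len(cuts) - 1
    O = [0] * L
    R = [0.0] * L
    for t, d in zip(times, events):
        for l in range(L):
            lo, hi = cuts[l], cuts[l + 1]
            if t >= lo:
                R[l] += min(hi, t) - lo   # R_{i,l}: time at risk in the interval
            if lo < t <= hi:
                O[l] += d                 # O_{i,l}: event indicator in the interval
    return O, R

def hazard_mle(O, R):
    """Unpenalized estimator of Equation (2): alpha_l = O_l / R_l."""
    return [o / r if r > 0 else float("nan") for o, r in zip(O, R)]

# Toy data (hypothetical): four subjects, the second one is censored.
times = [2.0, 5.0, 7.0, 12.0]
events = [1, 0, 1, 1]
cuts = [0.0, 4.0, 8.0, math.inf]

O, R = sufficient_stats(times, events, cuts)
alpha_hat = hazard_mle(O, R)
```

On this toy sample each interval contains one event, and the times at risk are 14, 8 and 4, so the estimated hazards are 1/14, 1/8 and 1/4.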

For computational purposes, introduce $a_l$ such that $\alpha_l = \exp(a_l)$ and $a = (a_1, \cdots, a_L)^T$ the vector parameter we aim to estimate. Using the L0 penalty from [

$$L_n^{\mathrm{pen}}(a, w) = \sum_{l=1}^{L} \left\{ O_l a_l - \exp(a_l) R_l \right\} - \frac{\mathrm{pen}}{2} \sum_{l=1}^{L-1} w_l (a_{l+1} - a_l)^2, \tag{3}$$

where $w = (w_1, \cdots, w_{L-1})$ are non-negative weights that will be iteratively updated in order for the weighted ridge penalty term to approximate the L0 penalty.

The score vector is denoted $U(a, w) = \partial L_n^{\mathrm{pen}}(a, w) / \partial a$ and its $l$th component, $l \in \{1, \cdots, L\}$, is equal to:

$$O_l - R_l \exp(a_l) + \left( w_{l-1} a_{l-1} - (w_{l-1} + w_l) a_l + w_l a_{l+1} \right) \mathrm{pen},$$

with the convention $w_0 = w_L = a_0 = a_{L+1} = 0$. Now introduce $I(a, w) = -\partial U(a, w) / \partial a^T$, the opposite of the Hessian matrix. $I(a, w)$ is an $L \times L$ non-negative definite band matrix whose bandwidth equals 1. Its diagonal elements are equal to

$$I(a, w)_{l,l} = R_l \exp(a_l) + (w_{l-1} + w_l)\, \mathrm{pen},$$

the other elements next to the diagonal are defined for $l = 1, \cdots, L - 1$ by

$$I(a, w)_{l,l+1} = I(a, w)_{l+1,l} = -w_l\, \mathrm{pen},$$

and all other elements are equal to zero, that is, for $l, l'$ such that $|l - l'| \ge 2$, $I(a, w)_{l,l'} = 0$.

The vector parameter $a$ is updated using the Newton-Raphson algorithm. For a given sequence of weights $w^{(m-1)}$ obtained at the $(m-1)$th step, the $m$th Newton-Raphson iteration step is obtained from the equation

$$a^{(m)} = a^{(m-1)} + I(a^{(m-1)}, w^{(m-1)})^{-1}\, U(a^{(m-1)}, w^{(m-1)}).$$

The inversion of the band matrix is performed through a fast (linear complexity) C++ implementation of the well-known LDL algorithm (a variant of the LU decomposition for symmetric matrices). Initialization of the Newton-Raphson algorithm can be obtained from the classical unpenalized estimator of the piecewise constant hazard model, that is $a^{(0)} = \arg\max_a L_n(a)$. See [
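To make the Newton-Raphson update concrete, here is a minimal pure-Python sketch of one iteration. It is not the C++ code referred to above: the function names are hypothetical, and the linear system is solved with the Thomas algorithm, which for a symmetric positive definite tridiagonal matrix carries out the same linear-time elimination as an LDL factorization. The list `w` has length $L - 1$, with `w[l]` linking `a[l]` and `a[l + 1]`.

```python
import math

def score_and_info(a, w, O, R, pen):
    """Score U(a, w) and tridiagonal information I(a, w) as (diag, off)."""
    L = len(a)
    U = [0.0] * L
    diag = [0.0] * L
    off = [-pen * w[l] for l in range(L - 1)]       # off-diagonal: -w_l * pen
    for l in range(L):
        w_prev = w[l - 1] if l > 0 else 0.0         # convention w_0 = 0
        w_next = w[l] if l < L - 1 else 0.0         # convention w_L = 0
        a_prev = a[l - 1] if l > 0 else 0.0         # convention a_0 = 0
        a_next = a[l + 1] if l < L - 1 else 0.0     # convention a_{L+1} = 0
        U[l] = (O[l] - R[l] * math.exp(a[l])
                + pen * (w_prev * a_prev - (w_prev + w_next) * a[l]
                         + w_next * a_next))
        diag[l] = R[l] * math.exp(a[l]) + pen * (w_prev + w_next)
    return U, diag, off

def solve_tridiagonal(diag, off, rhs):
    """Solve a symmetric tridiagonal system in linear time (Thomas algorithm)."""
    n = len(diag)
    d = diag[:]
    b = rhs[:]
    for i in range(1, n):                           # forward elimination
        m = off[i - 1] / d[i - 1]
        d[i] -= m * off[i - 1]
        b[i] -= m * b[i - 1]
    x = [0.0] * n
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):                  # back substitution
        x[i] = (b[i] - off[i] * x[i + 1]) / d[i]
    return x

def newton_step(a, w, O, R, pen):
    """One update a + I(a, w)^{-1} U(a, w)."""
    U, diag, off = score_and_info(a, w, O, R, pen)
    step = solve_tridiagonal(diag, off, U)
    return [ai + si for ai, si in zip(a, step)]

# Sanity check with pen = 0: the iterations approach log(O_l / R_l).
O, R = [1, 1, 1], [14.0, 8.0, 4.0]
a = [0.0, 0.0, 0.0]
for _ in range(50):
    a = newton_step(a, [1.0, 1.0], O, R, 0.0)
```

With a zero penalty the system is diagonal and the iterations recover the unpenalized estimator of Equation (2) on the log scale, which is a convenient check before turning the penalty on.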

On the other hand, following the recommendation from [

$$w_l^{(m)} = \left( \left( a_{l+1}^{(m)} - a_l^{(m)} \right)^2 + \delta^2 \right)^{-1},$$

for $l = 1, \cdots, L - 1$ with $\delta = 10^{-5}$. Briefly, this form of the weights is motivated by the fact that $w_l (a_{l+1} - a_l)^2$ is close to 0 when $|a_{l+1} - a_l| < \delta$ and close to 1 when $|a_{l+1} - a_l| > \delta$. Hence the penalty term tends to approximate the L0 norm. The weights are initialized by $w_l^{(0)} = 1$, which gives the standard ridge estimate of $a$.
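The way the weighted ridge term mimics an L0 penalty can be seen numerically. The following minimal sketch (function names are illustrative assumptions) applies the weight update to a fixed vector and sums $w_l (a_{l+1} - a_l)^2$, which comes out very close to the number of jumps in the vector.

```python
DELTA = 1e-5  # value of delta used in the text

def update_weights(a, delta=DELTA):
    """Adaptive ridge weights: w_l = 1 / ((a_{l+1} - a_l)^2 + delta^2)."""
    return [1.0 / ((a[l + 1] - a[l]) ** 2 + delta ** 2)
            for l in range(len(a) - 1)]

def approx_l0(a, delta=DELTA):
    """Sum of w_l (a_{l+1} - a_l)^2, which approximates the number of jumps."""
    w = update_weights(a, delta)
    return sum(w[l] * (a[l + 1] - a[l]) ** 2 for l in range(len(a) - 1))

# Hypothetical log-hazard vector with two jumps (between positions 2-3 and 4-5).
a = [-3.0, -3.0, -2.0, -2.0, -1.5]
n_jumps = approx_l0(a)
```

Here `n_jumps` differs from 2 only by a term of order $\delta^2$, while equal neighbours contribute exactly zero.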

In this section we propose two different ways to choose the correct penalty term. The first one is based on a standard cross-validation criterion while the second one is based on a BIC criterion.

In order to choose the right penalty term, one must first define a large grid of penalty values at which the criterion (cross-validation or BIC) will be evaluated. For that purpose, the algorithm can benefit from a warm start of the penalty weights. Indeed, instead of initializing the weights to 1 for each penalty value, one can take the final weights of the previous (smaller) penalty as a starting point for the next (larger) penalty. In this way, full regularization paths similar to those of the LASSO can be generated very efficiently. Note, however, that this warm start is not necessary since it is always possible to initialize the algorithm with neutral weights of value 1. A preliminary set of cut positions must also be chosen. For simplicity we recommend taking a large set of equally spaced points covering the range of the observed time values. See Sections 5 and 6 to see how this works in practice.

Split the data into $k$ parts and define $\hat{a}_{\mathrm{pen}}^{-I}$ as the maximizer of the penalized likelihood in Equation (3) when part $I$ is left out from the data. Then the $k$-fold cross-validated log-likelihood is defined by:

$$\mathrm{cv}(\mathrm{pen}) = \sum_{I} L_I(\hat{a}_{\mathrm{pen}}^{-I}),$$

where $L_I$ represents the unpenalized log-likelihood as in Equation (1) but computed only on part $I$ of the data. Maximizing this quantity with respect to $\mathrm{pen}$ gives the optimal penalty term.

Note that unlike the Cox regression framework where the baseline is left unspecified, this cross-validated criterion is well defined since in our case the hazard rate is constructed as a continuous function of time. Also, the relation

$$\sum_{I} L_I(\hat{a}_{\mathrm{pen}}^{-I}) = \sum_{I} \left\{ L_n(\hat{a}_{\mathrm{pen}}^{-I}) - L_{-I}(\hat{a}_{\mathrm{pen}}^{-I}) \right\}$$

holds such that our criterion is completely equivalent to the cross-validated criterion developed by [

To keep the computation time reasonable, 10-fold cross-validation is recommended in practice.
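The cross-validated criterion can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the argument `fit` is a hypothetical user-supplied function returning the penalized estimate $\hat{a}$ for a training set, folds are formed by simple interleaving (so the data should be shuffled beforehand), and the helper names are invented for the example.

```python
import math

def stats(times, events, cuts):
    """Sufficient statistics (O, R) of Equation (1) for a data subset."""
    L = len(cuts) - 1
    O, R = [0] * L, [0.0] * L
    for t, d in zip(times, events):
        for l in range(L):
            if t >= cuts[l]:
                R[l] += min(cuts[l + 1], t) - cuts[l]
            if cuts[l] < t <= cuts[l + 1]:
                O[l] += d
    return O, R

def loglik(a, O, R):
    """Unpenalized log-likelihood of Equation (1) with alpha_l = exp(a_l)."""
    return sum(o * al - math.exp(al) * r for al, o, r in zip(a, O, R))

def cv_score(times, events, cuts, pen, fit, k=10):
    """k-fold cross-validated log-likelihood cv(pen)."""
    n = len(times)
    total = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))            # part I
        train_t = [times[i] for i in range(n) if i not in held_out]
        train_e = [events[i] for i in range(n) if i not in held_out]
        test_t = [times[i] for i in sorted(held_out)]
        test_e = [events[i] for i in sorted(held_out)]
        a_hat = fit(train_t, train_e, cuts, pen)     # estimate without part I
        total += loglik(a_hat, *stats(test_t, test_e, cuts))
    return total
```

In use, `cv_score` would be evaluated over the whole grid of penalty values and the penalty with the largest score retained. Because the log-likelihood is additive over subjects, a `fit` that always returns the same vector makes `cv_score` equal to the full-sample log-likelihood, which gives a quick consistency check.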

The following criterion can be used as an alternative way to choose the penalty term. It is defined as a balance between good fit of the data and low complexity of the model parameters. It is fast to compute and has the following expression:

$$\mathrm{BIC}(\mathrm{pen}) = -2 L_n(\hat{a}_{\mathrm{pen}}) + d \log(n).$$

The parameter estimator $\hat{a}_{\mathrm{pen}}$ is defined as the maximizer of the penalized likelihood in Equation (3) while $d$ represents the model complexity. It is equal to the number of distinct consecutive values of the $a_l$'s in $\hat{a}_{\mathrm{pen}}$:

$$d = \sum_{l=0}^{L-1} I(\hat{a}_{l+1, \mathrm{pen}} - \hat{a}_{l, \mathrm{pen}} \neq 0),$$

with the convention $a_0 = 0$.
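A short sketch of the BIC computation is given below. The tolerance `tol` is an assumption introduced for the example: numerically, two consecutive estimates are rarely exactly equal, so the indicator is evaluated up to a small threshold. Function names are illustrative.

```python
import math

def model_dimension(a_hat, tol=1e-8):
    """d: number of changes between consecutive values, with a_0 = 0."""
    prev = 0.0                       # convention a_0 = 0
    d = 0
    for value in a_hat:
        if abs(value - prev) > tol:  # indicator of a nonzero difference
            d += 1
        prev = value
    return d

def log_likelihood(a, O, R):
    """Unpenalized log-likelihood of Equation (1) with alpha_l = exp(a_l)."""
    return sum(o * al - math.exp(al) * r for al, o, r in zip(a, O, R))

def bic(a_hat, O, R, n, tol=1e-8):
    """BIC(pen) = -2 L_n(a_hat) + d log(n)."""
    return (-2.0 * log_likelihood(a_hat, O, R)
            + model_dimension(a_hat, tol) * math.log(n))

# Hypothetical estimate with three distinct consecutive levels.
d = model_dimension([-3.0, -3.0, -2.0, -2.0, -1.5])
```

For the vector above, `d` equals 3: one for the first level (which differs from the convention value 0) and one for each of the two jumps.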

The performance in the choice of the penalty term by both criteria is investigated in the simulation study in Section 5.

In practice it is of interest to derive confidence intervals for marginal quantities directly related to the time to event variable such as the cumulative hazard function or the survival function. Asymptotic properties of the piecewise constant hazard model for a given set of cut points are straightforward and have already been derived (see for instance [

One way to take into account the uncertainty in the choice of the cut points is to use a resampling technique where for each sample a different penalty term is chosen from the cross-validated or BIC criterion. This will provide a new hazard estimate with a different set of cut points for each sample. Taking the adequate quantile at each time point allows us to obtain pointwise confidence intervals of the correct order for the quantity of interest.

Interestingly, this resampling technique also allows us to compute an alternative pointwise estimate of the survival function (or of any marginal distribution quantity) by taking the pointwise median over the bootstrap samples. This provides a very smooth estimate and, in that sense, this kind of estimate can be seen as a smooth non-parametric estimate of the survival function.

This bootstrap procedure is illustrated in Sections 5 and 6 to derive confidence intervals and smooth estimates for the survival function.
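The resampling scheme can be outlined in code as follows. This is a sketch under stated assumptions: `fit_survival` is a hypothetical callable that returns an estimated survival function from a sample. In the procedure described above it would rerun the adaptive ridge fit, including the choice of the penalty by BIC or cross-validation, on each resample; here a one-parameter exponential fit stands in for it so that the example is self-contained.

```python
import math
import random

def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def bootstrap_band(times, events, fit_survival, grid, B=200, level=0.95, seed=0):
    """Pointwise percentile band and pointwise median of bootstrap curves."""
    rng = random.Random(seed)
    n = len(times)
    curves = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        S = fit_survival([times[i] for i in idx], [events[i] for i in idx])
        curves.append([S(t) for t in grid])
    tail = (1.0 - level) / 2.0
    lower, median, upper = [], [], []
    for j in range(len(grid)):
        col = sorted(curve[j] for curve in curves)
        lower.append(quantile(col, tail))
        median.append(quantile(col, 0.5))
        upper.append(quantile(col, 1.0 - tail))
    return lower, median, upper

def fit_exponential(times, events):
    """Stand-in estimator (assumption): constant hazard, S(t) = exp(-rate * t)."""
    rate = sum(events) / sum(times)
    return lambda t: math.exp(-rate * t)

# Toy data (hypothetical).
times = [5.0, 8.0, 12.0, 3.0, 20.0, 15.0, 7.0, 9.0]
events = [1, 0, 1, 1, 0, 1, 1, 0]
grid = [0.0, 5.0, 10.0]
lower, median, upper = bootstrap_band(times, events, fit_exponential, grid, B=100)
```

The `median` curve plays the role of the smooth estimate mentioned above, and `lower` and `upper` give the pointwise confidence limits at each grid point.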

We illustrate the proposed method to estimate the following hazard function:

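$$\lambda(t) = \begin{cases} 0 & \text{for } t \in [0, 20], \\ 0.5 \times 10^{-2} & \text{for } t \in (20, 40], \\ 1 \times 10^{-2} & \text{for } t \in (40, 50], \\ 2 \times 10^{-2} & \text{for } t \in (50, 70], \\ 4 \times 10^{-2} & \text{for } t > 70. \end{cases}$$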

The censoring distribution is simulated as a uniform distribution over the time interval $[70, 90]$, which gives on average 62% of observed failures. On average, 9.5% of the observations fall into the interval $(20, 40]$, 8.5% of the observations fall into the interval $(40, 50]$, 27% of the observations fall into the interval $(50, 70]$ and 55% of the observations fall into the interval $(70, +\infty)$.
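Data from this scenario can be generated by inverting the cumulative hazard: if $E$ is a unit exponential variable, then $T^* = \Lambda^{-1}(E)$ has hazard $\lambda$. The sketch below is one possible way to do this in plain Python; the function names and the seed are illustrative assumptions.

```python
import math
import random

CUTS = [0.0, 20.0, 40.0, 50.0, 70.0, math.inf]
RATES = [0.0, 0.5e-2, 1e-2, 2e-2, 4e-2]   # hazard on each interval

def sample_event_time(rng):
    """Draw T* by solving Lambda(T*) = E with E ~ Exp(1)."""
    target = rng.expovariate(1.0)
    cum = 0.0                              # cumulative hazard at CUTS[l]
    for l, rate in enumerate(RATES):
        width = CUTS[l + 1] - CUTS[l]
        if rate > 0 and cum + rate * width >= target:
            return CUTS[l] + (target - cum) / rate
        cum += rate * width
    raise AssertionError("unreachable: the last interval is unbounded")

def simulate(n, seed=0):
    """Return observed times T = min(T*, C) and indicators I(T* <= C)."""
    rng = random.Random(seed)
    times, events = [], []
    for _ in range(n):
        t_star = sample_event_time(rng)
        c = rng.uniform(70.0, 90.0)        # uniform censoring on [70, 90]
        times.append(min(t_star, c))
        events.append(1 if t_star <= c else 0)
    return times, events

times, events = simulate(20000, seed=1)
observed_fraction = sum(events) / len(events)
```

With a large simulated sample, `observed_fraction` should be close to the 62% of observed failures quoted above, and no event occurs before time 20 since the hazard is zero there.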

We start with a single sample of size 100 generated from this model. Using the true cuts, the classical unpenalized hazard estimator derived from Equation (2) is computed on

these two situations. The set of all possible cuts was chosen as all the integer values ranging from 1 to 100 and the set of penalty terms was taken, on the log scale, as the set of 100 equally spaced values ranging from $\log(0.1)$ to $\log(1000)$. On this sample the BIC and cross-validation criteria respectively chose the penalty values equal to 0.95 and 1.15, which both gave the same estimate.

In order to assess the good performance of our penalized estimator, we also conducted Monte-Carlo simulations from the model scenario presented in this section with 600 sample replications. We considered samples of size 100, 400 and 1000 and in each case we computed the probability distribution of the number of cuts found by the BIC method and by the cross-validation method. The results are reported in

the sample size increases, the proportion of times the four breakpoints are found increases. Looking at the total-variation distance, we see that for both methods, the estimate becomes more and more accurate as the sample size increases. In general, the BIC criterion outperforms the cross-validation criterion both in terms of breakpoints detection and fitting of the hazard function.

One should note that the simulation scenario presented here makes it difficult to estimate the hazard function due to the low value of the hazard rates for $t < 70$. For a moderate sample size, $n = 100$ for instance, very few observations will fall in each cut interval (only 8.5% in the interval $(40, 50]$ for example) and therefore the method has difficulty finding all the cuts. The problem disappears as we increase the sample size. We considered other simulation settings where the proportion of observations falling into each cut interval was more balanced. This resulted in a very good performance of the estimator for small samples, both to detect the true number of cuts and to accurately fit the hazard function.

Proportions of the number of cuts found with the BIC criterion, by sample size $n$:

Number of cuts | n = 100 | n = 400 | n = 1000 |
---|---|---|---|
0 | 0.000 | 0.000 | 0.000 |
1 | 0.000 | 0.000 | 0.000 |
2 | 0.202 | 0.005 | 0.000 |
3 | 0.363 | 0.328 | 0.038 |
4 | 0.202 | 0.375 | 0.737 |
5+ | 0.233 | 0.292 | 0.225 |

Proportions of the number of cuts found with the cross-validation criterion, by sample size $n$:

Number of cuts | n = 100 | n = 400 | n = 1000 |
---|---|---|---|
0 | 0.000 | 0.000 | 0.000 |
1 | 0.075 | 0.000 | 0.000 |
2 | 0.338 | 0.032 | 0.000 |
3 | 0.323 | 0.280 | 0.045 |
4 | 0.105 | 0.352 | 0.615 |
5+ | 0.158 | 0.337 | 0.340 |

Total variation distance between the true hazard and the estimate, by sample size $n$:

Criterion | n = 100 | n = 400 | n = 1000 |
---|---|---|---|
BIC | 0.362 | 0.176 | 0.085 |
CV | 0.370 | 0.184 | 0.092 |

We now consider the following Weibull model, where this time the true hazard is a continuous function of time: $\lambda(t) = a (t/b)^{a-1} / b$, where $a = 5$ is the shape parameter and $b = 60$ is the scale parameter. This gives an average time value of 55 and a time standard deviation of 12.6. The censoring distribution is also simulated as a Weibull variable but with shape parameter equal to 30 and scale parameter equal to 60. This gives the same average percentage of observed failures (62%) as in the previous simulation setting.
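The mean and standard deviation quoted for the Weibull model follow from the standard moment formulas $\mathbb{E}[T^*] = b\, \Gamma(1 + 1/a)$ and $\mathrm{Var}(T^*) = b^2 \{ \Gamma(1 + 2/a) - \Gamma(1 + 1/a)^2 \}$. The short check below evaluates them numerically; the variable and function names are illustrative.

```python
import math

shape, scale = 5.0, 60.0   # a and b in the text

def weibull_hazard(t, a=shape, b=scale):
    """Hazard a (t / b)^(a - 1) / b of the Weibull model."""
    return a * (t / b) ** (a - 1) / b

mean = scale * math.gamma(1 + 1 / shape)
variance = scale ** 2 * (math.gamma(1 + 2 / shape)
                         - math.gamma(1 + 1 / shape) ** 2)
sd = math.sqrt(variance)
```

Both values agree with the rounded figures of 55 and 12.6 given above.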

As before we start with a single sample of size 100 generated from this model and we compute our adaptive ridge estimator using the same grid of cut points and the same grid of penalty values as in the previous scenario. The penalty value was chosen equal to 0.95 from the BIC criterion. Since we are estimating a continuous function of time, it is of interest to see how a smoother estimate would perform on this Weibull distribution. Our penalized likelihood can be easily modified to get a ridge estimate of the hazard by setting all the weights $w$ equal to 1 in Equation (3). This gives a simpler algorithm where the weights do not need to be updated and only a Newton-Raphson algorithm is performed on the parameter vector $a$. However, no simple criterion can be proposed to choose the penalty value in this setting and we arbitrarily chose a large value equal to 40 in order to force the estimator to be smooth. Plots of our adaptive ridge estimator, our ridge estimator and the true Weibull hazard are displayed in

Finally, Monte-Carlo experiments were conducted to assess the quality of fit of our estimators for the Weibull hazard function. This was measured as before in terms of total variation distance between the true hazard and the adaptive ridge or the ridge estimator on the time interval $[0, 60]$. As an illustration, on the sample example of size 100 of

We consider here the dataset from the Mayo Clinic trial in primary biliary cirrhosis (pbc) presented in [

Finally, the bootstrap procedure is used to derive the survival estimate with its 95% pointwise confidence interval for the time to death. The curves are displayed on

Estimator | Total variation distance | ||
---|---|---|---|
Adaptive Ridge | 0.347 | 0.228 | 0.172 |
Ridge | 0.204 | 0.115 | 0.086 |

95% confidence interval for the survival at this time is approximately $[0.71, 0.79]$.

In this article, we proposed an innovative method to estimate the hazard rate in a piecewise constant model. The estimator is defined as the maximizer of a penalized likelihood; it automatically detects the number and location of the cuts of the model and estimates the hazard on each cut interval. A bootstrap procedure was also proposed in order to derive valid statistical inference taking into account both the variability of the estimate and the variability in the choice of the cut points. In order to select the penalty term we recommend using the BIC criterion, as it seems to outperform the cross-validation criterion and is also very fast to compute. Finally, if one is interested in obtaining a smooth estimate of the hazard function, a small modification of the original estimator yields a ridge version which has been shown to provide a very good fit to continuous survival distributions.

This work was established in the nonparametric setting of right censored data but many extensions can be considered. Including covariates in the model through a Poisson regression modeling, for instance, should be straightforward. As a matter of fact, since the method uses a penalized likelihood approach, no explicit estimators are available and even in the nonparametric setting the estimator is derived from the Newton-Raphson algorithm. In the nonparametric and regression settings, by modifying the likelihood formula, the method should also readily extend to truncation and to other types of censoring such as interval censoring. More challenging, it would be interesting to see how the penalized likelihood approach works in a frailty, joint modeling or cure model context. Using the L0 approach in these contexts amounts to fitting a penalized parametric model, which makes our method very appealing due to the nice properties of parametric models. Besides, our resampling method makes it possible to derive smooth estimates of time dependent quantities of interest. As a result, our method combines the advantages of a parametric implementation with a nonparametric fit of survival quantities.

The L0 approach was used to constrain the hazard values on two consecutive cut intervals of the piecewise constant hazard model to be equal. Interestingly, a different model could be proposed where straight lines connect the consecutive cuts. In that model, the L0 approach could be derived by constraining two consecutive slopes of lines to be equal. In the same spirit, spline hazard functions could also be constructed by penalizing higher-order derivatives of polynomial functions. All these extensions are left to future research.

Bouaziz, O. and Nuel, G. (2017) L0 Regularization for the Estimation of Piecewise Constant Hazard Rates in Survival Analysis. Applied Mathematics, 8, 377-394. https://doi.org/10.4236/am.2017.83031