﻿Quantile Regression Based on Semi-Competing Risks Data

Open Journal of Statistics
Vol.3 No.1(2013), Article ID:27915,15 pages DOI:10.4236/ojs.2013.31003

Quantile Regression Based on Semi-Competing Risks Data

Jin-Jian Hsieh1, A. Adam Ding2, Weijing Wang3, Yu-Lin Chi1

1Department of Mathematics, National Chung Cheng University, Chia-Yi, Chinese Taipei

2Department of Mathematics, Northeastern University, Boston, USA

3Institute of Statistics, National Chiao-Tung University, Hsin-Chu, Chinese Taipei

Email: jjhsieh@math.ccu.edu.tw, ding@neu.edu, wjwang@stat.nctu.edu.tw

Received December 7, 2012; revised January 10, 2013; accepted January 22, 2013

Keywords: Copula Model; Dependent Censoring; Quantile Regression; Semi-Competing Risks Data

ABSTRACT

This paper considers quantile regression analysis based on semi-competing risks data in which a non-terminal event may be dependently censored by a terminal event. The major interest is the covariate effects on the quantile of the non-terminal event time. Dependent censoring is handled by assuming that the joint distribution of the two event times follows a parametric copula model with unspecified marginal distributions. The technique of inverse probability weighting (IPW) is adopted to adjust for the selection bias. Large-sample properties of the proposed estimator are derived and a model diagnostic procedure is developed to check the adequacy of the model assumption. Simulation results show that the proposed estimator performs well. For illustrative purposes, our method is applied to analyze the bone marrow transplant data in [1].

1. Introduction

Quantile regression analysis has received increasing attentions in the recent literature of survival analysis. Compared with conventional regression models such as the proportional hazards (PH) model or the accelerated failure time (AFT) model, quantile regression models provide direct assessment of the covariate effect on different quantiles of the failure time variable. This model also allows covariates to affect both location and shape of the distribution. Let T be the failure time of interest, be a vector and. Consider the following linear quantile regression model on, where is a known monotonic function, such that

(1)

where and is the th quantile of Y conditional on Z. Note that when we set, model (1) is equivalent to . Many papers for estimating without specifying the distribution of or have appeared in the literature. [2-5] considered quantile regression analysis under a fixed censoring mechanism in which all the censoring times are observed. Independent right censorship has been assumed by many papers including [6-11].

In this paper, we consider semi-competing risks data [12] in which the failure time of a non-terminal event T is subject to dependent censoring by a terminal event time D but not vice versa. Consider an example of bone marrow transplantation for leukemia patients described in [1] such that T is the time to leukemia relapse and D is the time to death. One important risk factor is the disease classification (i.e. ALL, AML low-risk, and AML highrisk) which was determined based on patient’s status at the time of transplantation. Here we assume that T, the time to a non-terminal event, follows model (1). Note that [13,14] also considered quantile regression analysis for competing risks data and left-truncated semi-competing risks data respectively. They defined the quantiles based on the crude quantity, namely the cumulative incidence function. In contrast, the proposed regression model (1) is defined based on the net quantity which is not identifiable without extra assumption on the dependence structure. There has been some controversy over which quantity should be used in presence of dependent competing risks. We believe that both quantities are important and not mutually exclusive as they provide information on different aspects of the data. Here measures the covariate effect on T after separating the potential influence from D. Such analysis is also useful in practical applications. For example, a covariate may prolong D so that increase but have no direct effect on the nonterminal event. The dependence between T and D complicates the estimation of. We will adopt a semi-parametric copula assumption to model their joint distribution and apply the technique of inverse probability weighting (IPW) to correct the bias due to dependent censoring in the estimation procedure. The association parameter in the copula model will also be estimated using existing methods.

The rest of this paper is organized as follows. In Section 2, we introduce the data structure and model assumptions. The proposed methodology for parameter estimation and model checking is presented in Section 3. The proofs of the asymptotic properties are given in the Appendix. Section 4 contains simulation results. In Section 5, we apply the proposed methods to analyze the bone marrow transplant data in [1] and in Section 6, we give some concluding remarks.

2. Data and Model Assumptions

Recall that T and D denote the time to a non-terminal event and the time to a terminal event respectively such that T is subject to censoring by D but not vice versa. In presence of additional external censoring due to drop-out or the end-of-study effect, one observes such that, , , , where is the minimum operator and is the indicator function. The covariate vectors can be denoted as and. The sample Contains which are random replications of. We will assume that and C are independent given Z. The covariate effect on T is specified by model (1) and the major objective is to estimate based on semi-competing risks data.

To handle dependent censoring, we have to make extra assumptions about the dependence structure between T and D in the upper wedge. According to [15] who extended Sklar’s theorem to the regression setting, we consider the following copula model

(2)

where, and are the marginal survival functions of T and D, given, and is a parametric copula function defined on the unit square. The association parameter in (2) is related to Kendall’s tau defined by

In particular, we will assume in the upper wedge follows a popular subclass of copula models, namely Archimedean copula (AC), in which the copula function can be further expressed as

(3)

where is a non-increasing convex function defined on with. Examples of Archimedean copula include Clayton’s copula with

and

;

and Frank’s copula with

and

.

3. The Proposed Inference Methods

Our major objective is to develop an inference method for estimation but, in the mean time, employ existing methods for estimating based on semicompeting risks data such as those proposed by [16] and [17].

3.1. Estimation of for Discrete Covariates

In absence of censoring, one can estimate by solving

Since is subject to censoring by, it follows that

where the reciprocal of the weight function is given by

The above derivations yield the following estimating function for

This is the so called inverse probability weighting technique for bias correction. Since needs to be estimated, it is natural to modify the estimating equation as

(4)

where the estimated components in the weight can be denoted as

Now we discuss estimation of the weight components. We will first address the situation that Z takes discrete values, and then briefly discuss possible modification for continuous covariates. Since is independent of T and D given Z, can be estimated by the Kaplan-Meier estimator based on data

or

with. We will utilize some analytic properties of the chosen AC model to derive an explicit expression of. Denote,

and

. It follows that

We suggest to estimate by applying the estimators in [17] for quantities in the right-hand side of the above expression. Specifically is the Kaplan-Meier estimator of based on

, where,

is the copula-graphic estimator

where the estimator is the root of the following estimating equation,

where, , ,

, , is a weight function, , and

where. Then

(5)

This estimator is then used in estimating Equation (4).

The Equation (4) may not be continuous so that an exact solution may not exist. Here we define as a generalized solution as in [13,18]. By the monotonic property of (4), the set of generalized solutions is convex. Using the arguments in [13], the solution of (4) can be reformulated as the minimizer of the following function,

where M is a large enough positive value to bound

and from above.

We suggest using a re-sampling approach for variance estimation since the analytic formula for the variance of is complicated to calculate. Based on the nonparametric bootstrap approach, we can sample replications from the original data. Given a bootstrap sample, we can compute. Repeating the re-sampling procedure B times, we obtain

and the variance of can be estimated by

where. Furthermore, we can construct the confidence interval for as

, where, and

is the cumulative distribution function of a standard normal random variable. The bootstrap percentile method suggests another way of constructing a confidence interval of with the formula

, where,

are the order statistics of for.

3.2. Asymptotic Properties for Discrete Covariates

We establish the uniform consistency and weak convergence of the proposed estimator for, a region that is identifiable. We first state the regularity conditions.

(C1) Denote the set of possible covariate Z values as which is a compact set in. The probability density function for covariate Z is uniformly bounded above and below on.

(C2) There exists a compact set in the parameter space for the copula parameter such that all true values of are interior points of for all.

(C3) There exists such that, , and

.

(C4) 1) is Lipschitz continuous for;

2) The density is bounded above uniformly for and; 3) The copula generator function has continuous derivatives,

, , and which do not equal 0 for all and.

(C5) eigmin, for some

and, where,

and for a vector u.

Condition C1 assumes the boundedness of covariates and is satisfied for finite discrete covariates. This assumption is only used to derive the asymptotic properties of for proving Theorem 1. Condition C2 assumes that the true value of is an interior point in the parameter space which is a common regularity condition. Condition C3 is assumed to simplify theoretical arguments similar to condition C1 in [13], and generally is the study end time in practical applications. Conditions C4 1) and 2) assume the smoothness of coefficient processes, and the uniform boundedness on the density of T, which are standard for quantile regression methods. Condition C4 3) imposes the smoothness requirement on the copula generator function similar to the regularity conditions in [17,19]. Condition C5 is similar to condition C4 in [13] which ensures the identifiability of and is needed for proving the consistency of.

Therefore with finite, we prove the following result.

Theorem 1 If conditions C1-C5 hold, then

and converges weakly to a meanzero Gaussian process.

The detailed proofs are presented in the Appendix.

3.3. Model Checking and Model Diagnosis

Motivated by the work of [20-22] in which complete data are considered, we define the residual quantities as

for and consider

where is a known bounded weight function. Similar to the arguments in [13,23], converges weakly to a zero-mean Gaussian process if model (1) is specified correctly and the covariate takes discrete values. Therefore we propose the following test statistic

where is an estimator of the standard deviation of which can be obtained by applying the bootstrap approach mentioned earlier. Thus, we have that Tn converges to the standard normal random variable asymptotically as the model is correct. On the other hand, when the model is mis-specified, Tn will deviate from zero. Accordingly we can reject the model assumption if, where is the quantile of and is the level of significance. If there are K candidate models under consideration, we compute the absolute value of Tn for each model for and choose the one with the smallest value.

3.4. Estimation for Continuous Covariates

We briefly discuss how to extend our estimation method for continuous covariates. One can apply a smoothing approach to estimate the probability functions conditional on z. Following [24], without loss of generality, assume that and are ordered. Let

where, is the bandwidth and is the kernel. Then

where are the rearrangement sorted according to Wi, where and

, and is the copula-graphic estimator in [24]

(6)

and solves estimating equation

Special techniques are needed to derive the asymptotic properties for the case of continuous covariates. For example properties of the smoothed versions of and are not fully available yet. The convergence rate for the normality proof may not be directly extended since the smooth version of may not be asymptotic normal. However the estimator for the quantile regression parameter may still be asymptotic normal even when some component converges at a slower rate.

4. Simulation Studies

We conduct simulation studies to examine the finitesample performance of the proposed methods with R software. Here we consider two cases. For the first one, we consider the model,

(7)

where and. We generate which follow the Clayton copula and Frank copula with marginally following so that, and D marginally following exp(2). For the second case, we consider

(8)

where, and generated from the Clayton copula and Frank copula with following and. In this case,. Three levels of association τ = 0.3, 0.5, 0.7 are considered. The censoring variable C follows a uniform distribution on.

We evaluate the performances for γ = 0.1, 0.3, 0.5 and the sample size n = 100 based on 400 simulation runs. To obtain the standard error of the proposed estimator, we use the bootstrap method with B = 50. Based on the settings, we also present a naive estimator of, which is constructed under the wrong assumption that T is independently censored by. That is, we estimate by solving the estimating Equation (4) with

Tables 1-4 report the average bias of the proposed

Table 1. Finite-sample results for estimating the quantile regression parameters under model (7) with Clayton copula.

The results are based on 400 simulation runs each with a sample size 100.

Table 2. Finite-sample results for estimating the quantile regression parameters under model (8) with Clayton copula.

The results are based on 400 simulation runs each with a sample size 100.

Table 3. Finite-sample results for estimating the quantile regression parameters under model (7) with Frank copula.

The results are based on 400 simulation runs each with a sample size 100.

Table 4. Finite-sample results for estimating the quantile regression parameters under model (8) with Frank copula.

The results are based on 400 simulation runs each with a sample size 100.

point estimator, , (Bias);

the empirical standard deviation,

where, (EmpSd); the mean squared error, Bias2 + EmpSd2, (MSE); and the coverage probability of the 95% confidence intervals,

where is the estimated standard deviation of by the bootstrap approach, (CP). From the results, we can see that our proposed estimator has much smaller bias and smaller mean squared error than the naive estimator. The confidence intervals coverage probabilities are close to the nominal level 95% in most cases while the naive estimator has the coverage rate far below the nominal level in many cases. Although the proposed estimator of has the coverage rate lower than 90% in the first case with Kendall’s tau τ = 0.7 but it still performs better than the naive estimator. As the sample size increases to n = 200 for that case (data omitted here), the coverage probabilities for proposed estimator become close to the nominal level while the coverage probabilities for the naive estimator get worse. This confirms that our estimator is asymptotically correct while the naive estimator is not.

Then we examine the proposed model diagnostic method when the true model is generated from

where, , and

so that and

follow Clayton copula with. We consider  τ = 0.3, 0.5, 0.7 and γ = 0.1, 0.3, 0.5 under n = 100 based on 200 replications.

Three forms of transformation are fitted: 1) ; 2); 3). Table 5

presents the rejection probabilitywhere α = 0.05, and the probability that the fitted model is selected as the one which gives the smallest value of among the three candidates. From the results, we see that when, the rejection probability (type-I error rate) is close to the specified level of α = 0.05. When the fitted model is wrong, the rejection probability (power of the test) is very high in most cases.

Table 5. Finite-sample results for the proposed model checking method.

Note: The sample size is 100 and replications are 200. “Power” =, where. “Selection rate” is the proportion that the fitted model is selected as the one giving the smallest value of among the three candidates.

Even for the case where the power is relatively low around 40% (the γ = 0.1 quantile for), the probabilities of selecting the correct model are still high.

5. Data Analysis

We apply the proposed methodology to analyze the bone marrow transplant data based on 137 leukemia patients provided by [1]. Patients were classified into three risk categories: ALL, AML low-risk, and AML high-risk based on their status at the time of transplantation. The covariates (Z1, Z2) are coded as ALL (Z1 = 1, Z2 = 0), AML low-risk (Z1 = 1, Z2 = 0), and AML high-risk (Z1 = 0, Z2 = 1). We want to investigate how the risk classification is related to the quantile of the relapse time. Specifically the fitted model is given by

(9)

The results are summarized in the Tables 6 and 7 based on B = 1000 bootstrap replications. Table 6 contains the estimators and model checking tests with . The p-value is the testing result by the model checking approach provided in SubSection 3.3. Since all the p-values are greater than 0.05, we adopt the model in (9) for further analysis.

From the analysis we see that patients of AML lowrisk had longer relapse time than those in the other two groups and the difference is more obvious for those with earlier relapse. For example, the 10% quantile of the relapse time in the AML low-risk group is 3.2964 times of that in ALL group and 4.751 times of that in AML highrisk group. The group differences are statistically significant for the 10% and 30% quantiles. but no longer significant for the 50% quantile.

6. Concluding Remarks

In this paper, we consider quantile regression analysis for analyzing the failure-time of a non-terminal event under the semi-competing risks setting. The Archimedean copula assumption is adopted to specify the dependency between the two correlated events. This assumption is utilized to calculate the weight for bias correction in the estimation of quantile regression parameters. Here we focus on the case of discrete covariates and derive the asymptotic properties of the proposed estimators. The bootstrap method is suggested for variance estimation. For checking the adequacy of the fitted model, a model diagnostic approach is proposed. Simulation results confirm that the proposed methods have good performances in finite samples. In the data analysis, we see that the risk classification is particularly influential for earlier relapse. The methodology can be extended to allow for continuous covariates by employing some smoothing techniques but the corresponding theoretical analysis is beyond the scope of the paper.

Table 6. Estimation of quantile regression parameters and model checking test based on the bone marrow transplant data.

Table 7. Comparison of leukemia relapse time for the three risk groups.

7. Acknowledgement

This paper was financially supported by the National Science Council of Taiwan (NSC100-2118-M-194-003- MY2).

REFERENCES

1. J. P. Klein and M. L. Moeschberger, “Survival Analysis: Techniques for Censored and Truncated Data,” 2nd Edition, Springer, New York, 2003.
2. J. Powell, “Least Absolute Deviations Estimation for the Censored Regression Model,” Journal of Econometrics, Vol. 25, No. 3, 1984, pp. 303-325. doi:10.1016/0304-4076(84)90004-6
3. J. Powell, “Censored Regression Qunatiles,” Journal of Econometrics, Vol. 32, No. 1, 1986, pp. 143-155. doi:10.1016/0304-4076(86)90016-3
4. B. Fitzenberger, “A Guide to Censored Quantile Regressions,” In: G. S. Maddala and C. R. Rao, Eds., Handbooks of Statistics, North-Holland, Amsterdam, 1997, pp. 405-437. doi:10.1016/S0169-7161(97)15017-9
5. M. Buchinsky and J. Hahn, “A Alternative Estimator for Censored Quantile Regression,” Econometrica, Vol. 66, No. 3, 1998, pp. 653-671. doi:10.2307/2998578
6. Z. Ying, S. H. Jung and L. J. Wei, “Survival Analysis with Median Regression Models,” Journal of the American Statistical Association, Vol. 90, No. 429, 1995, pp. 178-184. doi:10.1080/01621459.1995.10476500
7. S. Yang, “Censored Median Regression Using Weighted Empirical Survival and Hazard Functions,” Journal of the American Statistical Association, Vol. 94, No. 445, 1999, pp. 137-145. doi:10.1080/01621459.1999.10473830
8. S. Portnoy, “Censored Regression Quantiles,” Journal of the American Statistical Association, Vol. 98, No. 464, 2003, pp. 1001-1012. doi:10.1198/016214503000000954
9. L. Peng and Y. Huang, “Survival Analysis Based on Quantile Regression Models,” Journal of the American Statistical Association, Vol. 103, No. 482, 2008, pp. 637- 649. doi:10.1198/016214508000000355
10. G. Yin, D. Zeng and H. Li, “Power Transformed Linear Quantile Regression with Censored Data,” Journal of the American Statistical Association, Vol. 103, No. 483, 2008, pp. 1214-1224. doi:10.1198/016214508000000490
11. S. Portnoy and G. Lin, “Asymptotics for Censored Regression Quantiles,” Journal of Nonparametric Statistics, Vol. 22, No. 1, 2010, pp. 115-130. doi:10.1080/10485250903105009
12. J. P. Fine, H. Jiang and R. Chappell, “On Semi-Competing Risks Data,” Biometrika, Vol. 88, No. 4, 2001, pp. 907-919.
13. L. Peng and J. P. Fine, “Competing Risks Quantile Regression,” Journal of the American Statistical Association, Vol. 104, No. 488, 2009, pp. 1440-1453. doi:10.1198/jasa.2009.tm08228
14. R. Li and L. Peng, “Quantile Regression for Left-Truncated Semicompeting Risks Data,” Biometrics, Vol. 67 No. 3, 2011, pp. 701-710. doi:10.1111/j.1541-0420.2010.01521.x
15. A. J. Patton, “Modeling Asymmetric Exchange Rate Dependence,” International Economic Review, Vol. 47, No. 2, 2006, pp. 527-556. doi:10.1111/j.1468-2354.2006.00387.x
16. W. Wang, “Estimating the Association Parameter for Copula Models under Dependent Censoring,” Journal of the Royal Statistical Society, Series B, Vol. 65, No. 1, 2003, pp. 257-273. doi:10.1111/1467-9868.00385
17. L. Lakhal, L.-P. Rivest and B. Abdous, “Estimating Survival and Association in a Semicompeting Risks Model,” Biometrics, Vol. 64, No. 1, 2008, pp. 180-188. doi:10.1111/j.1541-0420.2007.00872.x
18. M. Fygenson and Y. Ritov, “Monotone Estimating Equations for Censored Data,” The Annals of Statistics, Vol. 22, No. 2, 1994, pp. 732-746. doi:10.1214/aos/1176325493
19. L.-P. Rivest and M. Wells, “A Martingale Approach to the Copula-Graphic Estimator for the Survival Function under Dependent Censoring,” Journal of Multivariate Analysis, Vol. 79, No. 1, 2001, pp. 138-155. doi:10.1006/jmva.2000.1959
20. J. X. Zheng, “A Consistent Nonparametric Regression Models under Quantile Restrictions,” Econometric Theory, Vol. 14, No. 1, 1998, pp. 123-138. doi:10.1017/S0266466698141051
21. J. L. Horowitz and V. G. Spokoiny, “An Adaptive, RateOptimal Test of Linearity for Median Regression Models,” Journal of the American Statistical Association, Vol. 97, No. 459, 2002, pp. 822-835. doi:10.1198/016214502388618627
22. X. He and L. Zhu, “A Lack-of-Fit Test for Quantile Regression,” Journal of the American Statistical Association, Vol. 98, No. 464, 2003, pp. 1013-1022. doi:10.1198/016214503000000963
23. D. Y. Lin and L. J. Wei and Z. Ying, “Checking the Cox Model with Cumulative Sums of Martingle-Based Residuals,” Biometrika, Vol. 80, No. 3, 1993, pp. 557-572. doi:10.1093/biomet/80.3.557
24. R. Braekers and N. Veraverbeke, “A Copula-Graphic Estimator for the Conditional Survival Function under Dependent Censoring,” The Canadian Journal of Statistics, Vol. 33, No. 3, 2005, pp. 429-447. doi:10.1002/cjs.5540330308

Appendix: Proofs of Theorem 1

The proof follow the outline similar to the proof in [13]. The technical details need to be adjusted for dependent censoring which make things harder.

Define

For simplicity, we use and to denote supremum taken over, and respectively.

First, we show that converges uniformly to

. Since is finite, as

. Hence by [17], is consistent for for all. Using this and conditions C2, it implies that with probability 1 for large enough n. From condition C3, and converge to

and uniformly for. Condition C4 (iii) together with conditions C1 and C3 ensure the uniform boundedness of the first two derivatives of and for and

, same as the regularity condition in [19]. Hence as, by [17] and [19], also converges to uniformly for. Then there exists a number such that and fall into with probability 1 for large enough n and all by condition C3 and the uniform convergence of the two estimators. Denote. Condition C4 (iii)

implies that, and

are all uniformly bounded above for and. Hence

converges to

uniformly for and. This result and the uniform convergence of imply the uniform convergence of for and. Hence we have

The function class

is Donsker because the class of indicator functions is Donsker and both and are uniformly bounded by conditions C1 and C3. Therefore, by Glivenko-Cantelli theorem, . Also

(1)

Then the consistency of comes from the identifiability condition C5 using the arguments in the proof of Theorem 1 in [13].

Similar to [13] the following lemma holds with the uniform boundedness of Z, and that comes from conditions C1, C3 (i) and C3 (ii).

Lemma 1. For any positive sequence satisfying,

Now, we provide the proof for the asymptotic normality of. One can write

From the uniform convergence and asymptotic weak convergence of and condition C3, we have for any r > 0. Hence the above quantity is dominated by the first term. By Lemma 1 and the uniform convergence of to,

where denotes asymptotic equivalence uniformly in. Applying Taylor expansions for at, and using the uniform convergence of to, we have

where. Since,

(2)

It remains to prove the weak convergence of to a zero-mean Gaussian process. It follows that

The first two terms (I) and (II) can be proved to converge to zero-mean Gaussian processes by applying the arguments in [13] as follows. The family

is Donsker by the Lipschitz continuity of (condition C4 (i)) and uniformly boundedness of and (conditions C1 and C3). Thus, the first term

converges weakly to a zero-mean Gaussian process.

Denote and as the at-risk processes at time t. Let

,

,

..

Then is a martingale.

From martingale representation theory for univariate independent censoring,

So the second term can be written as

where.

From uniform boundedness of, and, it is easy to show that

is Lipschitz in b. Then similarly can be shown to be Donsker, and the second term also converges weakly to a zero-mean Gaussian process.

For the third term, recall

with. Denote

,

and

.

So for,

For notation brevity, denote

,

and

.

The third term becomes

Since is finite, by condition C1 for all. So with follows a Gaussian distribution. Let

Now term is

which follows a Gaussian distribution by the boundedness of and.

Denote

,

.

Denote

where

.

.

Then

is a martingale. Then

Now similar to the martingale expression for the second term, we have term as

where

.

Similar as (II) above, is Lipschitz in b, and is Donsker. Thus the term (B) converges weakly to a zero-mean Gaussian process.

Finally, for term, let

be the crude hazard function in [17],

.

.

Then

is a martingale. From [17], we have

where

Similar to terms (A) and (B), we have term (C) equals

where

Then each of the three terms in (C) can be shown to converge to zero-mean Gaussian process similarly as above.

Summarizing the results above, converges weakly to a zero-mean Gaussian process. Hence (2) implies that converges weakly to a zero-mean Gaussian process for.