Open Journal of Statistics
Vol. 06, No. 06 (2016), Article ID: 73025, 17 pages
DOI: 10.4236/ojs.2016.66092
Imputation Based on Local Linear Regression for Nonmonotone Nonrespondents in Longitudinal Surveys
Sarah Pyeye1, Charles K. Syengo1, Leo Odongo2, George O. Orwa3, Romanus O. Odhiambo3
1Pan African University Institute for Basic Sciences, Technology and Innovation, Nairobi, Kenya
2Department of Statistics and Actuarial Science, Kenyatta University, Nairobi, Kenya
3Department of Statistics and Actuarial Science, Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya
Copyright © 2016 by authors and Scientific Research Publishing Inc.
This work is licensed under the Creative Commons Attribution International License (CC BY 4.0).
http://creativecommons.org/licenses/by/4.0/
Received: October 13, 2016; Accepted: December 20, 2016; Published: December 27, 2016
ABSTRACT
The study focuses on imputation for longitudinal survey data, which often suffer from nonignorable nonresponse. Local linear regression is used to impute the missing values and then to estimate the time-dependent finite population means. The asymptotic properties (unbiasedness and consistency) of the proposed estimator are investigated. Comparisons between different parametric and nonparametric estimators are performed based on the bootstrap standard deviation, the mean square error and the percentage relative bias. A simulation study is carried out to determine the best performing estimator of the time-dependent finite population means. The simulation results show that the local linear regression estimator yields good properties.
Keywords:
Longitudinal Survey, Nonmonotone, Nonresponse, Imputation, Nonparametric Regression
1. Introduction
Longitudinal surveys are sample surveys carried out repeatedly over time on the same sampled units. In such surveys, data that are rich in information about each sampled unit can be obtained, which makes them suitable for various purposes. While longitudinal surveys are regarded as better and more reliable for informing about various features of a study unit, they suffer from monotone and intermittent (nonmonotone) patterns of missing data. This is often a result of inaccessibility of respondents, or their deliberate refusal to provide information after having participated in the survey, leading to the occurrence of nonresponse.
Missing data are a problem because nearly all standard statistical methods presume complete information for all the variables included in the analysis. Using data with missing values leads to a reduction in sample size, which significantly affects the precision of confidence intervals, reduces statistical power and biases the population parameter estimates. Imputation is one of the approaches used to fill in these missing values. Over time, various imputation models have been developed and used to overcome quite a number of challenges caused by missing data. However, some shortcomings still exist, such as biasedness and inefficiency of the estimators, because imputation models have different assumptions in both the parametric and nonparametric contexts.
Parametric methods such as maximum likelihood estimation have limitations like sensitivity to model misspecification, while nonparametric methods are more robust and flexible [1]. Among the methods used by [2] are simple linear regression imputation and the Nadaraya-Watson technique. From their simulation results, it was found that the simple linear regression imputation approach has the weakness of producing biased estimates even when the responses at a particular time (including previous values) are correctly specified. On the other hand, the Nadaraya-Watson technique of [3] and [4], when used to impute missing values in longitudinal data, has the weaknesses of producing a large design bias and boundary effects that give unreliable estimates for inference.
As shown by [5] and [6], a rival of the Nadaraya-Watson technique is the local linear regression estimator, which was found to produce unbiased estimates without boundary effects. [7] studied the weighted Nadaraya-Watson method, examining its consistency, asymptotic normality and its behaviour at interior and boundary points. In that study, it was found that local linear regression is much better than the weighted Nadaraya-Watson method, as it produces asymptotically unbiased estimates without boundary effects. Moreover, [8] also found that the local linear regression estimator (introduced by [9]) has desirable properties.
In order to overcome the limitations of the Nadaraya-Watson estimator, we derive a local linear regression estimator for the imputation of nonrespondents in a longitudinal data set. The asymptotic properties (unbiasedness and consistency) of the proposed estimator are investigated. Comparisons between various estimators (parametric and nonparametric) are performed based on the bootstrap standard deviation, the mean square error and the percentage relative bias. A simulation study is conducted to determine the best performing estimator of the finite population mean.
2. Assumptions and Notations
1) All sampled units are observed at the first time point and remain in the sample until the final time point T. The variable of interest, $y_{it}$, is the value of y for the i-th unit at time point t.
2) The prediction process is last-value dependent, and the vectors $(y_{i1}, \ldots, y_{iT})$, $i = 1, \ldots, n$, are independently and identically distributed (i.i.d.) from the superpopulation under the model-assisted approach.
For $i = 1, \ldots, n$ and $t = 1, \ldots, T$, the response indicator function is
$\delta_{it} = \begin{cases} 1, & \text{if } y_{it} \text{ is observed} \\ 0, & \text{otherwise} \end{cases}$ (1)
3) The vector $(y_{i1}, \ldots, y_{iT})$ follows a Markov chain; for longitudinal survey data without missing values,
$P\left(y_{it} \mid y_{i,t-1}, \ldots, y_{i1}\right) = P\left(y_{it} \mid y_{i,t-1}\right), \quad t = 2, \ldots, T.$ (2)
4) We assume that the population P is divided into a fixed number of imputation classes, which are basically unions of some small strata.
3. Regularity Conditions
Denote by f the probability density function (pdf) of X, and let g be the function defined by
(3)
Both g and f have bounded second derivatives.
i) The kernel function K is a bounded, symmetric and twice continuously differentiable function on its support, such that $\int K(u)\,du = 1$, $\int u\,K(u)\,du = 0$ and $\int u^2 K(u)\,du < \infty$ (a numerical check of these conditions for the Gaussian kernel is sketched after this list).
ii) The regression function is at least twice continuously differentiable everywhere in a neighborhood of the point of estimation.
iii) The sample survey variable of interest has a finite second moment, bounded on the interval; thus $E(y^2) < \infty$.
iv) The conditional variance $\sigma^2(x) = \mathrm{Var}(y \mid X = x)$ is bounded and continuous.
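As a quick illustration of condition (i), the following R sketch (not part of the original paper) numerically checks that the Gaussian kernel used later in the simulation study satisfies the stated moment conditions.

```r
# Illustrative numerical check of condition (i) for the Gaussian kernel:
# it integrates to one, has zero first moment and a finite second moment.
integrate(function(u) dnorm(u), -Inf, Inf)$value        # approximately 1
integrate(function(u) u * dnorm(u), -Inf, Inf)$value    # approximately 0
integrate(function(u) u^2 * dnorm(u), -Inf, Inf)$value  # finite (approximately 1)
```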
4. Methodology
4.1. Imputation Process
Considering the case of the last observed value, we impute each missing value by the value obtained through a prediction procedure. According to [10], the joint distribution of a bivariate random vector (X, Y) is preserved when the missing value of Y is imputed from the conditional distribution of Y given X. We therefore adopt the conditional mean imputation approach for single imputation.
Let
$m(x) = E_{\xi}\left(y_{it} \mid y_{i,t-1} = x\right)$ (4)
be the conditional expectation, with respect to the superpopulation, of the unobserved value $y_{it}$ given the observed value $y_{i,t-1} = x$.
It is therefore clear that when the conditional expectation is known, the imputed value of the unobserved $y_{it}$ is given by its conditional mean. In cases where the conditional expectation in Equation (4) is unknown, for nonmonotone nonrespondents we employ the last-value-dependent mechanism.
Under assumption (2), we have
$E_{\xi}\left(y_{it} \mid y_{i1}, \ldots, y_{i,t-1}\right) = E_{\xi}\left(y_{it} \mid y_{i,t-1}\right) = m\left(y_{i,t-1}\right).$ (5)
Using Equation (4), we are limited to estimation by regressing the nonrespondents on the observed values of the longitudinal survey data; we therefore apply the equivalent Equation (5), which allows estimation using data from all subjects having both $y_{it}$ and $y_{i,t-1}$ observed. The imputation of the nonrespondents is then done using $m(\cdot)$ in Equation (5), and under the last-value-dependent assumption we are able to use auxiliary survey data in the regression fitting. According to [11], imputing nonresponses using (5) was done for the monotone case, and their approach is easy to apply if the conditional expectation in (4) has a linear relationship with x. Adopting the nonparametric approach of [12], the local linear estimator of $m(x)$ is used here. Let $y_{it}$ be the variable of interest for the i-th unit at time t, where $i = 1, \ldots, n$ and $t = 1, \ldots, T$. Associated with each unit are the known values of q auxiliary variables. To keep the notation simple, we suppress the index t and write the variables with a single subscript i, so that $y_{it}$ is written as $y_i$.
The regression imputation model is given by the relation
$y_i = m(x_i) + e_i,$ (6)
where the residuals $e_i$ are assumed to be independently normally distributed with mean zero and variance $\sigma^2$.
It is clear that
$E_{\xi}\left(y_i \mid x_i\right) = m(x_i)$ (7)
$\mathrm{Var}_{\xi}\left(y_i \mid x_i\right) = \sigma^2$ (8)
where $m(\cdot)$ is an unknown regression function which is a smooth function of x.
To obtain the estimator of $m(x)$ at a point $x_0$ together with its derivatives, we use weighted local polynomial fitting, assuming that the regression function and its derivatives at the point $x_0$ exist and are continuous.
We can rewrite the imputation model (6) as
$y_i \approx \sum_{k=0}^{j}\beta_k (x_i - x_0)^k + e_i, \qquad \beta_k = \frac{m^{(k)}(x_0)}{k!},$ (9)
where the approximation of $m(x_i)$ about $x_0$ follows from a Taylor series expansion.
The kernel weight is given as
$K_h(x_i - x_0) = \frac{1}{h} K\!\left(\frac{x_i - x_0}{h}\right),$ (10)
where h is the bandwidth, K is the kernel function, which is strictly positive and controls the weights, $x_0$ is the point of focus, $x_i$ is the covariate (the last observed value) at which the design matrix is centered, and j is the order of the local polynomial.
Let
$S_j = \sum_{i}\left\{y_i - \sum_{k=0}^{j}\beta_k (x_i - x_0)^k\right\}^2 K_h(x_i - x_0),$ (11)
where the sum runs over the units with both $y_i$ and $x_i$ observed. Accordingly, for $j = 0$, minimizing (11) gives
$\hat m_0(x_0) = \frac{\sum_{i} K_h(x_i - x_0)\, y_i}{\sum_{i} K_h(x_i - x_0)}.$ (12)
Equation (12) is the Nadaraya-Watson estimator.
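To make the local constant fit concrete, the short R sketch below computes the kernel weights of Equation (10) and the Nadaraya-Watson estimate of Equation (12); the function names are illustrative and not taken from the original study, and the pairs (x, y) are assumed to be the units with both the current and the last value observed.

```r
# Gaussian kernel weight K_h(u) = K(u / h) / h, as in Equation (10)
kernel_weight <- function(u, h) dnorm(u / h) / h

# Local constant (Nadaraya-Watson) estimate of m(x0), as in Equation (12)
nw_estimate <- function(x0, x, y, h) {
  w <- kernel_weight(x - x0, h)   # weights centered at the point of focus x0
  sum(w * y) / sum(w)             # weighted average of the observed responses
}
```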
With the estimator $\hat m_0$, the conditional expectation is used to impute the missing values, i.e.
(13)
where $w_i$ is the survey weight and
(14)
Similarly, for $j = 1$,
$S = \sum_{i}\left\{y_i - \beta_0 - \beta_1 (x_i - x_0)\right\}^2 K_h(x_i - x_0).$ (15)
Minimizing S in Equation (15) with respect to $\beta_0$ and $\beta_1$ and solving for them, we get
(16)
and
(17)
Defining
$S_{n,k}(x_0) = \sum_{i} K_h(x_i - x_0)\,(x_i - x_0)^k$ and $T_{n,k}(x_0) = \sum_{i} K_h(x_i - x_0)\,(x_i - x_0)^k\, y_i$, for $k = 0, 1, 2$, we have the following.
Substituting these definitions into Equation (17), we obtain
(18)
and
(19)
Similarly, substituting into Equation (16) gives
(20)
and
(21)
The local linear estimator for the regression function is now given by:
(22)
Substituting the expressions from Equation (20) and Equation (18) into Equation (22) gives
$\hat m_1(x_0) = \frac{T_{n,0}(x_0)\, S_{n,2}(x_0) - T_{n,1}(x_0)\, S_{n,1}(x_0)}{S_{n,0}(x_0)\, S_{n,2}(x_0) - S_{n,1}^2(x_0)}.$ (23)
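A minimal R sketch of the closed-form local linear fit in Equation (23) is shown below; again the function name is illustrative, and x and y denote the complete pairs used in the fitting.

```r
# Local linear estimate of m(x0), following the closed form in Equation (23)
ll_estimate <- function(x0, x, y, h) {
  w  <- dnorm((x - x0) / h) / h                        # kernel weights K_h(x_i - x0)
  d  <- x - x0
  S0 <- sum(w); S1 <- sum(w * d); S2 <- sum(w * d^2)   # S_{n,k}(x0)
  T0 <- sum(w * y); T1 <- sum(w * d * y)               # T_{n,k}(x0)
  (S2 * T0 - S1 * T1) / (S0 * S2 - S1^2)
}
```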
With the estimator $\hat m_1$, the conditional expectation is used to impute the missing values, i.e.
(24)
where $w_i$, the weight according to the survey design, is as defined earlier.
4.2. Estimation of the Finite Population Means Using the Imputed Data
In this study, we consider a finite population from which samples are drawn; before estimation of the population parameters, the imputation process is carried out. Suppose that the survey measurements are taken on the variables $y_1, \ldots, y_T$, and that a simple random sample without replacement, s, of size n is selected from a finite population P of size N. The sample consists of two parts, $s_r$ and $s_m$, where $s_r$ is the set of all respondents in the survey and $s_m$ is the set of all nonrespondents. The missing observations of the sample units $i \in s_m$ are considered. Imputation of the missing values for $i \in s_m$ is done, and a complete data set is produced which is then used in the estimation of the finite population means.
Let $\bar{Y}_t$ be the finite population mean at time point t, for $t = 1, \ldots, T$.
The value to be imputed for a nonrespondent is denoted by $\hat y_{it}$, so that the imputed data are given by
$\tilde y_{it} = \delta_{it}\, y_{it} + (1 - \delta_{it})\, \hat y_{it}.$ (25)
The mean of the finite population is given by $\bar{Y}_t = \frac{1}{N}\sum_{i \in P} y_{it}$.
Now, using the imputed data, the estimator of the finite population total is the sample total of the imputed data, given by
(26)
Thus, using the imputed data, the estimator of the finite population mean is the sample mean of the imputed data, given by
$\hat{\bar{Y}}_t = \frac{1}{n}\sum_{i \in s} \tilde y_{it}.$ (27)
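As an illustration of Equations (25) and (27), the R sketch below completes the data with the imputed values and takes the sample mean; the variable names are assumptions for the example, with y_hat holding a fitted value for every unit (used only where the response indicator is zero).

```r
# delta: response indicators (1 = observed), y: responses (NA where delta == 0),
# y_hat: fitted/imputed values of the same length as y.
imputed_mean <- function(y, y_hat, delta) {
  y_tilde <- ifelse(delta == 1, y, y_hat)   # completed data, as in Equation (25)
  mean(y_tilde)                             # sample mean of the imputed data, Equation (27)
}
```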
Assuming that for each unit,
(28)
for each time point t.
The imputed values are treated as if they were observed, such that both the observed and the imputed values are used in the estimation of the population mean.
The sample mean for the imputed data becomes
$\hat{\bar{Y}}_t = \frac{1}{n}\left(\sum_{i \in s_r} y_{it} + \sum_{i \in s_m} \hat y_{it}\right).$ (29)
Note that the same weight due to the sampling design is used in Equation (29) for all units in the sample.
(30)
for $t = 1, \ldots, T$.
Since t is treated as fixed, Equation (30) is re-written as
(31)
As in [12], the local constant estimate of the nonrespondent values in Equation (31) is obtained as:
(32)
and the local linear estimate of the nonrespondent values in Equation (31) is given by:
(33)
Clearly, the nonrespondent values in Equation (31) are substituted by Equation (32) or Equation (33) when the local constant estimator or the local linear regression estimator is used, respectively.
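As a sketch of how Equations (31) and (33) combine, the helper below reuses the illustrative ll_estimate function given earlier: each nonrespondent at time t is imputed from its last observed value by the local linear fit, and the completed data are then averaged. The bandwidth value is an arbitrary illustration, not the one used in the paper.

```r
# y_t: responses at time t (NA for nonrespondents), x: last observed values,
# delta: response indicators at time t, h: bandwidth (illustrative value).
ll_impute_mean <- function(y_t, x, delta, h = 0.5) {
  fit_x <- x[delta == 1]
  fit_y <- y_t[delta == 1]                          # complete pairs used for fitting
  y_hat <- vapply(x[delta == 0], ll_estimate, numeric(1),
                  x = fit_x, y = fit_y, h = h)      # Equation (33)
  mean(c(fit_y, y_hat))                             # estimator in Equation (31)
}
```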
5. Asymptotic Properties of the Estimator
In the derivation of the asymptotic properties, we use the set of regularity conditions. Following [12], the asymptotic theory is developed using the concept of a sequence of finite populations with strata. It is assumed that there is a sequence of finite populations and a corresponding sequence of samples, and that the finite population P is a member of this sequence. The sample size n and the population size N approach infinity along the sequence. The uniform response and the size of the nonrespondent set satisfy the stated condition. All limiting processes will be understood to be taken along this sequence, such that the regularity conditions are satisfied. For ease of notation, the sequence index will be omitted in the subsequent work.
Theorem 1. Assume that the regularity conditions (i)-(iv) and the assumptions in Section 2 hold. Then, under the regression imputation model (6), the estimator in Equation (31) is asymptotically unbiased and consistent for the population mean.
Proof. 1) Bias of the estimator.
The general formula for the finite population total is given by:
(34)
where s and its complement denote the sampled and the nonsampled sets, respectively.
Equation (34) can be decomposed as
(35)
For simplicity, shortened notation is used for these quantities throughout the remaining work.
From Equation (31), the estimator for the finite population total is given by
(36)
Now consider the difference,
(37)
(38)
Taking expectation on both sides of Equation (38), we have
(39)
Clearly, this term has zero expectation.
Now,
(40)
(41)
Under the stated conditions, the corresponding term in Equation (41) is negligible and hence,
(42)
But from Lemma 1 (see the Appendix),
(43)
where the notation is as defined in the Appendix.
Thus the bias of the estimator becomes
(44)
2) Variance of the estimator.
The variance of the estimator is given by the variance of the error term. That is,
(45)
(46)
(47)
(48)
Thus,
(49)
for sufficiently large n, with the remaining quantities as defined in Lemma 2 (see the Appendix).
3) Mean square error (MSE) of the estimator.
Finally, combining the bias and the variance ($\mathrm{MSE} = \mathrm{Bias}^2 + \mathrm{Variance}$), we have
(50)
(51)
which is the asymptotic expression of the MSE of the estimator. This expression tends to zero as $n \to \infty$ and $h \to 0$ such that $nh \to \infty$, and thus the estimator is consistent.
Consequently, the estimator is asymptotically unbiased and consistent.
6. Simulation Study
6.1. Description of Longitudinal Data
In this section, a study of the finite population mean estimators is carried out based on three measures of performance: the percentage relative bias (%RB), the MSE and the bootstrap standard deviation (bootstrap SD).
Simulations and computations of the finite population mean estimators were done in R (version 3.2.3, 2015-12-10), based on 1000 runs. For the local linear and local constant estimators, the Gaussian kernel with a fixed bandwidth was used. To fit the nonparametric regression, the loess function in R was used.
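As an illustration of this fitting step, the sketch below uses R's loess function, where degree = 1 gives a local linear fit and degree = 0 a local constant fit; the data frame complete_pairs and the vector x_missing are assumed names for the fitting pairs and the nonrespondents' last observed values, not objects from the original code.

```r
# Fit the nonparametric regression on the complete pairs and predict the missing values.
fit   <- loess(y ~ x, data = complete_pairs, degree = 1, family = "gaussian")
y_hat <- predict(fit, newdata = data.frame(x = x_missing))   # imputed values
```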
For comparison purposes, we used the complete data as our main reference in the evaluation of the performance of the estimators (the proposed local linear estimator, the local constant estimator and the simple linear regression estimator).
In this simulation study, a sample of size n was considered. The longitudinal data for each sampled unit cover T = 4 time points, that is, $y_i = (y_{i1}, y_{i2}, y_{i3}, y_{i4})$. This yields $2^3 = 8$ different response patterns for the longitudinal data, with respondent and nonrespondent values denoted by 1 and 0, respectively, at the different time points.
Longitudinal data were generated according to two models (an illustrative generation sketch follows this list):
1) In Model 1, $y_i$ is simulated from a multivariate normal distribution with means 1.33, 1.94, 2.73 and 3.67 for the 4 time points and a covariance matrix with standard deviation 1 and correlation coefficient 0.9 (the normal case of Table 2).
2) In Model 2, $y_i$ is simulated with the same means 1.33, 1.94, 2.73 and 3.67 for the 4 time points, standard deviation 1 and correlation coefficient 0.9, yielding the log-normal case of Table 3.
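A sketch of the Model 1 generation step is given below. The AR(1) form of the correlation matrix and the sample size n = 200 are assumptions made only for illustration, since the text above fixes just the means, the standard deviation (1) and the correlation coefficient (0.9).

```r
# Illustrative generation of the longitudinal responses for Model 1 (normal case).
library(MASS)

n     <- 200                                             # assumed sample size
mu    <- c(1.33, 1.94, 2.73, 3.67)                       # means at the 4 time points
rho   <- 0.9
Sigma <- outer(1:4, 1:4, function(s, t) rho^abs(s - t))  # assumed AR(1) correlation, sd = 1

y_norm <- mvrnorm(n = n, mu = mu, Sigma = Sigma)         # n x 4 matrix of responses
```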
In order to obtain the nonmonotone pattern in the simulated data, we used the predetermined unconditional probabilities of [13] shown in Table 1.
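To impose the nonmonotone nonresponse, each unit can be assigned one of the $2^3 = 8$ response patterns (the first time point always observed) with the unconditional probabilities of Table 1. Since the table values are not reproduced here, the sketch below uses placeholder probabilities and builds on y_norm from the previous sketch.

```r
# Rows of `patterns` are response-indicator patterns (delta_1, ..., delta_4);
# `p_pattern` stands in for the Table 1 probabilities.
patterns  <- as.matrix(expand.grid(1, 0:1, 0:1, 0:1))
p_pattern <- rep(1 / nrow(patterns), nrow(patterns))          # placeholder probabilities
pick      <- sample(nrow(patterns), nrow(y_norm), replace = TRUE, prob = p_pattern)
delta     <- patterns[pick, ]                                 # n x 4 response indicators
y_obs     <- ifelse(delta == 1, y_norm, NA)                   # data with nonresponse imposed
```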
6.2. Bootstrap Variance Estimation
The following steps were used to obtain the bootstrap variance.
1) We constructed a pseudo-population by replicating the sample of size n a total of 1500 times through the 1000 simulation runs.
Table 1. Probabilities of the nonresponse patterns.
2) A simple random sample of size 200 was drawn with replacement from the pseudo population.
3) We applied the simple linear regression, local constant and local linear regression imputation models to impute the missing values of the sample.
4) Steps 2 and 3 were repeated a large number of times, B, to obtain the bootstrap estimates $\hat{\bar{Y}}_t^{(b)}$, $b = 1, \ldots, B$, where $\hat{\bar{Y}}_t^{(b)}$ is the analogue of $\hat{\bar{Y}}_t$ for the b-th bootstrap sample.
5) The bootstrap variance of $\hat{\bar{Y}}_t$ is obtained by the formula
$v_B = \frac{1}{B - 1}\sum_{b=1}^{B}\left(\hat{\bar{Y}}_t^{(b)} - \hat{\bar{Y}}_t^{(\cdot)}\right)^2,$
where $\hat{\bar{Y}}_t^{(\cdot)} = \frac{1}{B}\sum_{b=1}^{B}\hat{\bar{Y}}_t^{(b)}$ is the mean of the bootstrap analogues of $\hat{\bar{Y}}_t$. A sketch of these steps in R is given below.
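A compact R sketch of steps 2)-5) follows. It reuses the illustrative ll_impute_mean helper from Section 4.2, takes B = 1000 as an assumed number of bootstrap replicates, and assumes a pseudo-population data frame pseudo_pop with columns y_t, x and delta.

```r
B          <- 1000                          # assumed number of bootstrap replicates
boot_means <- numeric(B)
for (b in seq_len(B)) {
  idx <- sample(nrow(pseudo_pop), size = 200, replace = TRUE)  # step 2: resample of size 200
  d   <- pseudo_pop[idx, ]
  boot_means[b] <- ll_impute_mean(d$y_t, d$x, d$delta)         # step 3: local linear imputation
}
v_boot <- var(boot_means)                   # step 5: bootstrap variance of the mean estimator
```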
6.3. Results and Discussion
The results of this simulation study are summarized in Table 2 and Table 3.
In terms of the percentage relative bias (%RB), at the first time point it can be seen that the local linear estimator has the smallest value, followed by the Nadaraya-Watson estimator and then the simple linear regression estimator, which has the largest %RB value.
At the second time point, observe that the simple linear regression estimator has the smallest %RB value compared to the local linear estimator, and the Nadaraya-Watson estimator performed worst, with the largest %RB. The %RB values of the local linear estimator and the simple linear regression estimator are much closer to zero than those of the other estimators.
At the third time point, observe that the local linear estimator has the smallest %RB value, followed by the simple linear regression estimator, and the Nadaraya-Watson estimator performed worst. In comparisons based on %RB with reference to the complete data, the local linear estimator has %RB values approaching zero.
In terms of MSE, at the first time point the Nadaraya-Watson estimator has the smallest value, followed by the local linear estimator and lastly the simple linear regression estimator, which has the largest value. At the second time point, the local linear estimator has the smallest MSE, followed by the simple linear regression estimator and lastly the Nadaraya-Watson estimator, which has the largest MSE value. At the third time point, the Nadaraya-Watson estimator has the smallest MSE, followed by the simple linear regression estimator and lastly the local linear estimator, which has the largest MSE value.
Table 2. Simulated results for mean estimation (normal case).
Table 3. Simulated results for mean estimation (log-normal case).
In terms of the bootstrap standard deviation, it can be seen that the local linear estimator performs best at all three time points, where its values are even lower than those of the complete data, implying that the results obtained with the local linear estimator are the best. The simple linear regression and Nadaraya-Watson estimators compete interchangeably in terms of performance on the bootstrap samples.
Turning to the log-normal case, in terms of the percentage relative bias (%RB), at the first two time points observe that the simple linear regression estimator has the smallest %RB values, followed by the local linear estimator, and the Nadaraya-Watson estimator has the largest %RB values. Based on these results, it is reasonable to choose the local linear estimator, which handles both linear and nonlinear models, as the best estimator. At the third time point, observe that the local linear estimator has the smallest %RB value, followed by the simple linear regression estimator and lastly the Nadaraya-Watson estimator. This implies that, at this time point, the local linear estimator has the smallest bias, close to zero as for the complete data, and is hence the best estimator compared to the others.
In terms of the MSE, at the first two time points the Nadaraya-Watson estimator has the smallest values, followed by the simple linear regression estimator and lastly the local linear estimator, which has the largest MSE values. At the third time point, the local linear estimator has the smallest value, implying that it performed well at that time point.
In terms of the bootstrap standard deviation, observe from Table 3 that the local linear estimator performs best at all three time points, since it has the smallest bootstrap standard deviations, and these values are even smaller than those of the complete data as time increases.
From Table 3, it can be seen that the bootstrap standard deviations of the local linear estimator are closer to those of the Nadaraya-Watson estimator than to those of the simple linear regression estimator.
7. Conclusion
Generally, nonresponse in any survey data set has a significant impact on the bias and the variance of the estimators; therefore, before using such data in statistical inference, imputation with an appropriate technique ought to be done. In this study, the main objective was to obtain an imputation method based on local linear regression for nonmonotone nonrespondents in longitudinal surveys and to determine its asymptotic properties. Comparing the parametric and nonparametric methods, the nonparametric methods performed better than the parametric ones; this was evident from the MSE and %RB values in both the normal and log-normal data. Among the nonparametric methods, the local linear estimator was the best, as it behaved better than the Nadaraya-Watson estimator in terms of %RB. In terms of the bootstrap standard deviation, the local linear estimator performs best at all three time points, since it has the smallest bootstrap standard deviations for the two data sets. Generally, the local linear estimator performs relatively well, in particular for the normal data. We conclude that the use of nonparametric estimators seems plausible in both theoretical and practical scenarios.
Acknowledgements
We wish to thank the African Union Commission for fully funding this research.
Cite this paper
Pyeye, S., Syengo, C.K., Odongo, L., Orwa, G.O. and Odhiambo, R.O. (2016) Imputation Based on Local Linear Regression for Nonmonotone Nonrespondents in Longitudinal Surveys. Open Journal of Statistics, 6, 1138-1154. http://dx.doi.org/10.4236/ojs.2016.66092
References
- 1. Dorfman, A.H. (1992) Nonparametric Regression for Estimating Totals in Finite Populations. Proceedings of the Section on Survey Research Methods, American Statistical Association, Alexandria, VA, 622-625.
- 2. Xu, J., Shao, J., Palta, M. and Wang, L. (2008) Imputation for Nonmonotone Last-Value-Dependent Nonrespondents in Longitudinal Surveys. Survey Methodology, 34, 153-162.
- 3. Nadaraya, E.A. (1964) On Estimating Regression. Theory of Probability and Its Applications, 9, 141-142.
- 4. Watson, G.S. (1964) Smooth Regression Analysis. Sankhyā: The Indian Journal of Statistics, Series A, 26, 359-372.
- 5. Hastie, T.J. and Loader, C. (1993) Local Regression: Automatic Kernel Carpentry (with Discussion). Statistical Science, 8, 120-143.
- 6. Wand, M.P. and Jones, M.C. (1995) Kernel Smoothing. Chapman & Hall, London.
- 7. Cai, Z. (2001) Weighted Nadaraya-Watson Regression Estimation. Statistics & Probability Letters, 51, 307-318.
- 8. Fan, J. and Gijbels, I. (1996) Local Polynomial Modelling and Its Applications. Chapman and Hall, London.
- 9. Stone, C.J. (1977) Consistent Nonparametric Regression. The Annals of Statistics, 5, 595-645.
- 10. Rubin, D.B. (1987) Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, Inc., New York.
https://doi.org/10.1002/9780470316696
- 11. Paik, M.C. (1997) The Generalized Estimating Equation Approach When Data Are Not Missing Completely at Random. Journal of the American Statistical Association, 92, 1320-1329.
- 12. Cheng, P.E. (1994) Nonparametric Estimation of Mean Functionals with Data Missing at Random. Journal of the American Statistical Association, 89, 81-87.
- 13. Shao, J., Klein, M. and Xu, J. (2012) Imputation for Nonmonotone Nonresponse in the Survey of Industrial Research and Development. Survey Methodology, 38, 143-155.
- 14. Masry, E. (1996) Multivariate Local Polynomial Regression for Time Series: Uniform Strong Consistency and Rates. Journal of Time Series Analysis, 17, 571-599.
Appendix
LEMMA 1. The bias of the local linear estimator is given by
(52)
Under the regularity conditions in Section 3, this bias tends to zero as $h \to 0$ and $nh \to \infty$.
PROOF OF LEMMA 1.
Proof. From Equation (23),
(53)
where the weights and the quantities $S_{n,k}$ and $T_{n,k}$ are as defined in Section 4.1.
The expectation of the estimator is given by
(54)
(55)
The bias of the estimator is therefore given by
(56)
For fixed design points $x_i$ on the interval, the stated approximation holds almost everywhere; see [14].
Now,
1)
2)
3)
4)
Equation (56) becomes
(57)
With an appropriate substitution, the bias of the estimator can be re-written as
(58)
and hence the result.
LEMMA 2. The asymptotic expression of the variance of the local linear estimator is given by
(59)
as $h \to 0$ and $nh \to \infty$, where the notation is as defined in the proof below.
PROOF OF LEMMA 2.
Proof. Using Equation (23),
(60)
by the imputation model assumptions.
It follows that
(61)
where
(62)
and the remainder term is negligible as the sample size increases.
Thus,
(63)
The asymptotic expression of the variance of the estimator becomes
(64)
where the remaining quantities are as defined above. Hence the result.
MSE of the local linear estimator
From Lemmas 1 and 2, the MSE of the estimator becomes
(65)