
This study focuses on imputation for longitudinal survey data, which often contain nonignorable nonrespondents. Local linear regression is used to impute the missing values, and the time-dependent finite population means are then estimated. The asymptotic properties (unbiasedness and consistency) of the proposed estimator are investigated. Parametric and nonparametric estimators are compared on the basis of the bootstrap standard deviation, mean square error and percentage relative bias. A simulation study is carried out to determine the best performing estimator of the time-dependent finite population means. The simulation results show that the local linear regression estimator has good properties.

Longitudinal surveys are sample surveys conducted repeatedly over time on the same sampled units. Such surveys yield data that are rich in information about each sampled unit and are therefore suitable for various purposes. Although longitudinal surveys are regarded as better and more reliable sources of information about the features of a study unit, they suffer from monotone and intermittent patterns of missing data. Nonresponse often arises because respondents become inaccessible or deliberately refuse to provide information after having participated in the survey.

Missing data are a problem because nearly all standard statistical methods presume complete information for all the variables included in the analysis. Using data with missing values reduces the sample size, which lowers the precision of confidence intervals, reduces statistical power and biases population parameter estimates. Imputation is one of the approaches used to fill in these missing values. Over time, various imputation models have been developed and used to overcome many of the challenges caused by missing data. However, some shortcomings remain, such as bias and inefficiency of the estimators, because imputation models rest on different assumptions in the parametric and nonparametric contexts.

Parametric methods like maximum likelihood estimation have limitations like sensitivity to model misspecification while nonparametric methods are more robust and flexible [

As shown by [

In order to overcome the limitations of the Nadaraya-Watson estimator, we derive a local linear regression estimator for imputing the nonrespondents in a longitudinal data set. The asymptotic properties (unbiasedness and consistency) of the proposed estimator are investigated. Parametric and nonparametric estimators are compared on the basis of the bootstrap standard deviation, mean square error and percentage relative bias. A simulation study is conducted to determine the best performing estimator of the finite population mean.

1) All sampled units are observed on the first time point

2) The prediction process is past last value dependent and the vectors

For

3) The vector

4) We assume that the population P is divided into a fixed number of imputation classes, which are basically unions of some small strata.

Denote f to be a probability density function (pdf) of X and

and g and f have bounded second derivatives

i) The Kernel function K is a bounded and twice continuously differentiable symmetric function on the interval

ii) The regression function

iii) The sample survey variable of interest has a finite second moment bounded on the interval

iv) The conditional variance

Considering the case of the last past value, we do impute for missing value

Let

be the conditional expectation with respect to the superpopulation for unobserved value

It is therefore clear that when

Under assumption (2), we have

Using Equation (4), we are limited to do estimation by regressing the nonrespondents

The regression imputation model

such that

It is clear that

where

To obtain the estimator of

We can rewrite the imputation model (6) as

where approximation of

The kernel weight given as

where h is the bandwidth and K is the kernel function which should be strictly positive and

Let

Accordingly, for

Equation (12) is the Nadaraya-Watson estimator.
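As a minimal illustration of the local constant (Nadaraya-Watson) fit, the sketch below computes a kernel-weighted average of respondent values. It assumes a Gaussian kernel and uses hypothetical names (`gaussian_kernel`, `nadaraya_watson`, `xs`, `ys`) and toy data that do not come from the paper; `xs` stands for the last observed past values of the respondents and `ys` for their current values.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel K(u); strictly positive and symmetric."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nadaraya_watson(x0, xs, ys, h):
    """Local constant estimate of E[Y | X = x0] with bandwidth h."""
    weights = [gaussian_kernel((x - x0) / h) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights are zero; increase the bandwidth h")
    return sum(w * y for w, y in zip(weights, ys)) / total


# Hypothetical respondent data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

interior = nadaraya_watson(2.0, xs, ys, h=1.0)  # design is symmetric around 2
boundary = nadaraya_watson(0.0, xs, ys, h=1.0)  # neighbours exist on one side only
print(interior, boundary)
```

On this toy data the interior estimate recovers the line, while the estimate at the left boundary is pulled above the true value 1 because all neighbouring points lie on one side. This boundary behaviour is the kind of limitation that motivates the local linear fit.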

With estimator the

where

Similarly for

Minimizing S with respect to

and

Defining:

Using

and with

Similarly, using

and with

The local linear estimator for the regression function

Substituting for

With estimator,

where
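The weighted least squares problem behind the local linear estimator can be sketched in closed form. The code below is an illustration under stated assumptions, not the authors' implementation: it assumes a Gaussian kernel, and the names `local_linear`, `s0`, `s1`, `s2`, `t0`, `t1` are introduced here for the kernel-weighted moment sums and need not match the paper's notation.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def local_linear(x0, xs, ys, h):
    """Local linear estimate of the regression function at x0.

    Minimises sum_i K((x_i - x0) / h) * (y_i - a - b * (x_i - x0))**2
    over (a, b) and returns the fitted intercept a.
    """
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        d = x - x0
        w = gaussian_kernel(d / h)
        s0 += w            # sum of weights
        s1 += w * d        # first weighted moment of (x - x0)
        s2 += w * d * d    # second weighted moment of (x - x0)
        t0 += w * y
        t1 += w * d * y
    denom = s0 * s2 - s1 * s1
    if abs(denom) < 1e-12:
        raise ValueError("design is degenerate near x0; increase the bandwidth h")
    return (s2 * t0 - s1 * t1) / denom


# Hypothetical respondent data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
print(local_linear(0.0, xs, ys, h=1.0))  # close to 1 even at the boundary
```

Because the fit includes a slope term, it reproduces linear regression functions exactly, including at the boundary of the design points, which a local constant fit does not.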

In this study, we consider a finite population from which samples are drawn. Before the population parameters are estimated, the missing values are imputed. Suppose that the survey measurements are

Let

The value to be imputed for the non respondent is denoted by

The mean of the finite population is given by

Now, using the imputed data, the estimator of the finite population total is the sample total of the imputed data denoted by

Thus, using the imputed data, the estimator of the finite population mean is the sample mean of the imputed data denoted by

Assuming that for each

for each

The imputed values are treated as if they were observed such that both observed and the imputed are used in the estimation of the population mean:

Sample mean for the imputed data becomes

Note that the same weight due to sampling design is used in Equation (29) for all units in the sample.
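The equal-weight sample mean of the imputed data can be illustrated as follows. This is a small sketch with hypothetical names (`imputed_sample_mean`, `observed`, `imputed`) and toy numbers; missing observations are marked with `None`, and each imputed value is assumed to have been produced beforehand by one of the imputation models.

```python
def imputed_sample_mean(observed, imputed):
    """Mean of the completed sample at one time point.

    observed[i] is the recorded value, or None for a nonrespondent;
    imputed[i] is the value imputed for unit i (used only when observed[i] is None).
    Every unit receives the same weight 1/n.
    """
    completed = [y if y is not None else y_star
                 for y, y_star in zip(observed, imputed)]
    return sum(completed) / len(completed)


# Hypothetical sample of four units, two of them nonrespondents.
observed = [2.0, None, 4.0, None]
imputed = [None, 3.0, None, 5.0]
print(imputed_sample_mean(observed, imputed))  # 3.5
```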

for

Since t is used as a constant variable, Equation (30) is re-written as

As for [

and the local linear estimation for the nonrespondents,

Clearly,

In the derivation of the asymptotic properties, we use the set of regularity conditions. According to [

dents set

stood as

Theorem 1. Assume that the regularity conditions (i)-(iv) and the assumptions in Section 2 hold. Then, under the regression imputation model

Proof. 1) Bias of

The general formula for the finite population total is given by:

where

Equation (34) can be decomposed as

For simplicity, denote

From Equation (31), the estimator for the finite population total is given by

Now consider the difference,

Taking expectation on both sides of Equation (38), we have

Clearly,

Now,

Assuming

But from Lemma 1 (see Appendix),

where

Thus the bias of

2) Variance of

The variance of

Thus,

for sufficiently large n such that

3) Mean square error (MSE) of

Finally, we have

which is the asymptotic expression of the MSE for

Consequently,

In this section, the finite population mean estimators are studied on the basis of the following measures of performance: percentage relative bias (%RB), MSE and bootstrap standard deviation (SD bootstrap).
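For concreteness, the first two performance measures can be computed from a set of simulated estimates as sketched below. The exact formulas used in the study are not reproduced in this text, so the definitions here are assumptions: %RB is taken as 100 times the average deviation from the reference value divided by the reference value, and MSE as the average squared deviation. The function names and numbers are hypothetical.

```python
def percent_relative_bias(estimates, reference):
    """Assumed definition: 100 * (mean(estimates) - reference) / reference."""
    mean_estimate = sum(estimates) / len(estimates)
    return 100.0 * (mean_estimate - reference) / reference


def mean_square_error(estimates, reference):
    """Assumed definition: average of (estimate - reference)**2 over the runs."""
    return sum((e - reference) ** 2 for e in estimates) / len(estimates)


# Hypothetical estimates from four simulation runs, reference value 10.
estimates = [9.0, 11.0, 10.0, 12.0]
print(percent_relative_bias(estimates, 10.0))  # 5.0
print(mean_square_error(estimates, 10.0))      # 1.5
```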

Simulations and computations of the finite population mean estimators were done using R (R version 3.2.3 (2015-12-10)) based on 1000 runs. For the local linear and local constant estimators, the Gaussian kernel with a fixed bandwidth of

For comparison purposes, we used the complete data as the main reference in evaluating the performance of the estimators (the proposed local linear estimator, the local constant estimator and the simple linear regression estimator).

In this simulation study, a sample of size $n$ was used with $2^{3} = 8$ different patterns of the longitudinal data, with respondent and nonrespondent values denoted by 1 and 0, respectively, at the different time points.

Longitudinal data was generated according to two models:

1) In model 1, simulation of

2) In model 2, simulation of

In order to obtain the nonmonotone pattern in the simulated data, we used the predetermined unconditional probabilities of [
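Assigning nonresponse patterns with predetermined unconditional probabilities can be sketched as below. The probabilities are those listed in the table of nonresponse patterns; the function names (`draw_patterns`, `apply_pattern`), the seed and the example row are hypothetical, and Python's `random.choices` stands in for whatever sampling routine was used in R.

```python
import random

# Response indicators over four time points (1 = respondent, 0 = nonrespondent)
# with the unconditional probabilities from the nonresponse pattern table.
PATTERN_PROBS = {
    (1, 0, 0, 0): 0.062,
    (1, 1, 0, 0): 0.043,
    (1, 1, 1, 0): 0.076,
    (1, 0, 0, 1): 0.113,
    (1, 0, 1, 0): 0.071,
    (1, 0, 1, 1): 0.186,
    (1, 1, 0, 1): 0.124,
    (1, 1, 1, 1): 0.325,
}


def draw_patterns(n_units, seed=0):
    """Draw one response pattern per sampled unit."""
    rng = random.Random(seed)
    patterns = list(PATTERN_PROBS)
    weights = list(PATTERN_PROBS.values())
    return rng.choices(patterns, weights=weights, k=n_units)


def apply_pattern(complete_row, pattern):
    """Mask the values of one unit; None marks a missing observation."""
    return [y if r == 1 else None for y, r in zip(complete_row, pattern)]


patterns = draw_patterns(1000)
masked = apply_pattern([1.3, 1.9, 2.7, 3.7], (1, 0, 1, 1))
print(masked)  # [1.3, None, 2.7, 3.7]
```

Every pattern starts with 1, which matches the assumption that all sampled units are observed at the first time point.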

The following steps were used to obtain the bootstrap variance.

1) We constructed a pseudo population by replicating the sample of size 1500 times through 1000 simulation runs.

| Pattern type | Nonresponse pattern | Probability (normal/log-normal data) | Total probability |
|---|---|---|---|
| Monotone | 1 0 0 0 | 0.062 | 0.181 |
| | 1 1 0 0 | 0.043 | |
| | 1 1 1 0 | 0.076 | |
| Nonmonotone | 1 0 0 1 | 0.113 | 0.494 |
| | 1 0 1 0 | 0.071 | |
| | 1 0 1 1 | 0.186 | |
| | 1 1 0 1 | 0.124 | |
| Complete data | 1 1 1 1 | 0.325 | 0.325 |

2) A simple random sample of size 200 was drawn with replacement from the pseudo population.

3) We applied the simple linear regression, local constant and local linear regression imputation models to impute the missing

4) Repeating the steps 2 and 3 for a large number of times (

5) Obtain the bootstrap variance of
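The bootstrap steps above can be sketched as follows. This is an illustration under stated assumptions rather than the study's R code: the names (`bootstrap_sd`, `replicate`, `n_draw`, `n_boot`) and default values are hypothetical, the `estimator` argument stands for "impute the missing values, then take the sample mean", and the divisor B - 1 in the standard deviation is an assumption.

```python
import random
import statistics


def bootstrap_sd(sample, estimator, n_draw, n_boot, replicate=10, seed=0):
    """Bootstrap standard deviation of an estimator.

    1) Build a pseudo population by replicating the sample.
    2) Draw a simple random sample with replacement from it.
    3) Apply the estimator (imputation followed by the sample mean).
    4) Repeat steps 2 and 3 n_boot times.
    5) Return the standard deviation of the replicated estimates.
    """
    rng = random.Random(seed)
    pseudo_population = list(sample) * replicate
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(pseudo_population, k=n_draw)
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)  # divisor n_boot - 1 (assumed)


def sample_mean(values):
    """Stand-in estimator for complete data; a real run would impute first."""
    return sum(values) / len(values)


sd = bootstrap_sd([1.0, 2.0, 3.0, 4.0], sample_mean, n_draw=20, n_boot=200)
print(sd)
```

The bootstrap variance is the square of the returned value.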

The results of this simulation study are summarized in

In terms of the percentage relative bias (%RB), at time point

At time point

At time point

In terms of MSE, at time point

| Method | Quantity | Time point 1 | Time point 2 | Time point 3 | Time point 4 |
|---|---|---|---|---|---|
| Complete data | Mean | 1.328918 | 1.939003 | 2.729671 | 3.66934 |
| | Standard deviation | 1.000342 | 1.000168 | 0.9997156 | 1.000435 |
| | %RB | 0.0 | 0.0 | 0.0 | 0.0 |
| | MSE | 1.001018 | 1.000666 | 0.9997697 | 1.001196 |
| | SD bootstrap | 0.6667591 | 0.6666357 | 0.6666357 | 0.6675065 |
| Local Linear Regression | Mean | | 1.938469 | 2.729698 | 3.669843 |
| | Standard deviation | | 0.9948414 | 0.9926485 | 0.9932463 |
| | %RB | | 0.0003101247 | 0.004607907 | 0.003463886 |
| | MSE | | 0.9900532 | 0.9857052 | 0.9868784 |
| | SD bootstrap | | 0.6606914 | 0.6600272 | 0.6597972 |
| Nadaraya-Watson | Mean | | 1.938513 | 2.688752 | 3.658198 |
| | Standard deviation | | 0.9804571 | 0.995685 | 0.9812671 |
| | %RB | | 0.002618051 | −1.49819 | −0.3079823 |
| | MSE | | 0.9616402 | 0.9934076 | 0.963356 |
| | SD bootstrap | | 0.9807754 | 0.9967448 | 0.9815455 |
| Simple linear regression | Mean | | 1.939073 | 2.729775 | 3.669467 |
| | Standard deviation | | 0.9952188 | 0.9928367 | 0.9926948 |
| | %RB | | 0.003486382 | 0.003859931 | 0.003474327 |
| | MSE | | 0.9908072 | 0.9860761 | 0.9857896 |
| | SD bootstrap | | 0.9952162 | 0.9938139 | 0.993223 |

| Method | Quantity | Time point 1 | Time point 2 | Time point 3 | Time point 4 |
|---|---|---|---|---|---|
| Complete data | Mean | 1.330963 | 1.94061 | 2.731046 | 3.671122 |
| | Standard deviation | 1.000228 | 0.9999145 | 0.9998701 | 1.000415 |
| | %RB | 0.0 | 0.0 | 0.0 | 0.0 |
| | MSE | 1.000779 | 1.000138 | 1.000068 | 1.001156 |
| | SD bootstrap | 0.6658951 | 0.6659541 | 0.6659541 | 0.6662738 |
| Local Linear Regression | Mean | | 1.940391 | 2.731393 | 3.671548 |
| | Standard deviation | | 0.9950302 | 0.9927199 | 0.9925087 |
| | %RB | | −0.006115805 | 0.001946422 | 0.003121577 |
| | MSE | | 0.9904082 | 0.9858397 | 0.9854251 |
| | SD bootstrap | | 0.6588623 | 0.655473 | 0.658257 |
| Nadaraya-Watson | Mean | | 1.940298 | 2.689957 | 3.660124 |
| | Standard deviation | | 0.9806438 | 0.9958007 | 0.9805938 |
| | %RB | | −0.0109425 | −1.506794 | −0.3052104 |
| | MSE | | 0.9619855 | 0.9936533 | 0.9620454 |
| | SD bootstrap | | 0.9793316 | 0.9938614 | 0.9797415 |
| Simple linear regression | Mean | | 1.940518 | 2.731128 | 3.671224 |
| | Standard deviation | | 0.9948923 | 0.9928891 | 0.9925527 |
| | %RB | | −0.004716414 | 0.002994436 | 0.002771179 |
| | MSE | | 0.9901363 | 0.9861755 | 0.9855044 |
| | SD bootstrap | | 0.9940906 | 0.9909141 | 0.9916702 |

the Nadaraya-Watson estimator which has the largest MSE value. At time points

In terms of the bootstrap standard deviation, it can be seen that the local linear estimator performs the best at all the three time points

In terms of the percentage relative bias (%RB), at time points

In terms of the MSE, at time points

In terms of the bootstrap standard deviation, observe from

From

Generally, nonrespondents in any survey data have a significant impact on the bias and the variance of the estimators; therefore, before such data are used in statistical inference, imputation with an appropriate technique ought to be done. In this study, the main objective was to obtain an imputation method based on local linear regression for nonmonotone nonrespondents in longitudinal surveys and to determine its asymptotic properties. Comparing the parametric and nonparametric methods, the nonparametric methods performed better than the parametric methods. This was evident from the MSE and %RB values in both the normal and log-normal data. Among the nonparametric methods, the local linear estimator was the best estimator, as it behaved better than the Nadaraya-Watson estimator in terms of %RB. In terms of the bootstrap standard deviation, the local linear estimator performs best at all three time points, since it has the smallest bootstrap standard deviations for the two data sets. Generally, the local linear estimator performs relatively well, particularly in the normal data. We conclude that the use of nonparametric estimators seems plausible in both theoretical and practical scenarios.

We wish to thank the African Union Commission for fully funding this research.

Pyeye, S., Syengo, C.K., Odongo, L., Orwa, G.O. and Odhiambo, R.O. (2016) Imputation Based on Local Linear Regression for Nonmonotone Nonrespondents in Longitudinal Surveys. Open Journal of Statistics, 6, 1138-1154. http://dx.doi.org/10.4236/ojs.2016.66092

LEMMA 1. The bias of

Under the regularity conditions in section 3,

PROOF OF LEMMA 1.

Proof. From Equation (23),

where

The expectation of

The bias of

For fixed design points of

Now,

1)

2)

3)

4)

Equation (56) becomes

Letting

Hence, the bias of

and hence the result.

LEMMA 2. The asymptotic expression of the variance of

as

PROOF OF LEMMA 2.

Proof. Using Equation (23),

since

It follows that

where

and

as

Thus,

The asymptotic expression of the variance of

where

MSE of

From LEMMA 1 and 2, the MSE of
