Open Journal of Statistics
Vol.07 No.05(2017), Article ID:79925,15 pages
10.4236/ojs.2017.75059
Estimating a Finite Population Mean under Random Non-Response in Two Stage Cluster Sampling with Replacement
Nelson Kiprono Bii1, Christopher Ouma Onyango2, John Odhiambo1
1Institute of Mathematical Sciences, Strathmore University, Nairobi, Kenya
2Department of Statistics, Kenyatta University, Nairobi, Kenya
Copyright © 2017 by authors and Scientific Research Publishing Inc.
This work is licensed under the Creative Commons Attribution International License (CC BY 4.0).
http://creativecommons.org/licenses/by/4.0/
Received: September 1, 2017; Accepted: October 24, 2017; Published: October 27, 2017
ABSTRACT
Non-response is a regular occurrence in sample surveys. Estimators developed without accounting for non-response may be severely biased when estimating population parameters. In this paper, a finite population mean is estimated when non-response occurs at random under two stage cluster sampling with replacement. It is assumed that non-response arises in the survey variable in the second stage of cluster sampling. A weighting method of compensating for non-response is applied. Asymptotic properties of the proposed estimator of the population mean are derived. Under mild assumptions, the estimator is shown to be asymptotically consistent.
Keywords:
Non-Response, Nadaraya-Watson Estimation, Two Stage Cluster Sampling
1. Introduction
In survey sampling, non-response is one source of errors in data analysis. Non-response introduces bias into the estimation of population characteristics. It also causes samples to fail to follow the distributions determined by the original sampling design. This paper seeks to reduce the non-response bias in the estimation of a finite population mean in two stage cluster sampling.
Use of regression models is recognized as one of the procedures for reducing bias due to non-response using auxiliary information. In practice, information on the variables of interest is not available for non-respondents but information on auxiliary variables may be available for non-respondents. It is therefore desirable to model the response behavior and incorporate the auxiliary data into the estimation so that the bias arising from non-response can be reduced. If the auxiliary variables are correlated with the response behavior, then the regression estimators would be more precise in estimation of population parameters, given the auxiliary information is known.
Many authors have developed estimators of the population mean where non-response exists in both the study and auxiliary variables. There are, however, cases that do not exhibit non-response in the auxiliary variables, such as the number of people in a family or the time one takes to go through education. Imputation techniques have been used to account for non-response in the study variable. For instance, [1] applied a compromised method of imputation to estimate a finite population mean under two stage cluster sampling; this method, however, produced a large bias. In this study, the Nadaraya-Watson regression technique is applied in deriving the estimator for the finite population mean. Kernel weights are used to compensate for non-response.
Reweighting Method
Non-response causes loss of observations, so reweighting increases the weights of all or almost all of the responding elements to compensate for the elements that fail to respond in a survey. The population mean, $\bar{Y}$, is estimated by selecting a sample of size n at random with replacement. If responding units to item y are independent so that the probability of unit j responding in cluster i is $p_{ij}$, then an imputed estimator, $\bar{y}_I$, for $\bar{Y}$, is given by
$\bar{y}_I = \frac{\sum_{ij \in s_r} w_{ij} y_{ij} + \sum_{ij \in s_m} w_{ij} y_{ij}^{*}}{\sum_{ij \in s} w_{ij}}$ (1.0)
where $w_{ij} = 1/\pi_{ij}$ gives the sample survey weight tied to unit j in cluster i and $\pi_{ij}$ is its second order probability of inclusion, $s_r$ is the set of r units responding to item y and $s_m$ is the set of m units that failed to respond to item y, so that $r + m = n$ and $s = s_r \cup s_m$, and $y_{ij}^{*}$ is the imputed value generated so that the missing value $y_{ij}$ is compensated for [2].
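The weighted, imputed estimator described above can be illustrated with a short Python sketch. The function name, the (weight, value) pair layout and the numbers are hypothetical, and dividing by the total weight is an assumption about the form of the estimator rather than a statement of the authors' formula.

```python
def imputed_weighted_mean(responded, imputed):
    """Weight-normalised mean over responding and imputed units.

    responded: list of (weight, observed value) pairs for responding units.
    imputed:   list of (weight, imputed value) pairs for non-responding units.
    """
    pairs = list(responded) + list(imputed)
    total_weight = sum(w for w, _ in pairs)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(w * y for w, y in pairs) / total_weight


# Hypothetical data: two respondents and one non-respondent whose value
# has been imputed. Weights play the role of inverse inclusion probabilities.
responded = [(2.0, 10.0), (2.0, 14.0)]
imputed = [(4.0, 12.0)]
estimate = imputed_weighted_mean(responded, imputed)
print(estimate)
```

Here the weighted total is 2(10) + 2(14) + 4(12) = 96 and the total weight is 8, so the sketch returns 12.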
2. The Proposed Estimator of Finite Population Mean
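Consider a finite population of size M consisting of N clusters with $M_i$ elements in the ith cluster, so that $M = \sum_{i=1}^{N} M_i$. A sample of n clusters is selected so that $n_1$ units respond and $n_2$ units fail to respond. Let $y_{ij}$ denote the value of the survey variable Y for unit j in cluster i, for $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, M_i$, and let the population mean be given by
$\bar{Y} = \frac{1}{M} \sum_{i=1}^{N} \sum_{j=1}^{M_i} y_{ij}$ (2.1)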
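To make the sampling set-up concrete, the following Python sketch computes the mean of a small clustered population and draws a two stage sample with replacement. The population values, the function names and the equal-probability selection at both stages are illustrative assumptions, not details taken from the paper.

```python
import random


def population_mean(clusters):
    """Mean over every unit in every cluster: total of all values / M."""
    total_units = sum(len(cluster) for cluster in clusters)
    return sum(sum(cluster) for cluster in clusters) / total_units


def two_stage_sample(clusters, n, m, rng):
    """Draw n clusters with replacement, then m units with replacement
    from each drawn cluster. Returns (cluster index, unit index, value)."""
    sample = []
    for _ in range(n):
        i = rng.randrange(len(clusters))          # first stage: a cluster
        for _ in range(m):
            j = rng.randrange(len(clusters[i]))   # second stage: a unit
            sample.append((i, j, clusters[i][j]))
    return sample


# Hypothetical population: N = 4 clusters of 3 units each, so M = 12.
clusters = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0],
            [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
rng = random.Random(2017)  # fixed seed so the draw is repeatable
sample = two_stage_sample(clusters, n=2, m=2, rng=rng)
print(population_mean(clusters), sample)
```

Because selection is with replacement, the same cluster or unit may appear more than once in the sample.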
Let an estimator of the finite population mean be defined by as follows:
(2.2)
where is an indicator variable defined by
and and are the number of units that respond and those that fail to respond respectively.
is the probability of selecting the jth unit in the ith cluster into the sample.
Let to be the inverse of the second order inclusion probabilities and that is the ith auxiliary random variable from the jth cluster. It follows that Equation (2.2) becomes
(2.3)
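Suppose the response indicators $\delta_{ij}$ are known to be Bernoulli random variables with probability of success $p_{ij}$; then $E(\delta_{ij}) = p_{ij}$ and $Var(\delta_{ij}) = p_{ij}(1 - p_{ij})$ [3]. Thus, the expected value of the estimator of the population mean is given by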
(2.4)
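The Bernoulli assumption on the response indicators can be checked numerically. The sketch below simulates independent indicators with a single hypothetical response probability p = 0.7 and compares the sample mean and variance with p and p(1 - p). The value of p, the seed and the function name are assumptions made for illustration only.

```python
import random


def simulate_response_indicators(p, size, rng):
    """Draw `size` independent Bernoulli(p) indicators (1 = unit responds)."""
    return [1 if rng.random() < p else 0 for _ in range(size)]


p = 0.7                      # hypothetical response probability
rng = random.Random(12345)   # fixed seed so the run is repeatable
delta = simulate_response_indicators(p, 100_000, rng)

mean_delta = sum(delta) / len(delta)
var_delta = sum((d - mean_delta) ** 2 for d in delta) / len(delta)
print(round(mean_delta, 3), round(var_delta, 3))
```

With this many draws, the two printed values should lie close to 0.7 and 0.21 respectively.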
Assuming non-response in the second stage of sampling, the problem is therefore to estimate the values of the survey variable for the non-responding units. To do this, the regression model applied by [4] and [5], given below, is used:
$Y_{ij} = m(X_{ij}) + e_{ij}$ (2.5)
where $m(\cdot)$ is a smooth function of the auxiliary variables and $e_{ij}$ is the residual term with mean zero and a strictly positive variance. Substituting Equation (2.5) in Equation (2.4), the following result is obtained:
(2.6)
Assuming that , and simplifying Equation (2.6) we obtain the following
(2.7)
A detailed work done by [5] proved that . Therefore Equation (2.7) reduces to
(2.8)
The second term in Equation (2.8) is simplified as follows:
(2.9)
But , [6] . Thus we get the following:
(2.10)
(2.11)
But , for details see [5] .
On simplification, Equation (2.11) reduces to
(2.12)
Recall
so that Equation (2.12) may be re-written as follows:
(2.13)
Assume the sample sizes are large i.e. as and , Equation (2.13) simplifies to
(2.14)
Combining Equation (2.14) with the first term in Equation (2.8) gives:
(2.15)
Since the first term represents the response units, their values are all known. The problem is to estimate the non-response units in the second term. Let the indicator variable , the problem now reduces to that of estimating the function , which is a function of the auxiliary variables, . Hence the expected value of the estimator of the finite population mean under non-response is given as;
(2.16)
In order to derive the asymptotic properties of the expected value of the proposed estimator in Equation (2.16), a review of the Nadaraya-Watson estimator is first given below.
3. Review of Nadaraya-Watson Estimator
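Given a random sample of bivariate data $(X_1, Y_1), \ldots, (X_n, Y_n)$ having a joint pdf $g(x, y)$ with the regression model given by
$Y_i = m(X_i) + e_i$
as in Equation (2.5), where $m(\cdot)$ is unknown. Let the error term satisfy the following conditions:
$E(e_i) = 0, \quad Var(e_i) = \sigma^2, \quad Cov(e_i, e_j) = 0 \text{ for } i \neq j$ (3.0)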
Furthermore, let $K(\cdot)$ denote a symmetric kernel density function which is twice continuously differentiable with:
$\int K(w)\, dw = 1, \quad \int w K(w)\, dw = 0, \quad \int w^2 K(w)\, dw = k_2 > 0$ (3.1)
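In addition, let the smoothing weights be defined by
$w_i(x) = \frac{K\left(\frac{x - X_i}{b}\right)}{\sum_{j=1}^{n} K\left(\frac{x - X_j}{b}\right)}$ (3.2)
where b is a smoothing parameter, normally referred to as the bandwidth, such that $b \to 0$ as $n \to \infty$.
Using Equation (3.2), the Nadaraya-Watson estimator of $m(x)$ is given by:
$\hat{m}(x) = \sum_{i=1}^{n} w_i(x) Y_i = \frac{\sum_{i=1}^{n} K\left(\frac{x - X_i}{b}\right) Y_i}{\sum_{i=1}^{n} K\left(\frac{x - X_i}{b}\right)}$ (3.3)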
Given the model and the conditions on the error term in Equation (3.0) above, the regression function relating the survey variable Y to the auxiliary variable X can be expressed through the joint pdf $g(x, y)$ of $(X, Y)$ as follows:
$m(x) = E(Y \mid X = x) = \frac{\int y\, g(x, y)\, dy}{\int g(x, y)\, dy}$ (3.4)
where $\int g(x, y)\, dy = f(x)$ is the marginal density of X. The numerator and the denominator of Equation (3.4) can be estimated separately using kernel functions as follows:
is estimated by;
(3.5)
and
(3.6)
Using change of variables technique; let
(3.7)
So that
(3.8)
(3.9)
From the conditions specified in Equation (3.1), Equation (3.9) simplifies to
(3.10)
which reduces to:
(3.11)
Following the same procedure, the denominator can be obtained as follows:
(3.12)
Using change of variable technique as in Equation (3.7), Equation (3.12) can be re-written as follows:
(3.13)
which yields
(3.14)
since the kernel is a pdf and therefore integrates to 1.
It follows from Equations (3.11) and (3.14) that the estimator is as given in Equation (3.3). Thus the estimator is a linear smoother, since it is a linear function of the observations of the survey variable. Given a sample and a specified kernel function, then for a given auxiliary value, the corresponding y-estimate is obtained by the estimator outlined in Equation (3.3), which can be written as:
(3.15)
where the estimator in Equation (3.15) is the Nadaraya-Watson estimator of the unknown regression function; for details see [7] [8].
This provides a way of estimating, for instance, the non-response values of the survey variable, given the corresponding auxiliary values, for a specified kernel function.
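The kernel-weighted average reviewed in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: a Gaussian kernel is chosen for convenience, the data and the bandwidth b = 0.75 are hypothetical, and the function names are not taken from the paper.

```python
import math


def gaussian_kernel(u):
    """Standard normal density: symmetric about zero and integrates to 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nadaraya_watson(x0, xs, ys, b, kernel=gaussian_kernel):
    """Kernel-weighted average of ys evaluated at x0 with bandwidth b."""
    if b <= 0:
        raise ValueError("bandwidth b must be positive")
    k = [kernel((x0 - x) / b) for x in xs]
    total = sum(k)
    if total == 0.0:
        raise ValueError("no kernel mass at x0; increase the bandwidth")
    return sum(ki * yi for ki, yi in zip(k, ys)) / total


# Hypothetical respondents: auxiliary values and observed survey values.
x_resp = [1.0, 2.0, 3.0, 4.0, 5.0]
y_resp = [2.1, 3.9, 6.2, 8.1, 9.8]

# Auxiliary values of two hypothetical non-respondents whose survey
# values are missing and are predicted from the respondents.
x_miss = [2.5, 4.5]
y_hat = [nadaraya_watson(x, x_resp, y_resp, b=0.75) for x in x_miss]
print(y_hat)
```

Since the weights are non-negative and sum to one, every prediction lies between the smallest and largest observed survey values.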
4. Asymptotic Bias of the Mean Estimator
Equation (2.16) may be written as
(4.1)
Replacing by and re-writing Equation (3.15) using the property of symmetry associated with Nadaraya-Watson estimator, then
(4.2)
(4.3)
where is the estimated marginal density of auxiliary variables .
But for a finite population mean, the expected value of the estimator is given in Equation (4.1). The bias is given by
(4.4)
(4.5)
Which reduces to
(4.6)
(4.7)
Re-writing the regression model given by as
(4.8)
So that from Equation (4.3) the first term in Equation (4.7) before taking the expectation is given as:
(4.9)
Simplifying Equation (4.9) the following is thus obtained:
(4.10)
where
Taking conditional expectation of Equation (4.10) we get
(4.11)
To obtain the relationship between the conditional mean and the selected bandwidth, the following theorem due to [6] is applied:
Theorem: (Dorfman, 1992)
Let $K(u)$ be a symmetric density function with $\int u K(u)\, du = 0$ and $\int u^2 K(u)\, du = k_2 > 0$. Assume n and N increase together such that $n/N \to \pi$ with $0 < \pi < 1$. Besides, assume the sampled and non-sampled values of x are in the interval $[c, d]$ and are generated by densities $d_s$ and $d_{p-s}$ respectively, both bounded away from zero on $[c, d]$ and assumed to have continuous second derivatives. If for any variable Z, $E(Z \mid U = u) = A(u) + O(B)$ and $Var(Z \mid U = u) = O(C)$, then $Z = A(u) + O_p(B + C^{1/2})$.
Applying this theorem, we have
(4.12)
The theorem is stated without proof. To apply it, the bias and variance terms are derived separately as follows:
From Equation (3.0) it follows that . Therefore, . Thus can be obtained as follows:
(4.13)
Using substitution and change of variable technique below
(4.14)
Equation (4.13) can simplify to:
(4.15)
(4.16)
Using the Taylor’s series expansion about the point , the kth order kernel can be derived as follows:
(4.17)
Similarly,
(4.18)
Expanding up to the 3rd order kernels, Equation (4.18) becomes
(4.19)
In a similar manner, the expansion of Equation (4.16) up to order is given by:
(4.20)
Simplifying Equation (4.20) gives;
(4.21)
Using the conditions stated in Equation (3.1), the derivation in (4.21) can further be simplified to obtain:
(4.22)
Hence the expected value of the second term in Equation (4.11) then becomes:
(4.23)
(4.24)
(4.25)
where
(4.26)
and is as stated in Equation (3.1)
Using the expression for the bias given in Equation (4.4) and the conditional expectation in Equation (4.11), we obtain the following equation for the bias of the estimator:
(4.27)
5. Asymptotic Variance of the Estimator
From Equations (4.9) and (4.11),
(5.0)
Hence
(5.1)
where
Expressing Equation (5.1) in terms of expectation we obtain:
(5.2)
Using the fact that the conditional expectation
, the second term in Equation (4.13) reduces to zero. Therefore,
(5.3)
where
(5.4)
(5.5)
(5.6)
which can be simplified to get:
(5.7)
(5.8)
(5.9)
Hence
(5.10)
where so that .
Changing variables and applying Taylor’s series expansion then
(5.11)
(5.12)
which simplifies to
(5.13)
For large samples, as , and for , then . Hence the variance in Equation (5.12) asymptotically tends to zero, that is,
(5.14)
On simplification,
(5.15)
Substituting Equation (5.7) into Equation (5.15) yields the following:
(5.16)
(5.17)
where,
It is notable that the variance term still depends on the marginal density function of the auxiliary variables. It can also be observed that the variance is inversely related to the smoothing parameter b, so that an increase in b results in a smaller variance. However, increasing the bandwidth would give a larger bias. There is therefore a trade-off between the bias and the variance of the estimated population mean, and a bandwidth that provides a compromise between the two measures would be desirable.
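The trade-off between bias and variance can be seen in a small worked example. Because a kernel smoother is linear in the observations, its conditional bias and variance at a point can be computed exactly for a fixed design. In the Python sketch below, the regression function m(x) = x², the error variance 0.25, the grid of auxiliary values and the three bandwidths are all hypothetical choices made for illustration.

```python
import math


def kernel_weights(x0, xs, b):
    """Normalised Gaussian kernel weights at x0 (they sum to one)."""
    k = [math.exp(-0.5 * ((x0 - x) / b) ** 2) for x in xs]
    total = sum(k)
    return [ki / total for ki in k]


def bias_and_variance(x0, xs, m, sigma2, b):
    """Exact conditional bias and variance of the kernel smoother at x0,
    for fixed xs and uncorrelated errors with common variance sigma2."""
    w = kernel_weights(x0, xs, b)
    bias = sum(wi * m(xi) for wi, xi in zip(w, xs)) - m(x0)
    variance = sigma2 * sum(wi * wi for wi in w)
    return bias, variance


def m(x):
    """Hypothetical smooth regression function."""
    return x * x


xs = [i / 50 for i in range(51)]   # fixed auxiliary values on [0, 1]
bandwidths = (0.02, 0.1, 0.3)
results = {b: bias_and_variance(0.5, xs, m, 0.25, b) for b in bandwidths}
for b in bandwidths:
    print(b, results[b])
```

In this example the variance falls and the absolute bias grows as b moves from 0.02 to 0.3, which is the behaviour described in the text.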
6. Mean Squared Error (MSE) of the Finite Population Mean Estimator
The MSE of the estimator combines its bias and variance terms, that is,
(6.0)
(6.1)
Expanding Equation (6.1) gives:
(6.2)
(6.3)
Combining the bias in Equation (4.27) and the variance in Equation (5.17), and conditioning on the values of the auxiliary variables, gives
(6.4)
(6.5)
where , , as used earlier in the rest of the derivations.
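The decomposition of the MSE into a variance term and a squared bias term can be verified numerically. The Python sketch below uses simulated estimates with a deliberately built-in bias of 0.3. The target value, the noise level, the seed and the function name are hypothetical and serve only to illustrate the identity.

```python
import random
import statistics


def mse_decomposition(estimates, truth):
    """Return (mse, variance, bias) of a list of estimates of `truth`.
    The variance is the population variance about the mean estimate."""
    mean_est = statistics.fmean(estimates)
    bias = mean_est - truth
    variance = statistics.pvariance(estimates, mu=mean_est)
    mse = statistics.fmean((e - truth) ** 2 for e in estimates)
    return mse, variance, bias


truth = 5.0                 # hypothetical population mean
rng = random.Random(7)      # fixed seed so the run is repeatable
# Hypothetical estimator: biased upwards by 0.3 with unit-variance noise.
estimates = [truth + 0.3 + rng.gauss(0.0, 1.0) for _ in range(1000)]

mse, variance, bias = mse_decomposition(estimates, truth)
print(mse, variance + bias ** 2)
```

The two printed numbers agree up to floating point rounding, since the identity MSE = variance + bias² holds exactly for any set of estimates.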
7. Conclusion
If the sample size is large enough, that is as
and
the
Cite this paper
Bii, N.K., Onyango, C.O. and Odhiambo, J. (2017) Estimating a Finite Population Mean under Random Non-Response in Two Stage Cluster Sampling with Replacement. Open Journal of Statistics, 7, 834-848. https://doi.org/10.4236/ojs.2017.75059
References
- 1. Singh, S. and Horn, S. (2000) Compromised Imputation in Survey Sampling. Metrika, 51, 267-276. https://doi.org/10.1007/s001840000054
- 2. Lee, H., Rancourt, E. and Sarndal, C. (2002) Variance Estimation from Survey Data under Single Imputation. Survey Nonresponse, 315-328.
- 3. Bethlehem, J.G. (2012) Using Response Probabilities for Assessing Representativity. Statistics Netherlands, International Statistical Review, 80, 382-399.
- 4. Ouma, C. and Wafula, C. (2005) Bootstrap Confidence Intervals for Model-Based Surveys. East African Journal of Statistics, 1, 84-90.
- 5. Onyango, C.O., Otieno, R.O. and Orwa, G.O. (2010) Generalized Model Based Confidence Intervals in Two Stage Cluster Sampling. Pakistan Journal of Statistics and Operation Research, 6. https://doi.org/10.18187/pjsor.v6i2.128
- 6. Dorfman, A.H. (1992) Nonparametric Regression for Estimating Totals in Finite Populations. In: Proceedings of the Section on Survey Research Methods, American Statistical Association Alexandria, VA, 622-625.
- 7. Nadaraya, E.A. (1964) On Estimating Regression. Theory of Probability & Its Applications, 9, 141-142. https://doi.org/10.1137/1109020
- 8. Watson, G.S. (1964) Smooth Regression Analysis. Sankhya: The Indian Journal of Statistics, Series A, 359-372.