Let two separate surveys collect related information on a single population U, and consider the situation where we want to best combine the data from the two surveys to yield a single set of estimates of a population quantity (population parameter) of interest. This article presents a multiplicative bias reduction estimator for nonparametric regression applied to the two-sample problem in sample surveys. The approach consists in applying a multiplicative bias correction to a pilot estimator. The multiplicative bias correction method, proposed by Linton & Nielsen (1994), guarantees a positive estimate and reduces the bias of the estimate with a negligible increase in variance. Applying this method to the two-sample problem, we show through the study of its asymptotic properties that the resulting estimator is asymptotically unbiased and statistically consistent. Furthermore, an empirical study is carried out to compare the performance of the developed estimator with existing ones.

It sometimes happens that two separate surveys gather related information on a variable of interest for a population, U, perhaps with distinct designs and modes of sampling. The question of how best to combine the data from the two surveys then becomes very important.

Take as an example the students of the Sub-regional Institute of Statistics and Applied Economics (ISSEA) and those of the Polytechnic Institute, both collecting data on unemployment in Cameroon in different ways and with different weightings. Researchers at the National Institute of Statistics (Cameroon) are faced with the following problem: how can the data from these two distinct surveys be joined together to produce a single dataset and a better representation of the population?

This problem has been studied for several years and approached in different ways. One approach involves computing estimates from the two surveys separately and combining them using the inverses of the estimated variances as weights, as seen in [

Just recently [

This study makes use of kernel smoothers, especially the Nadaraya-Watson smoother. However, estimators based on Nadaraya-Watson smoothing weights are typically biased in small samples and at boundary points.

There exist alternative techniques for reducing this bias. For a detailed review see [

The remaining part of this paper is organized as follows. In Section 2, a multiplicative bias corrected estimator $\hat{T}_{MBC}$ for the finite population total is proposed. In Section 3, the asymptotic properties of the proposed estimator are derived. In Section 4, an empirical study of the derived properties is presented. Section 5 concludes the paper.

Consider a finite population $U = \{1, 2, \cdots, N\}$ and let $y_1, y_2, \cdots, y_n$ represent the combined random sample drawn from the population using different sampling techniques. Suppose that for each of these $y_i$'s there is auxiliary information $x_1, x_2, \cdots, x_n$.

Consider the following model:

$$E(Y_i \mid X_i = x_i) = h(x_i) \tag{1}$$

$$\operatorname{cov}(Y_i, Y_j \mid X_i = x_i, X_j = x_j) = \begin{cases} \sigma^2(x_i), & i = j \\ 0, & i \neq j \end{cases} \tag{2}$$

where $h(x_i)$ and $\sigma^2(x_i)$ are twice continuously differentiable (hence locally Lipschitz continuous) functions. Under these assumptions, $h(x_i)$ and $\sigma^2(x_i)$ can be estimated non-parametrically.

Let $\epsilon_i = Y_i - h(X_i)$ be i.i.d. with zero mean and variance $\sigma^2$. We refer to this set-up as the weak model. Under this scheme, we can ignore which of the original samples each $Y_i$ comes from.
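As an illustration, data from the weak model can be simulated as follows; the particular regression function $h$ and error scale below are hypothetical choices, not taken from the paper, picked only to satisfy the model's assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical choices for illustration: a bounded, strictly positive
# regression function h and i.i.d. zero-mean errors.
def h(x):
    return 2.0 + np.sin(2 * np.pi * x)

n = 200
x = rng.uniform(0.0, 1.0, size=n)    # auxiliary variable X_i
eps = rng.normal(0.0, 0.3, size=n)   # errors with mean 0, variance 0.09
y = h(x) + eps                       # so that E(Y_i | X_i = x_i) = h(x_i)
```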

The finite population total is usually written as

$$T = \sum_{i \in U} y_i = \sum_{i \in s} y_i + \sum_{j \in r} y_j \tag{3}$$

where $s$ refers to the sample and $r$ to the nonsampled part of the population. Since the values in the sample part are known, estimating the finite population total is equivalent to predicting the nonsampled part of the population.

To do this, the multiplicative bias correction technique is employed, in which case the proposed estimator of the population total is defined as

$$\hat{T}_{MBC} = \sum_{i \in s} \frac{y_i - \hat{h}(x_i)}{\pi_i} + \sum_{j \in r} \hat{h}(x_j) \tag{4}$$

where $\pi_i$ is the inclusion probability and $\hat{h}(x_i)$ is the multiplicative bias corrected estimator.
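A minimal sketch of estimator (4) in Python; the toy fitted values, inclusion probabilities, and predictions below are hypothetical inputs that would in practice come from the smoother and the sampling design.

```python
import numpy as np

def mbc_total(y_s, h_hat_s, pi_s, h_hat_r):
    """Estimator (4): inverse-probability-weighted residuals over the
    sample s plus predicted values over the nonsampled part r."""
    y_s, h_hat_s, pi_s, h_hat_r = map(np.asarray, (y_s, h_hat_s, pi_s, h_hat_r))
    return np.sum((y_s - h_hat_s) / pi_s) + np.sum(h_hat_r)

# Toy inputs for illustration only.
t_hat = mbc_total(
    y_s=[3.0, 2.5, 4.0],      # observed sample values
    h_hat_s=[2.8, 2.6, 3.9],  # fitted values on the sample
    pi_s=[0.5, 0.5, 0.5],     # inclusion probabilities
    h_hat_r=[3.1, 2.9],       # predictions for nonsampled units
)
```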

The principal objective of the multiplicative bias correction technique is to remedy the main shortcoming of the kernel smoother, namely the bias problem at the boundaries. Consider a pilot smoother of the regression function

$$\tilde{h}(x) = \sum_{j=1}^{n} w_{xj} Y_j \tag{5}$$

The inverse relative estimation error of the smoother at each of the observations is given by $\frac{h(x)}{\tilde{h}(x)}$.

A noisy estimate of the ratio $\frac{h(x)}{\tilde{h}(x)}$ at the observation $X_j$ is given by

$$\beta(X_j) = \frac{Y_j}{\tilde{h}(X_j)} \tag{6}$$

Smoothing the noisy estimates $\beta(X_j)$ leads to

$$\tilde{\beta}(x) = \sum_{j=1}^{n} w_{xj}\, \beta(X_j) \tag{7}$$

This gives a better estimate of the inverse relative estimation error at each particular observation and can therefore be used as a multiplicative correction of the pilot smoother:

$$\hat{h}(x) = \tilde{\beta}(x)\, \tilde{h}(x) \tag{8}$$

For both $\tilde{h}(x)$ and $\tilde{\beta}(x)$, we use the same weighting scheme:

$$w_{xj} = \frac{1}{nh} K\!\left(\frac{x - X_j}{h}\right) \tag{9}$$

where $h$ is the bandwidth, $K$ is a probability density function symmetric about zero, and $n$ is the sample size.
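The smoother defined by (5)-(9) can be sketched as follows. We use a Gaussian kernel and normalise the weights so that $\sum_j w_{xj} = 1$ (a property used later in the derivation); the function names are our own.

```python
import numpy as np

def nw_weights(x0, X, h):
    """Nadaraya-Watson weights at x0, normalised so they sum to one."""
    u = (x0 - X) / h
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel
    return k / k.sum()

def mbc_fit(x0, X, Y, h):
    """Multiplicative bias corrected fit at x0, following (5)-(8)."""
    # Pilot smoother evaluated at every observation, Eq. (5)
    h_tilde = np.array([nw_weights(xj, X, h) @ Y for xj in X])
    beta = Y / h_tilde                  # noisy ratio estimates, Eq. (6)
    w0 = nw_weights(x0, X, h)
    beta_tilde = w0 @ beta              # smoothed ratio, Eq. (7)
    return beta_tilde * (w0 @ Y)        # correction of the pilot, Eq. (8)
```

Because the correction is multiplicative, the fit stays positive whenever the responses are positive, which is the positivity property emphasised in the abstract.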

Bandwidth Selection Techniques

● Biased cross-validation (bcv).

● Unbiased cross-validation (ucv).

● A rule of thumb for choosing the bandwidth of a Gaussian kernel density estimator (ndr0).

● A more common variation given by Scott (1992) (ndr).
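The two rules of thumb can be sketched as follows; the constants mirror R's `bw.nrd0` and `bw.nrd` selectors, which the labels ndr0 and ndr appear to refer to (the bcv and ucv selectors require a numerical search and are omitted here).

```python
import numpy as np

def bw_nrd0(x):
    """Silverman's rule of thumb for a Gaussian kernel (cf. R's bw.nrd0)."""
    x = np.asarray(x, dtype=float)
    sd = np.std(x, ddof=1)
    iqr = np.subtract(*np.percentile(x, [75.0, 25.0]))  # interquartile range
    return 0.9 * min(sd, iqr / 1.34) * x.size ** (-1.0 / 5.0)

def bw_nrd(x):
    """Scott's (1992) variation (cf. R's bw.nrd)."""
    x = np.asarray(x, dtype=float)
    sd = np.std(x, ddof=1)
    iqr = np.subtract(*np.percentile(x, [75.0, 25.0]))
    return 1.06 * min(sd, iqr / 1.34) * x.size ** (-1.0 / 5.0)
```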

The following assumptions are made in the estimation of h ^ ( x i ) .

● The regression function is bounded and strictly positive, that is, $b \geq h(x) \geq a > 0$ for all $x$.

● The regression function is twice continuously differentiable everywhere.

● $\epsilon$ has finite fourth moments and a distribution symmetric around zero.

● The bandwidth $h$ satisfies $h \to 0$, $nh \to \infty$ and $(nh)^2 \to \infty$ as $n \to \infty$.

We want to show that $E(\hat{T}_{MBC} - T) \to 0$ as $n \to \infty$. Under the model-based approach, the bias of the estimator $\hat{T}_{MBC}$ is defined as follows:

$$E[\hat{T}_{MBC} - T] = E[\hat{T}_{MBC}] - E[T] \tag{10}$$

Now, the expected value of the proposed estimator of the finite population total is given by

$$
\begin{aligned}
E[\hat{T}_{MBC}] &= E\left[\sum_{i \in s} \frac{y_i - \hat{h}(x_i)}{\pi_i} + \sum_{j \in r} \hat{h}(x_j)\right] \qquad (11)\\
&= E\left[\sum_{i \in s} \frac{y_i - \hat{h}(x_i)}{\pi_i}\right] + E\left[\sum_{j \in r} \hat{h}(x_j)\right] \qquad (12)\\
&= \sum_{i \in s} \frac{1}{\pi_i} E\bigl(y_i - \hat{h}(x_i)\bigr) + \sum_{U \setminus s} E\bigl(\hat{h}(x_j)\bigr) \qquad (13)
\end{aligned}
$$

$E(\hat{h}(x_j))$ is obtained by analysing the individual terms of a stochastic approximation of $\hat{h}(x)$. Let us establish this stochastic approximation of $\hat{h}(x)$ as shown by (Hengartner 2009).

From (8),

$$\hat{h}(x) = \tilde{\beta}(x)\, \tilde{h}(x) \tag{14}$$

$$= \sum_{j=1}^{n} w_{xj} \frac{Y_j}{\tilde{h}(X_j)}\, \tilde{h}(x) = \sum_{j=1}^{n} w_{xj} \frac{\tilde{h}(x)}{\tilde{h}(X_j)}\, Y_j \tag{15}$$

$$= \sum_{j=1}^{n} w_{xj}\, R_j(x)\, Y_j, \quad \text{where } R_j(x) = \frac{\tilde{h}(x)}{\tilde{h}(X_j)} \tag{16}$$

Define $\bar{h}(x) = E\bigl(\tilde{h}(x) \mid X_1, X_2, \cdots, X_n\bigr)$; then we can express $R_j(x)$ as

$$
\begin{aligned}
R_j(x) = \frac{\tilde{h}(x)}{\tilde{h}(X_j)}
&= \frac{\bar{h}(x)}{\bar{h}(X_j)} \cdot \frac{\tilde{h}(x)}{\bar{h}(x)} \cdot \left(\frac{\tilde{h}(X_j)}{\bar{h}(X_j)}\right)^{-1}\\
&= \frac{\bar{h}(x)}{\bar{h}(X_j)} \cdot \frac{\tilde{h}(x)-\bar{h}(x)+\bar{h}(x)}{\bar{h}(x)} \cdot \left(\frac{\tilde{h}(X_j)-\bar{h}(X_j)+\bar{h}(X_j)}{\bar{h}(X_j)}\right)^{-1}\\
&= \frac{\bar{h}(x)}{\bar{h}(X_j)} \cdot \left(\frac{\tilde{h}(x)-\bar{h}(x)}{\bar{h}(x)} + 1\right) \cdot \left(\frac{\tilde{h}(X_j)-\bar{h}(X_j)}{\bar{h}(X_j)} + 1\right)^{-1}\\
&= \frac{\bar{h}(x)}{\bar{h}(X_j)} \cdot \bigl(R(x) + 1\bigr)\bigl(R(X_j) + 1\bigr)^{-1}
\end{aligned}
$$

where $R(x) = \frac{\tilde{h}(x) - \bar{h}(x)}{\bar{h}(x)}$.

Through the series expansion,

$$\bigl(R(X_j)+1\bigr)^{-1} = \frac{1}{1-\bigl(-R(X_j)\bigr)} = \sum_{k=0}^{\infty} \bigl[-R(X_j)\bigr]^k = 1 - R(X_j) + R(X_j)^2 - \cdots$$

so that

$$R_j(x) = \frac{\bar{h}(x)}{\bar{h}(X_j)} \bigl[1 + R(x) - R(X_j) + r_j(x, X_j)\bigr]$$

is an approximation of $R_j(x)$, where the remainder $r_j(x, X_j)$ collects the higher-order terms of the expansion.

Replacing both $Y_j = h(X_j) + \epsilon_j$ and $R_j(x)$ in (16), we obtain

$$
\begin{aligned}
\hat{h}(x) &= \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\bigl[1 + R(x) - R(X_j) + r_j(x, X_j)\bigr]\bigl(h(X_j)+\epsilon_j\bigr)\\
&= \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\, h(X_j)
 + \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\, \epsilon_j\\
&\quad + \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\bigl(R(x)-R(X_j)\bigr)\bigl(h(X_j)+\epsilon_j\bigr)
 + \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\, r_j(x, X_j)\bigl(h(X_j)+\epsilon_j\bigr)
\end{aligned}
$$

Using the assumption $nh \to \infty$, the remainder term tends to zero in probability and the expression reduces to

$$\hat{h}(x) = \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\, h(X_j) + \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\, \epsilon_j + \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\bigl(R(x)-R(X_j)\bigr)\bigl(h(X_j)+\epsilon_j\bigr) + o_p\!\left(\frac{1}{nh}\right)$$

To evaluate (13), we need $E\bigl(\hat{h}(x_j)\bigr)$; hence,

$$
\begin{aligned}
E\bigl(\hat{h}(x_j)\bigr) &= \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\, h(X_j) + \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\, E(\epsilon_j)\\
&\quad + \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\, h(X_j)\, E\!\left(\frac{\tilde{h}(x)}{\bar{h}(x)} - \frac{\tilde{h}(X_j)}{\bar{h}(X_j)}\right) + o_p\!\left(\frac{1}{nh}\right)
\end{aligned}
$$

Since $E(\epsilon_j) = 0$ and $\bar{h}(x) = E\bigl(\tilde{h}(x)\bigr)$, the second and third terms vanish, leaving

$$E\bigl(\hat{h}(x_j)\bigr) = \sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\, h(X_j) + o_p\!\left(\frac{1}{nh}\right) \tag{17}$$

Hence,

$$E[\hat{T}_{MBC}] = \sum_{i \in s}\frac{1}{\pi_i}\left[E(y_i) - \left(\sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\, h(X_j) + o_p\!\left(\frac{1}{nh}\right)\right)\right] + \sum_{U \setminus s}\left(\sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\, h(X_j)\right) + o_p\!\left(\frac{1}{nh}\right) \tag{18}$$

The above expression can be simplified by a truncated Taylor expansion of $\frac{h(X_j)}{\bar{h}(X_j)}$ about the point $x$:

$$\frac{h(X_j)}{\bar{h}(X_j)} = \frac{h(x)}{\bar{h}(x)} + (X_j - x)\left(\frac{h(x)}{\bar{h}(x)}\right)' + \frac{(X_j - x)^2}{2}\left(\frac{h(x)}{\bar{h}(x)}\right)'' + o_p(1) \tag{19}$$

Now, substituting the first two terms into (18) gives

$$E[\hat{T}_{MBC}] = \sum_{i \in s}\frac{1}{\pi_i}\bigl[E(y_i) - E\bigl(\hat{h}(x_i)\bigr)\bigr] + \sum_{U \setminus s}\sum_{j=1}^{n} w_{xj}\,\bar{h}(x)\left(\frac{h(x)}{\bar{h}(x)} + (X_j - x)\left(\frac{h(x)}{\bar{h}(x)}\right)'\right) + o_p\!\left(\frac{1}{nh}\right) \tag{20}$$

But $\sum_{j=1}^{n} w_{xj} = 1$ and $\sum_{j=1}^{n}(X_j - x)\, w_{xj} = 0$, therefore

$$E[\hat{T}_{MBC}] = \sum_{U \setminus s} h(x) + o_p\!\left(\frac{1}{nh}\right) \tag{21}$$

Furthermore,

$$E(T) = \sum_{i \in s} E(y_i) + \sum_{j \in r} E(y_j) = \sum_{i \in s}\bar{y} + \sum_{j \in r} h(x)$$

Hence the asymptotic bias of the estimator is given by

$$\mathrm{Bias}\bigl(\hat{T}_{MBC}\bigr) = E\left(\frac{\hat{T}_{MBC} - T}{N}\right) = \frac{1}{N}\sum_{i \in s}\bar{y} + o_p\!\left(\frac{1}{nh}\right)$$

The bias of $\hat{T}_{MBC}$ is therefore of order $o_p\!\left(\frac{1}{nh}\right)$. Thus it converges to zero at a faster rate than the existing non-parametric estimators, which generally converge at the rate $O_p(h^2)$.

The variance of the proposed estimator of the finite population total is given by

$$
\begin{aligned}
\mathrm{Var}\bigl[\hat{T}_{MBC}\bigr] &= \mathrm{Var}\left[\sum_{i \in s}\frac{y_i - \hat{h}(x_i)}{\pi_i} + \sum_{j \in r}\hat{h}(x_j)\right]\\
&= \mathrm{Var}\left[\sum_{i \in s}\frac{y_i - \hat{h}(x_i)}{\pi_i}\right] + \mathrm{Var}\left[\sum_{j \in r}\hat{h}(x_j)\right]\\
&= \sum_{i \in s}\left(\frac{1}{\pi_i}\right)^2 \mathrm{Var}\bigl(y_i - \hat{h}(x_i)\bigr) + \sum_{U \setminus s}\mathrm{Var}\bigl(\hat{h}(x_j)\bigr)
\end{aligned}
$$

Firstly,

$$\mathrm{Var}\bigl(\hat{h}(x_j)\bigr) = \mathrm{Var}\left(\sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\bigl[1 + R(x) - R(X_j) + r_j(x, X_j)\bigr]\bigl(h(X_j) + \epsilon_j\bigr)\right) \tag{22}$$

Using the assumption $nh \to \infty$, the remainder terms converge to zero in probability. Therefore $r_j(x, X_j)\bigl(h(X_j) + \epsilon_j\bigr) = o_p\!\left(\frac{1}{nh}\right)$ and Equation (22) reduces to

$$\mathrm{Var}\bigl(\hat{h}(x_j)\bigr) = \mathrm{Var}\left(\sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\bigl[1 + R(x) - R(X_j)\bigr]\bigl(h(X_j) + \epsilon_j\bigr) + o_p\!\left(\frac{1}{nh}\right)\right) \tag{23}$$

Truncating the expansion at the first term yields

$$\mathrm{Var}\bigl(\hat{h}(x_j)\bigr) = \mathrm{Var}\left(\sum_{j=1}^{n} w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\, y_j\right) + o_p\!\left(\frac{1}{(nh)^2}\right) = \sum_{j=1}^{n}\left(w_{xj}\,\frac{\bar{h}(x)}{\bar{h}(X_j)}\right)^2 \sigma^2(x_j) + o_p\!\left(\frac{1}{(nh)^2}\right)$$

We simplify the above expression using the first two terms of the Taylor expansion of $\frac{\sigma^2(x_j)}{\bar{h}^2(X_j)}$, obtaining

$$\mathrm{Var}\bigl(\hat{h}(x_j)\bigr) = \sum_{j=1}^{n} w_{xj}^2\, \sigma^2(x_j) + o_p\!\left(\frac{1}{(nh)^2}\right) \tag{24}$$

Therefore,

$$\mathrm{Var}\bigl[\hat{T}_{MBC}\bigr] = \sum_{i \in s}\left(\frac{1}{\pi_i}\right)^2 \sigma^2(x_i) + \sum_{U \setminus s}\sum_{j=1}^{n} w_{xj}^2\, \sigma^2(x_j) + o_p\!\left(\frac{1}{(nh)^2}\right) \tag{25}$$

Thus the asymptotic variance is given by

$$\mathrm{Var}\left(\frac{\hat{T}_{MBC}}{N}\right) = \frac{1}{N^2}\sum_{i \in s}\left(\frac{1}{\pi_i}\right)^2 \sigma^2(x_i) + \frac{1}{N^2}\sum_{U \setminus s}\sum_{j=1}^{n} w_{xj}^2\, \sigma^2(x_j) + o_p\!\left(\frac{1}{(nh)^2}\right) \tag{26}$$

This implies that T ^ M B C is more efficient than the usual non-parametric regression estimator proposed by Dorfman (1992).

The asymptotic mean square error of the estimator $\hat{T}_{MBC}$ is given by

$$\mathrm{MSE}\bigl[\hat{T}_{MBC}\bigr] = \mathrm{Var}\bigl[\hat{T}_{MBC}\bigr] + \bigl[\mathrm{Bias}\bigl(\hat{T}_{MBC}\bigr)\bigr]^2 \tag{27}$$

$$\mathrm{MSE}\bigl[\hat{T}_{MBC}\bigr] = \frac{1}{N^2}\sum_{i \in s}\left(\frac{1}{\pi_i}\right)^2 \sigma^2(x_i) + \frac{1}{N^2}\sum_{U \setminus s}\sum_{j=1}^{n} w_{xj}^2\, \sigma^2(x_j) + o_p\!\left(\frac{1}{(nh)^2}\right) + \left[\frac{1}{N}\sum_{i \in s}\bar{y} + o_p\!\left(\frac{1}{nh}\right)\right]^2 \tag{28}$$

As $n \to \infty$, with $h \to 0$ and $nh \to \infty$, $\mathrm{MSE}\bigl[\hat{T}_{MBC}\bigr]$ tends to 0, indicating that the proposed estimator is statistically consistent.

In this section, the theory developed in the previous sections is tested using a set of simulation studies, with a mix of survey designs, and employing various approaches to selecting the bandwidth. We employ a population U of countries of the world of size N = 188, with auxiliary variable x = gross national income (GNI) and variable of interest y = human development index (HDI); of interest is the population total of the HDI, $y = \sum_{l \in U} y_l$.

We suppose, for each run of the experiment that two samples are taken:

Sample 1 ($s_1$): simple random sampling without replacement (srswor), $n_1 = 32$.

Sample 2 ($s_2$): stratified simple random sampling (stratsrs) with four equal strata and 8 units taken at random in each, so that $n_2 = 32$.

The total experiment consists of 500 runs of pairs of samples.
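One run of the two-sample design can be sketched as follows, splitting the N = 188 units into four strata of 47 each for $s_2$ (how the strata were actually formed is an assumption on our part):

```python
import numpy as np

rng = np.random.default_rng(2017)
N = 188
U = np.arange(N)                      # population unit labels

# Sample 1: srswor, n1 = 32.
s1 = rng.choice(U, size=32, replace=False)

# Sample 2: stratified srs -- four equal strata of 47 units,
# 8 units drawn without replacement in each, n2 = 32.
strata = np.array_split(U, 4)
s2 = np.concatenate([rng.choice(st, size=8, replace=False) for st in strata])

s = np.union1d(s1, s2)                # combined sample (duplicates dropped)
r = np.setdiff1d(U, s)                # nonsampled part of the population
```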

For an estimator $\hat{T}$ we considered the following measures of relative success across the 500 runs:

i) Unconditional relative bias, measured as the deviation of the mean value (across runs) from the target, relative to the target:

$$\mathrm{Bias} = \frac{1}{500}\sum_{\mathrm{runs}}\bigl(\hat{T} - T\bigr) \Big/ T$$

ii) Unconditional relative root mean square error, divided by the target:

$$\mathrm{rmse} = \sqrt{\frac{1}{500}\sum_{\mathrm{runs}}\bigl(\hat{T} - T\bigr)^2} \Big/ T$$
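The two measures can be computed directly; a minimal sketch, where `estimates` holds the 500 values of $\hat{T}$:

```python
import numpy as np

def rel_bias(estimates, T):
    """Unconditional relative bias: mean deviation across runs over target."""
    return (np.mean(estimates) - T) / T

def rel_rmse(estimates, T):
    """Unconditional relative root mean square error over target."""
    e = np.asarray(estimates, dtype=float)
    return np.sqrt(np.mean((e - T) ** 2)) / T
```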

The results obtained are tabulated below.

From the results obtained, we observe that the unbiased cross-validation approach is a viable means of selecting the bandwidth, as it gives the lowest bias and root mean square error across all the estimators. The proposed estimator for the two-sample problem gives better estimates of the population total compared to those realized using the estimator proposed by [

Furthermore, we study the conditional performances of the selected estimators. The 500 samples obtained were sorted by the values of the mean of the auxiliary variable and put into 25 groups, each containing 20 values. We then compute the bias and root mean square error of each group, plot the conditional performances against the average of the sorted auxiliary-variable means, and report the behaviour of the conditional bias for the different bandwidths.

| Estimator | Formula | Comment |
|---|---|---|
| Nonparametric (NP) regression | $\hat{T}_{NP} = \sum_{i \in s} y_i + \sum_{j \in r} \hat{h}(x_j)$ | |
| Nonparametric (NPT) regression, twiced | $\hat{T}_{NPT} = \sum_{i \in s} \frac{y_i - \hat{h}(x_i)}{\pi_i} + \sum_{j \in U} \hat{h}(x_j)$ | $\pi_i$ = inclusion probabilities |
| Multiplicative (MBC) bias corrected | $\hat{T}_{MBC} = \sum_{i \in s} \frac{y_i - \hat{h}(x_i)}{\pi_i} + \sum_{j \in U} \hat{h}(x_j)$ | $\pi_i$ = inclusion probabilities |

| Estimator | Bandwidth | Bias/T | 10·rmse/T |
|---|---|---|---|
| NP (one sample) | ndr | 0.25 | 19.63 |
| | ndr0 | 0.26 | 20.14 |
| | bcv | 0.11 | 20.71 |
| | ucv | 0.37 | 17.10 |
| NP (s1 ∪ s2) | ndr | 0.01 | 10.50 |
| | ndr0 | 0.01 | 10.49 |
| | bcv | 0.45 | 11.19 |
| | ucv | 0.05 | 8.22 |
| NPT | ndr | 0.05 | 9.93 |
| | ndr0 | 0.24 | 10.32 |
| | bcv | 0.39 | 10.83 |
| | ucv | 0.09 | 8.54 |
| MBC | ndr | 0.20 | 10.23 |
| | ndr0 | 0.02 | 9.97 |
| | bcv | 0.23 | 10.17 |
| | ucv | 0.01 | 8.20 |

The aim of this study was to develop an estimator with the lowest bias for the finite population total using the multiplicative bias corrected approach to nonparametric regression. The study reveals that the proposed estimator is more efficient than the modified nonparametric estimator (NPT). With a suitable bandwidth selection (ucv), the proposed estimator has the smallest bias and root mean square error values. It has therefore proven effective in resolving the boundary bias problem associated with the existing nonparametric smoothers.

My first appreciation goes to my supervisors, Professor Odhiambo and Doctor Mageto, for accompanying me through this work. Many thanks also to the African Union for funding this scientific research and placing such confidence in its youth. Last but not least, thanks to my family for their support.

Stephane, K.T., Otieno, R.O. and Mageto, T. (2017) A Multiplicative Bias Correction for Nonparametric Approach and the Two Sample Problem in Sample Survey. Open Journal of Statistics, 7, 1053-1066. https://doi.org/10.4236/ojs.2017.76073