
In leading petroleum-producing countries such as Kuwait, Brazil, Iran, Iraq and Mexico, oil spills frequently occur on land, causing serious damage to crop fields. Soil remediation requires constant monitoring of the polluted area. One common monitoring method is two-dimensional systematic sampling, which can be used to estimate the proportion of contaminated soil and to study the geographic distribution of the spills. A well-known issue with this sampling design is the analytical derivation of the variance of the sample mean (proportion), which requires at least two independent samples. To address the problem, this research proposes a variance estimator based on regression and a corrected estimator that uses the Geary autocorrelation index, both under the model-assisted approach. The construction of the estimators is assisted by geo-statistical models through the simulation of an auxiliary variable. Populations similar to those found in real oil spills were recreated, and the accuracy of the proposed estimators was evaluated by comparing their performance with that of other well-known estimators. The factors considered in this simulation study were: a) the model for simulating the populations (exponential and wave), b) the mean and the variance of the process, and c) the level of autocorrelation among units. Considering the statistical and computational criteria (bias, ratio between estimated and real variance, convergence and computer time), the regression estimator showed the best performance under the exponential model, and the corrected version performed even better under the wave model.

Two-dimensional sampling is frequently applied to small areas, so the populations involved in the study of petroleum spills are typically small. In places where these spills are common, oil contamination causes serious damage to soils and water bodies. Remediating the soil is expensive and requires careful monitoring. During this process, a soil sample is taken to estimate the proportion of contaminated area; however, this method can be problematic if it does not yield accurate results. Historically, three different approaches have been used to perform and analyze sampling: design-based, model-based and model-assisted. One potentially superior monitoring strategy is the systematic sampling design, which offers uniform coverage of the study area. With this strategy, not only are point estimates of the proportion and its variance obtained, but the collected information can also be used to perform geo-statistical studies of the distribution of the polluted area.

Nevertheless, this sampling design has an unresolved difficulty. To obtain an unbiased estimate of the variance of the proportion, at least two independent samples are required; otherwise, by using just one sample, only an approximation of the variance can be computed [

This paper introduces two new variance estimators constructed under the model-assisted approach based on geo-statistical models. The estimators are the regression estimator of the variance, $\hat{V}_{REG}(\hat{p})$, and its corrected version, $\hat{V}_{REG\_G}(\hat{p})$, which takes the existing autocorrelation among the units in the sample into account through the Geary Index [

The article is organized as follows. Section 2 provides a brief explanation of the two-dimensional systematic sampling. In Section 3, the estimators used to overcome the variance estimator issues are introduced. Section 4 gives a description of the design and the simulation study. The results and discussion are presented in Section 5. Finally, Section 6 formulates the conclusions and offers some recommendations.

Systematic sampling designs are commonly used in real-life applications due to their straightforward implementation. Moreover, when proportions are estimated in finite populations, their results are frequently more efficient than those of other sampling designs (e.g., simple random sampling, stratified sampling, etc.). These properties make the design attractive for an area where the population units are arranged in a regularly spaced array. The design provides uniform coverage of the area, and the location of the sample units can be used to account for the spatial correlation. For example, when geographically close sampling units show a high positive correlation (geographically closer units tend to be more similar than units more distant from each other, as in the case of an oil spill), it is possible to obtain more accurate estimations.

Two-dimensional systematic sampling consists of the random selection of an initial point, and the remaining points are selected by following a regular pattern (e.g. a rectangular or a square arrangement). In these arrangements, a sample is obtained by randomly selecting a square in the first domain (

Thus, when a two-dimensional systematic sample of size $n$ is needed, the $N$ squares are grouped into $n = n_C \times n_R$ non-overlapping rectangular sub-regions or domains, and each domain contains $k = k_C \times k_R$ squares, where $k = N/n$ is the number of all possible systematic samples.

Suppose there is a binary random variable $Z = f(t)$ that may take values 0 or 1 for a fixed value of $t$. Then, an unbiased estimator of the population proportion $p$ (mean) of $Z$ in $D$ by using the $j^{th}$ two-dimensional systematic sample is given by

$$\hat{p}_{SYS.j} = \frac{\sum_{i=1}^{n} z_{ij}}{n}, \tag{1}$$

where

$$z_{ij} = \begin{cases} 1 & \text{if } T \geq t \\ 0 & \text{if } T < t \end{cases}$$

with $T$ as a continuous random variable and $t$ as a fixed predetermined threshold.

The sampling error provides information about the variance of the estimator. An unbiased variance estimator for the population proportion (mean) can be obtained through

$$V_{SYS} = \frac{\sum_{j=1}^{k} (\hat{p}_{SYS.j} - p)^2}{k}. \tag{2}$$
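As an illustration of (1) and (2), the following minimal sketch enumerates every systematic sample of a small toy grid and averages the squared deviations of the sample proportions. The grid values, the function names and the domain size (2 × 2, so k = 4) are assumptions made for this example only; they are not taken from the study.

```python
def systematic_samples(grid, k_r, k_c):
    """Yield each of the k = k_r * k_c two-dimensional systematic samples.

    grid is a list of rows holding the binary values z; a sample takes the
    unit at the same within-domain position (r0, c0) in every domain.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    for r0 in range(k_r):
        for c0 in range(k_c):
            yield [grid[r][c]
                   for r in range(r0, n_rows, k_r)
                   for c in range(c0, n_cols, k_c)]


def proportion(sample):
    """Equation (1): sample proportion of units with z = 1."""
    return sum(sample) / len(sample)


def design_variance(grid, k_r, k_c):
    """Equation (2): mean squared deviation of the k sample proportions."""
    p_hats = [proportion(s) for s in systematic_samples(grid, k_r, k_c)]
    n_units = sum(len(row) for row in grid)
    p = sum(sum(row) for row in grid) / n_units  # true population proportion
    return sum((p_hat - p) ** 2 for p_hat in p_hats) / len(p_hats)


# Toy 4 x 4 population (hypothetical): 1 = contaminated, 0 = clean.
grid = [[1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0]]

print([proportion(s) for s in systematic_samples(grid, 2, 2)])
print(design_variance(grid, 2, 2))
```

Note that computing (2) needs all k samples (or the true p), which is exactly what is unavailable when only one systematic sample has been drawn.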

Under this sampling design, computing an unbiased estimator for the variance requires at least two independent systematic samples, and by using a single sample, only approximations can be derived [

The estimation of the sampling error through a single systematic sample is more difficult in two dimensions than for the linear case, because the units are arranged in a plane instead of a line [

For selecting samples and performing correspondent analysis, three different approaches have been considered: design-based, model-based and model-assisted. In the design-based approach [

The results (estimations) of these approaches are not directly comparable, because they arise under different assumptions. Nevertheless, a few authors have tried to give explanations to why and how to perform comparisons. For example, Särndal [

Next, the variance estimators that were considered in this study for comparison purposes are described. Then, the proposed models are introduced.

A) Design-based approaches

1) Simple random sampling estimator

The variance estimator of the proportion (mean) under the simple random sampling estimation scheme can be written as

$$\hat{V}_{SRS}(\hat{p}) = (1 - f) \frac{\hat{p}(1 - \hat{p})}{n - 1}, \tag{3}$$

where $\hat{p}$ is as defined in (1) and $f$ is the proportion of the population selected for a sample. The factor $(1 - f)$ is called the finite population correction or adjustment. In sampling without replacement, the sample variance is reduced by this factor.
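A direct transcription of (3) is sketched below. The function name and the toy numbers (n = 4 sampled units out of N = 16) are assumptions for illustration.

```python
def v_srs(sample, population_size):
    """Equation (3): SRS variance estimator of a proportion with the
    finite population correction (1 - f), where f = n / N."""
    n = len(sample)
    p_hat = sum(sample) / n
    f = n / population_size
    return (1 - f) * p_hat * (1 - p_hat) / (n - 1)


# Hypothetical sample: one contaminated unit among four, N = 16.
print(v_srs([1, 0, 0, 0], 16))
```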

The variance estimator (3) of the proportion (mean) under the simple random sampling scheme is unique in a way that it can be used without taking the spatial array of the units into account. Good estimations are expected if the distance between the sampled squares is large enough to have a small spatial correlation, or zero in the best case. In the presence of high homogeneity between sampled units the variance will be underestimated [

2) Geary’s spatial autocorrelation index

Marcello [

The first estimator that formally takes the presence of the correlation into account is constructed with the autocorrelation Geary Index, and can be written as

$$\hat{V}_{GI}(\hat{p}) = \hat{V}_{SRS}(\hat{p})\, c_j, \tag{4}$$

where

$$c_j = \frac{(n - 1) \sum_{i=1}^{n} \sum_{l \neq i}^{n} (z_l - z_i)^2 \delta_{il}}{2 \sum_{i=1}^{n} \sum_{l \neq i}^{n} \delta_{il} \sum_{i=1}^{n} (z_i - \hat{p})^2}$$

is the Geary index computed for the $j^{th}$ sample, $\delta_{il} = 1$ if the $i^{th}$ and the $l^{th}$ units are in adjacent domains or 0 otherwise, and $z$ is defined as before (i.e., binary outcome). Here, $c_j$ measures the degree of similarity among sampling units.
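The correction in (4) can be sketched as follows. The text only says that units are in "adjacent domains"; this example assumes rook adjacency (domains sharing an edge) and a row-major layout of the sample values, and both choices, like the function names, are assumptions rather than the definition used in the study.

```python
def geary_index(z, n_rows, n_cols):
    """Geary index c_j for sample values z stored row-major, one per domain.

    ASSUMPTION: delta_il = 1 when domains i and l share an edge (rook rule).
    """
    n = len(z)
    if n != n_rows * n_cols:
        raise ValueError("len(z) must equal n_rows * n_cols")
    p_hat = sum(z) / n
    squared_diffs = 0.0
    links = 0
    for i in range(n):
        ri, ci = divmod(i, n_cols)
        for l in range(n):
            rl, cl = divmod(l, n_cols)
            if abs(ri - rl) + abs(ci - cl) == 1:  # adjacent, so l != i
                squared_diffs += (z[l] - z[i]) ** 2
                links += 1
    ss = sum((zi - p_hat) ** 2 for zi in z)
    if ss == 0 or links == 0:
        raise ValueError("index undefined: constant sample or no neighbours")
    return (n - 1) * squared_diffs / (2 * links * ss)


def v_geary(z, n_rows, n_cols, population_size):
    """Equation (4): SRS variance estimator (3) multiplied by c_j."""
    n = len(z)
    p_hat = sum(z) / n
    v_srs = (1 - n / population_size) * p_hat * (1 - p_hat) / (n - 1)
    return v_srs * geary_index(z, n_rows, n_cols)


clustered = [1, 1, 0, 0,
             1, 1, 0, 0]          # 2 x 4 sample, contaminated units together
print(geary_index(clustered, 2, 4))
```

In the clustered toy sample the index falls below 1, so the SRS estimate is scaled down; for an alternating (checkerboard) pattern it rises above 1.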

3) Moran’s spatial autocorrelation statistic

This estimator is constructed with Moran's spatial autocorrelation statistic:

$$\hat{V}_{MS}(\hat{p}) = \hat{V}_{SRS}(\hat{p})\, w_j, \tag{5}$$

where

$$w_j = \left[ 1 + \frac{2}{\log(l_j)} + \frac{2}{\left( \frac{1}{l_j} - 1 \right)} \right],$$

and

$$l_j = \frac{n}{\sum_{i=1}^{n} \sum_{l \neq i}^{n} \delta_{il}} \cdot \frac{\sum_{i=1}^{n} \sum_{l \neq i}^{n} (y_i - \hat{p})(y_l - \hat{p}) \delta_{il}}{\sum_{i=1}^{n} (y_i - \hat{p})^2}$$

and $\delta_{il}$ is defined as in (4).

Here, $l_j$ is the Moran autocorrelation statistic computed for the $j^{th}$ sample, and it measures the degree of dissimilarity between sampling units.
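The sketch below computes the Moran statistic and the multiplier of (5), reading the multiplier as 1 + 2/ln(l) + 2/(1/l − 1) with a natural logarithm. Three things are assumptions of this example and are not stated in the text: rook adjacency between domains, the use of the natural logarithm, and the fallback to a multiplier of 1 (no correction) when the statistic lies outside (0, 1), where the expression is undefined or not meaningful.

```python
import math


def moran_index(y, n_rows, n_cols):
    """Moran statistic l_j for sample values y stored row-major.

    ASSUMPTION: delta_il = 1 when domains i and l share an edge (rook rule).
    """
    n = len(y)
    if n != n_rows * n_cols:
        raise ValueError("len(y) must equal n_rows * n_cols")
    p_hat = sum(y) / n
    cross = 0.0
    links = 0
    for i in range(n):
        ri, ci = divmod(i, n_cols)
        for l in range(n):
            rl, cl = divmod(l, n_cols)
            if abs(ri - rl) + abs(ci - cl) == 1:
                cross += (y[i] - p_hat) * (y[l] - p_hat)
                links += 1
    ss = sum((yi - p_hat) ** 2 for yi in y)
    if ss == 0 or links == 0:
        raise ValueError("statistic undefined: constant sample or no neighbours")
    return (n / links) * (cross / ss)


def moran_multiplier(l_j):
    """Multiplier w_j of equation (5).

    ASSUMPTION: return 1.0 (plain SRS estimator) when l_j is not in (0, 1).
    """
    if not 0.0 < l_j < 1.0:
        return 1.0
    return 1.0 + 2.0 / math.log(l_j) + 2.0 / (1.0 / l_j - 1.0)


l_j = moran_index([1, 1, 0, 0], 1, 4)   # hypothetical 1 x 4 sample
print(l_j, moran_multiplier(l_j))
```

Under this reading the multiplier decreases from 1 toward 0 as the statistic moves from 0 toward 1, so stronger positive autocorrelation shrinks the SRS estimate more.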

B) Model-based approach

Briefly, this approach considers the $N$ values of a population $\{y_1, y_2, \cdots, y_N\}$ as realized outcomes of $N$ random variables $\{Y_1, Y_2, \cdots, Y_N\}$, resulting in an $N$-dimensional joint distribution also known as the superpopulation.

In the model-based approach under geo-statistical modeling, Aubry & Debouzie [

$$\hat{V}_{MB}(\hat{p}) = \frac{1}{S - 1} \sum_{s=1}^{S} (E_s - \bar{E})^2, \tag{6}$$

where $\bar{E} = \frac{\sum_{s=1}^{S} E_s}{S}$ and $E_s = \hat{p}_s - P_s$.

In this case, $\hat{p}_s$ is the estimate of the proportion $P_s$ obtained by using the $j^{th}$ sample for simulating the population $\{z_1, z_2, \cdots, z_N\}$ in the $s^{th}$ realization (iteration).
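Given the per-realization estimates and true proportions, (6) is the sample variance of the estimation errors. The sketch below assumes these two lists are already available from the simulated realizations; the numbers are invented for illustration, and far fewer than the number of realizations a real run would use.

```python
def v_model_based(p_hats, p_trues):
    """Equation (6): sample variance of the errors E_s = p_hat_s - P_s
    over S simulated realizations (S must be at least 2)."""
    errors = [p_hat - p_true for p_hat, p_true in zip(p_hats, p_trues)]
    s = len(errors)
    e_bar = sum(errors) / s
    return sum((e - e_bar) ** 2 for e in errors) / (s - 1)


# Hypothetical values for S = 3 realizations.
print(v_model_based([0.50, 0.25, 0.75], [0.50, 0.50, 0.50]))
```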

C) Model-assisted approaches

This research proposes a regression estimator and its correction by using the Geary Index. These estimators arise from the model-assisted approach, and their construction is assisted by simulating the auxiliary variable, the concentration of Total Petroleum Hydrocarbons (TPH).

1) Single regression estimator

The variance estimator of the proportion can be written as

$$\hat{V}_{REG\_single}(\hat{p}) = \left( 1 - \frac{n}{N} \right) \left( \frac{1}{n} \right) S_{\hat{e}}^2, \tag{7}$$

where

$$S_{\hat{e}}^2 = \frac{\sum_{i=1}^{n} (\hat{e}_i - \bar{\hat{e}})^2}{n - 1}, \quad \bar{\hat{e}} = \frac{\sum_{i=1}^{n} \hat{e}_i}{n}, \quad \hat{e}_i = y_i - \hat{y}_i,$$

and $\hat{y}_i = \hat{p}_s + \beta_2 (x_i - \bar{x}_s)$ is the predicted y-value for the $i^{th}$ unit using the $j^{th}$ sample, with

$$\beta_2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

Here $\hat{y}_i$ and $\beta_2$ are constructed by simulating the auxiliary covariate $x_i$ (TPH concentration), and $z_i$ denotes the corresponding discrete value (i.e., $z_i = 1$ if $x_i \geq 1000$ and $z_i = 0$ otherwise; a unit is considered contaminated if the TPH concentration is greater than or equal to 1000). Then, the auxiliary variable is simulated for the entire population, and the units that occupy the same position in the original sample across domains are selected for analysis.
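The arithmetic of (7) can be sketched as below for a single set of auxiliary values. This example assumes that the binary response y in the residuals is the indicator z obtained from the 1000 threshold, and it takes the x values as given rather than simulating them from a geo-statistical model; the function name and the toy numbers are hypothetical.

```python
def v_reg_single(x, population_size, threshold=1000.0):
    """Equation (7): regression variance estimator of the proportion.

    x holds the (simulated) TPH values of the n sampled units.
    ASSUMPTION: the response in the residuals is z_i = 1 if x_i >= threshold.
    """
    n = len(x)
    z = [1 if xi >= threshold else 0 for xi in x]
    x_bar = sum(x) / n
    p_hat = sum(z) / n                      # z_bar, the sample proportion
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all auxiliary values are equal; slope undefined")
    beta = sum((xi - x_bar) * (zi - p_hat) for xi, zi in zip(x, z)) / sxx
    residuals = [zi - (p_hat + beta * (xi - x_bar)) for xi, zi in zip(x, z)]
    e_bar = sum(residuals) / n
    s2 = sum((e - e_bar) ** 2 for e in residuals) / (n - 1)
    return (1 - n / population_size) * s2 / n


# Hypothetical TPH values (ppm) for n = 4 units out of N = 16.
print(v_reg_single([900.0, 950.0, 1050.0, 1100.0], 16))
```

For these toy values the result (0.00625) is well below the SRS value from (3) for the same indicators (0.0625), because the auxiliary variable explains most of the variation in z.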

2) Correction of the regression estimator using the Geary Index

The variance estimator of the proportion using the Geary Index can be written as

$$\hat{V}_{REG\_single\_c_j}(\hat{p}) = \hat{V}_{REG\_single}(\hat{p})\, c_j, \tag{8}$$

where the terms are as defined previously.

D) Model-assisted approach (averaged)

The estimators (7) and (8) are obtained by using only one realization of the TPH population, whereas the model-based estimator (6) needs a great number of iterations. To compare accuracy under the same number of iterations as estimator (6), the averaged estimators (9) and (10) are introduced:

$$\bar{\hat{V}}_{REG.AV}(\hat{p}) = \frac{\sum_{s=1}^{S} \hat{V}_{REG}(\hat{p}_s)}{S}, \tag{9}$$

where $\hat{V}_{REG}(\hat{p}_s)$ is obtained for the $s^{th}$ iteration.

The correction of (9) using the Geary Index can be written as

$$\bar{\hat{V}}_{REG.AV\_c_j}(\hat{p}) = \bar{\hat{V}}_{REG.AV}(\hat{p})\, c_j. \tag{10}$$

The estimation process in the model-based approach requires a great number of iterations; at least 10,000 are recommended by Aubry and Debouzie [

The first part, called one-step simulation, considered 34 cases that were derived from combinations of the following factors.

Geo-statistical model. Two models were selected for generating data:

The wave model and the exponential model [

The mean and the variance of the process. Three values that are common in this type of populations were used for the mean (980, 1000, 1020) and two for the variance (300, 600).

Autocorrelation index. Three different levels of autocorrelation indices were employed: low, medium and high (1.5, 2.8 and 4.8), respectively.
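To make the contrast between the two geo-statistical models concrete, the sketch below evaluates one commonly used parameterization of the exponential and wave (hole-effect) correlation functions. The functional forms, the range parameter `phi` and the distances are assumptions of this example; the text does not state the parameterization used in the study, nor whether the levels 1.5, 2.8 and 4.8 correspond to this parameter.

```python
import math


def exponential_corr(h, phi):
    """Exponential model: correlation decays monotonically with distance h."""
    return math.exp(-h / phi)


def wave_corr(h, phi):
    """Wave (hole-effect) model: damped oscillation, can become negative."""
    if h == 0:
        return 1.0                 # limit of sin(u) / u as u -> 0
    u = h / phi
    return math.sin(u) / u


# Illustrative ranges only (ASSUMPTION: reusing 1.5, 2.8, 4.8 as phi).
for phi in (1.5, 2.8, 4.8):
    exp_row = [round(exponential_corr(h, phi), 3) for h in (0, 1, 2, 4, 8)]
    wave_row = [round(wave_corr(h, phi), 3) for h in (0, 1, 2, 4, 8)]
    print(phi, exp_row, wave_row)
```

Under the exponential form nearby units are always positively correlated, whereas under the wave form the correlation alternates in sign with distance, which is one reason an estimator can behave differently under the two models.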

Under the exponential model, estimators (3), (4), (5) and (7) were evaluated, while for the wave model, estimator (8) was also considered. Estimator (8) was employed because it provides better results.

In the second part of the study, called the averaging simulation, 6 cases were evaluated. Three cases corresponded to the exponential model, and the remaining ones to the wave model. In the first group, the mean (980) and the variance (300) were kept constant for all autocorrelation levels (1.5, 2.8 and 4.8); while in the second group, they were held at 1020 and 600, respectively.

The averaging simulation was carried out to observe the performance of the model-based estimator (6) and the model-assisted estimators (9) and (10) by using the same number of iterations (1000). These results were compared against the estimators (7) and (8), which were constructed with a single iteration.

1) For each one of the 34 cases, the TPH values $(w_1, w_2, \cdots, w_N)$ were generated, and this population was considered as the real population. Next, the corresponding discrete values $(y_1, y_2, \cdots, y_N)$ were assigned: units with TPH contents higher than 1000 ppm received the value 1, and 0 otherwise. The simulation procedure was as described below.

2) Divide the real population into 9 systematic samples.

a) With a single sample, estimate the parameters (mean, variance and scale) for the geo-statistical model.

b) Several candidate models for semi-variogram were tested; the comparison criteria for selecting the best model were the mean square error and the Akaike Information Criteria [

c) Using the estimated model, the population of TPH was simulated once, and the estimators (3)-(5), (7) and (8) were computed.

3) Repeat steps (a)-(c) for the 9 systematic samples.

4) Compute the ratio $R$ between the average estimated variance of the proportion, using the current estimator, and the variance of the proportion computed from all the systematic samples of the real population as

$$R = \frac{E(\hat{V}(\hat{p}))}{V(\hat{p})} = \frac{\sum_{j=1}^{9} \hat{V}(\hat{p}_j)}{\sum_{j=1}^{9} (\bar{\hat{p}} - \hat{p}_j)^2}, \tag{11}$$

where $\hat{V}(\hat{p}_j)$ is the estimated variance for the $j^{th}$ simulated sample using the current estimator;

$$\bar{\hat{p}} = \frac{\sum_{j=1}^{9} \hat{p}_j}{9}$$

corresponds to the mean of the estimated proportions over all samples of the real population. The best estimators are those for which $R = 1$. Values of $R$ greater than 1 indicate that the variance of the proportion is overestimated, and values less than 1 indicate that it is underestimated.
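A minimal sketch of the ratio in (11) follows. It takes the per-sample variance estimates and sample proportions as inputs; the function name and the four toy samples (instead of the nine used in the study) are assumptions for illustration.

```python
def variance_ratio(v_hats, p_hats):
    """Equation (11): sum of estimated variances over the sum of squared
    deviations of the sample proportions from their mean."""
    p_bar = sum(p_hats) / len(p_hats)
    spread = sum((p_bar - p_hat) ** 2 for p_hat in p_hats)
    if spread == 0:
        raise ValueError("all sample proportions are equal; ratio undefined")
    return sum(v_hats) / spread


# Hypothetical proportions from 4 systematic samples and one estimator's
# variance estimate for each of them.
p_hats = [0.75, 0.25, 0.25, 0.25]
v_hats = [0.046875] * 4
print(variance_ratio(v_hats, p_hats))
```

Because numerator and denominator are sums over the same number of samples, dividing each by that number would not change the ratio.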

5) Steps (1)-(4) were repeated 1000 times, so that 1000 initial populations of TPH were considered, and the results of these replications were averaged.

1) Same as step (1) of the one-step simulation.

2) Same as steps (2)(a)-(b) of the one-step simulation, but in step (c), instead of simulating one population of TPH, 1000 populations were generated. For each simulated population, estimators (3)-(6), (9) and (10) were obtained, and at the end these values were averaged. Estimators (7) and (8) were also calculated by using only one of these populations.

3) Repeat (a)-(c) for the 9 systematic samples.

4) Compute the ratio as in step (4) of the one-step simulation.

5) Steps (1)-(4) were repeated 200 times, so that 200 populations of TPH were considered; the results of these iterations were averaged.

For performing the simulations an R (2.3.1) [

To determine the best performance, the following criteria were considered: an average ratio close to 1; minimum risk of underestimating the parameter; minimum mean square error; and stability and accuracy across the different levels of autocorrelation for each model. Using these criteria, the selection strategy for the best estimator was set up as follows. First, those estimators that incurred serious underestimation across the different factors were discarded. Then the accuracy and the mean square error were used, in that order, as the essential criteria for deciding on the best estimator.

Under the exponential model, the estimators (3), (4) and (7) showed a similar trend (

For the wave model, by reviewing the behavior through the levels of autocorrelation, two groups of estimators were found. In the first one, the trend of estimates made with (3) and (4) increased as the autocorrelation level increased (

according to [

In this simulation study, the estimator (7) showed a periodic and opposite behavior in the accuracy when comparing the exponential and wave models. This behavior is linked to the amount of autocorrelation among sampled units. In the exponential model, the accuracy improved, and the mean square error decreased as the level of autocorrelation increased (

Under the wave model, different effects occurred: the accuracy decreased and the mean square error remained practically unchanged as the autocorrelation increased (

The selection of an estimator must be performed carefully by considering the possible implications and risks of each option. For example, in both models the estimator (5) showed a particular behavior: its estimates were the most accurate but, in many cases, incurred serious underestimation. The trend of estimator (4) was the opposite. In both models, the bias increased as the autocorrelation increased. The estimator (7) did not show any change in trends through the models. Finally, the estimator (3), which comes from simple random sampling, produced the worst estimations, always overestimating the variance of the proportion.

Mean | Variance | Autocorrelation level | Estimator (3) | Estimator (4) | Estimator (5) | Estimator (7)
---|---|---|---|---|---|---
980 | 300 | 1.5 | 2.32E−08 | 2.06E−08 | 2.47E−08 | 1.56E−08
980 | 300 | 2.8 | 3.25E−08 | 1.97E−08 | 1.90E−08 | 1.09E−08
980 | 300 | 4.8 | 3.86E−08 | 1.40E−08 | 1.28E−08 | 8.48E−09
980 | 600 | 1.5 | 5.83E−08 | 4.91E−08 | 5.11E−08 | 3.84E−08
980 | 600 | 2.8 | 7.41E−08 | 4.14E−08 | 3.61E−08 | 2.08E−08
980 | 600 | 4.8 | 9.31E−08 | 3.19E−08 | 2.74E−08 | 1.50E−08
1000 | 300 | 1.5 | 1.38E−07 | 1.11E−07 | 1.16E−07 | 1.17E−07
1000 | 300 | 2.8 | 1.74E−07 | 8.50E−08 | 7.46E−08 | 5.01E−08
1000 | 300 | 4.8 | 2.11E−07 | 6.28E−08 | 5.55E−08 | 2.67E−08
1000 | 600 | 1.5 | 1.39E−07 | 1.10E−07 | 1.03E−07 | 1.04E−07
1000 | 600 | 2.8 | 1.80E−07 | 8.93E−08 | 7.17E−08 | 4.88E−08
1000 | 600 | 4.8 | 2.20E−07 | 6.81E−08 | 5.37E−08 | 2.77E−08
1020 | 300 | 1.5 | 2.45E−08 | 2.15E−08 | 2.59E−08 | 1.72E−08
1020 | 300 | 2.8 | 3.10E−08 | 1.87E−08 | 1.81E−08 | 1.04E−08
1020 | 300 | 4.8 | 4.17E−08 | 1.59E−08 | 1.35E−08 | 9.63E−09
1020 | 600 | 1.5 | 5.60E−08 | 4.72E−08 | 5.17E−08 | 3.91E−08
1020 | 600 | 2.8 | 7.07E−08 | 3.91E−08 | 3.93E−08 | 2.18E−08
1020 | 600 | 4.8 | 9.25E−08 | 3.28E−08 | 2.85E−08 | 1.57E−08

Mean | Variance | Autocorrelation level | Estimator (3) | Estimator (4) | Estimator (5) | Estimator (7) | Estimator (8)
---|---|---|---|---|---|---|---
980 | 300 | 1.5 | 7.89E−08 | 4.02E−08 | 7.80E−10 | 2.11E−08 | 1.31E−08
980 | 300 | 2.8 | 8.48E−08 | 1.40E−08 | 1.45E−09 | 2.31E−08 | 7.84E−09
980 | 300 | 4.8 | −− | −− | −− | −− | −−
980 | 600 | 1.5 | 1.57E−07 | 7.59E−08 | 5.05E−09 | 2.26E−08 | 1.89E−08
980 | 600 | 2.8 | 1.88E−07 | 2.72E−08 | 3.10E−09 | 3.07E−08 | 1.15E−08
980 | 600 | 4.8 | 1.86E−07 | 8.76E−09 | 4.52E−09 | 2.79E−08 | 7.29E−09
1000 | 300 | 1.5 | 3.76E−07 | 1.58E−07 | 7.64E−09 | 2.12E−08 | 2.35E−08
1000 | 300 | 2.8 | 4.17E−07 | 5.23E−08 | 8.13E−09 | 3.03E−08 | 1.70E−08
1000 | 300 | 4.8 | 4.00E−07 | 1.73E−08 | 1.08E−08 | 3.04E−08 | 1.33E−08
1000 | 600 | 1.5 | 3.70E−07 | 1.56E−07 | 9.27E−09 | 2.11E−08 | 2.43E−08
1000 | 600 | 2.8 | 4.25E−07 | 5.20E−08 | 6.27E−09 | 3.09E−08 | 1.58E−08
1000 | 600 | 4.8 | 4.00E−07 | 1.70E−08 | 1.15E−08 | 2.90E−08 | 1.40E−08
1020 | 300 | 1.5 | 7.05E−07 | 3.80E−08 | 2.49E−09 | 1.75E−08 | 1.34E−08
1020 | 300 | 2.8 | 8.46E−08 | 1.38E−08 | 1.46E−09 | 2.26E−08 | 7.86E−09
1020 | 300 | 4.8 | −− | −− | −− | −− | −−
1020 | 600 | 1.5 | 1.66E−07 | 8.17E−08 | 4.66E−09 | 2.39E−08 | 1.90E−08
1020 | 600 | 2.8 | 1.82E−07 | 2.66E−08 | 2.96E−09 | 3.02E−08 | 1.13E−08
1020 | 600 | 4.8 | 1.82E−07 | 8.23E−09 | 4.28E−09 | 2.57E−08 | 7.77E−09

In this simulation study, estimator (6) was introduced, which was constructed under the model-based approach. Using this approach, a great number of iterations were necessary to produce reliable results. In this case, 1000 iterations were used. Estimators (9) and (10) were introduced for two comparison purposes: first, to compare their behavior against the model-based estimator under the same number of iterations; and second, to compare their behavior against estimators (7) and (8), which perform the estimation procedure by using only one iteration. Estimators (3), (4) and (5) were also included as references.

Applying the same selection criteria from the one-step simulation to the exponential model, the best performance was shown by estimators (6), (7) and (9). Their estimates (accuracy and mean square error) were close to each other. The main difference lies in the fact that the second of them used only 200 iterations in its construction instead of the 200 × 1000 = 200,000 iterations for the others (

Under the wave model, by adopting a conservative posture, the best estimates were provided through estimators (6), (8) and (10) (see

In general, this simulation study shows that for the exponential model estimators (6), (7) and (9) presented similar values for the ratio; however, estimator (7) is preferable because it uses a reduced number of iterations to obtain reliable results. For the wave model, estimator (8) is preferred because it offers the most accurate estimates at the lowest level of computer time.

Estimator | Ratio (autocorr. 1.5) | Ratio (autocorr. 2.8) | Ratio (autocorr. 4.8) | MSE (autocorr. 1.5) | MSE (autocorr. 2.8) | MSE (autocorr. 4.8)
---|---|---|---|---|---|---
(3) | 1.73 | 2.37 | 2.64 | 1.68E−08 | 3.32E−08 | 3.88E−08
(4) | 1.64 | 2.00 | 1.97 | 1.72E−08 | 1.97E−08 | 1.37E−08
(5) | 0.98 | 0.79 | 0.64 | 1.24E−08 | 1.54E−08 | 1.50E−08
(6) | 1.29 | 1.38 | 1.28 | 1.26E−08 | 9.93E−09 | 6.75E−09
(7) | 1.06 | 1.43 | 1.55 | 1.53E−08 | 1.08E−08 | 8.95E−09
(9) | 1.06 | 1.42 | 1.56 | 1.93E−08 | 9.98E−09 | 7.50E−09

Estimator | Ratio (autocorr. 1.5) | Ratio (autocorr. 2.8) | Ratio (autocorr. 4.8) | MSE (autocorr. 1.5) | MSE (autocorr. 2.8) | MSE (autocorr. 4.8)
---|---|---|---|---|---|---
(3) | 5.84 | 9.55 | 10.94 | 1.58E−07 | 1.74E−07 | 1.78E−07
(4) | 4.31 | 4.43 | 3.39 | 7.25E−08 | 2.48E−08 | 9.76E−09
(5) | 1.30 | 1.00 | 0.69 | 3.62E−09 | 2.70E−08 | 6.08E−09
(6) | 3.45 | 1.85 | 1.60 | 6.58E−08 | 9.53E−09 | 4.08E−09
(7) | 2.90 | 4.68 | 5.26 | 2.20E−08 | 3.04E−08 | 2.75E−08
(8) | 2.14 | 2.17 | 1.72 | 8.61E−09 | 5.24E−09 | 8.86E−09
(9) | 2.90 | 4.67 | 5.37 | 2.17E−08 | 2.79E−08 | 2.43E−08
(10) | 2.14 | 2.17 | 1.68 | 2.38E−08 | 2.04E−08 | 1.73E−08

Both simulation studies show promising results that can help improve the accuracy of estimates when performing two-dimensional systematic sampling. The accuracy depends on factors that reflect the structure of the population, the relationships among the units in the sample and the use of simulated auxiliary information. In general, in the one-step simulation the best results are obtained with estimators constructed under the model-assisted approach and/or taking the presence of autocorrelation into account. When the population presents a structure such as the one produced by the exponential model, estimator (7) is recommended, which shows a periodic behavior for the autocorrelation. In contrast, for the wave model, the best estimator is (8), as the estimates improve as the autocorrelation increases. In the averaging simulation, estimators from two different approaches (model-based vs. model-assisted) were compared using the same number of iterations. For the exponential and wave models the best accuracy measures are obtained with estimators (9) and (10), respectively. The estimator (6), which has values close to the estimators mentioned above, seems more robust to the choice of model, but its computation requires a great number of iterations. The estimates of (7) and (8) are as accurate as those obtained with estimators (9) and (10), using only one iteration.

As a result, the regression estimator (7) and its corrected version (8) by the Geary Index are recommended. Within the model-assisted approach, these estimators do not need a great number of iterations in order to achieve estimations as accurate as those obtained with the more complex approaches. Finally, since the systematic sampling method is broadly used in many research areas, this methodology is not limited to this particular problem (petroleum spills). It is easily adaptable to other cases where this sampling design is used, including linear cases.

This research was funded by Mexico’s National Council of Science and Technology (Consejo Nacional de Ciencia y Tecnologia, CONACyT).

The authors declare no conflicts of interest regarding the publication of this paper.

Jarquin, D. (2018) Estimating the Variance of the Proportion of Contaminated Soil by Petroleum Spills Using Two-Dimensional Systematic Sampling under Different Approaches. Open Journal of Statistics, 8, 706-720. https://doi.org/10.4236/ojs.2018.84046