Open Journal of Statistics
Vol.4 No.3(2014), Article ID:44325,10 pages DOI:10.4236/ojs.2014.43017
Sampling Designs with Linear and Quadratic Probability Functions
Lennart Bondesson1, Anton Grafström2, Imbi Traat3
1Department of Mathematics and Mathematical Statistics, Umeå University, Umeå, Sweden
2Department of Forest Resource Management, Swedish University of Agricultural Sciences, Umeå, Sweden
3Institute of Mathematical Statistics, University of Tartu, Tartu, Estonia
Email: lennart.bondesson@math.umu.se, anton.grafstrom@slu.se, imbi.traat@ut.ee
Copyright © 2014 by authors and Scientific Research Publishing Inc.
This work is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0/
Received 20 January 2014; revised 20 February 2014; accepted 27 February 2014
ABSTRACT
Fixed size without replacement sampling designs with probability functions that are linear or quadratic functions of the sampling indicators are defined and studied. Generality, simplicity, remarkable properties, and also somewhat restricted flexibility characterize these designs. It is shown that the families of linear and quadratic designs are closed with respect to sample complements and with respect to conditioning on sampling outcomes for specific units. Relations between inclusion probabilities and parameters of the probability functions are derived and sampling procedures are given.
Keywords: Complementary Midzuno Design; Conditional Sample; Inclusion Probability; Midzuno Design; Mixture of Designs; Parameters of Design; Sample Complement; Sinha Design
1. Introduction
In the first part of the paper, we consider the most general fixed size without replacement sampling design with a probability function that is linear in the sampling inclusion indicators—the linear sampling design. The linear design, being an unequal probability design, is remarkable due to simple explicit relations between inclusion probabilities and parameters of the probability function. This enables sampling with desired inclusion probabilities, design-unbiased estimation and variance estimation. As special cases, the design covers the classical Midzuno [1] and the complementary Midzuno (see [2] ) designs. It is shown that the linear design can be seen as a mixture of the two types of Midzuno designs. It is also shown that the family of linear designs is closed with respect to conditioning on sampling outcomes. This property, as well as the mixture representation, offers easy methods for sampling from the linear design. The family is also closed with respect to sample complements, i.e. the complement of a sample from a linear design is a sample from another linear design.
In the second part of the paper, the fixed size without replacement design with quadratic probability function—the quadratic sampling design—is defined and studied. It is the natural extension of the linear design. A classical design by Sinha [3] is a special case of the quadratic design. His design aimed at sampling with prescribed second-order inclusion probabilities. Together with Sinha’s design, two other special quadratic designs are studied more closely. These three designs are easy to use for sampling. There are explicit expressions for the second-order inclusion probabilities of Sinha’s design. In the general case, the formulas are more complicated. A lemma is proved which relates the second-order inclusion probabilities and the parameters of the quadratic design. Like the family of linear designs, the family of quadratic designs is closed with respect to sample complements and with respect to conditioning on sampling outcomes. The last property makes list-sequential sampling from these designs efficient.
The linear and the quadratic designs are simple but somewhat restricted when the aim is sampling with prescribed first- or second-order inclusion probabilities. They cannot be used for all such probabilities. In the final section of the paper, possible extensions are mentioned.
2. Linear Sampling Designs
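We treat without replacement (WOR) sampling designs of size n from a population of size N. Let $I = (I_1, I_2, \dots, I_N)$ be the binary random sampling inclusion vector. A sampling design is given by its probability function $p(x) = \Pr(I = x)$, $x \in \{0,1\}^N$. Let $S_n = \{x \in \{0,1\}^N : \sum_{k=1}^N x_k = n\}$, i.e. $S_n$ is the set of all possible "samples" $x$ of size n.
Definition 1. A sampling design of size $n$ is called linear if there are real constants $c_1, c_2, \dots, c_N$ such that
$$p(x) \propto \sum_{k=1}^N c_k x_k, \quad x \in S_n.$$
Of course, equal $c_k$s give SRSWOR. Below it is always assumed that the $c_k$s are normalized to have sum 1. Then $p(x) = C^{-1} \sum_{k=1}^N c_k x_k$, where $C = \binom{N-1}{n-1}$. In fact, $\sum_{x \in S_n} \sum_{k=1}^N c_k x_k = \sum_{k=1}^N c_k \binom{N-1}{n-1} = \binom{N-1}{n-1}$.
The inclusion probabilities $\pi_k = \Pr(I_k = 1)$ of the linear design are given by
$$\pi_k = \frac{N-n}{N-1}\, c_k + \frac{n-1}{N-1}, \quad k = 1, \dots, N. \qquad (1)$$
Indeed, since the number of $x \in S_n$ with $x_k = 1$ and $x_j = 1$ is $\binom{N-1}{n-1}$ for $j = k$ and $\binom{N-2}{n-2}$ for $j \ne k$, we get $\pi_k = c_k + (1 - c_k)(n-1)/(N-1)$. Since $\pi_k \ge 0$, (1) shows that we must have $c_k \ge -(n-1)/(N-n)$. The $c_k$s can be expressed in terms of the $\pi_k$s as
$$c_k = \frac{(N-1)\pi_k - (n-1)}{N-n}. \qquad (2)$$
By similar algebra, the second-order inclusion probabilities $\pi_{jk} = \Pr(I_j = 1, I_k = 1)$, $j \ne k$, are given by
$$\pi_{jk} = \frac{n-1}{N-1} \left( \frac{N-n}{N-2}\,(c_j + c_k) + \frac{n-2}{N-2} \right)$$
and hence
$$\pi_{jk} = \frac{n-1}{N-2}\,(\pi_j + \pi_k) - \frac{n(n-1)}{(N-1)(N-2)}, \qquad (3)$$
i.e. the $\pi_{jk}$s are linear in the first-order inclusion probabilities.
Without restriction we may assume that the $c_k$s are given in increasing order. Obviously, in order to get $p(x) \ge 0$ for all $x \in S_n$, it is then necessary and sufficient that $\sum_{k=1}^n c_k \ge 0$. We obtain the following theorem.
Theorem 1. Let $\pi_1 \le \pi_2 \le \dots \le \pi_N$ be given numbers with sum n. Then there is a linear sampling design with these numbers as inclusion probabilities if and only if $\sum_{k=1}^n \pi_k \ge n(n-1)/(N-1)$.
Proof. By the relation (2), we see that $\sum_{k=1}^n c_k \ge 0$ is equivalent to $\sum_{k=1}^n \pi_k \ge n(n-1)/(N-1)$.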
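The relations between the coefficients and the inclusion probabilities can be checked numerically on a small population. The following is a minimal sketch, not part of the original paper: the function names and the example coefficients are illustrative assumptions, and exact rational arithmetic is used so that the brute-force enumeration and the closed-form expressions can be compared without rounding.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def linear_design(c, n):
    """Map each sample (tuple of unit indices) to its probability under the
    linear design with coefficients c (assumed to sum to 1)."""
    N = len(c)
    C = comb(N - 1, n - 1)                      # normalizing constant
    return {s: sum(c[k] for k in s) / C for s in combinations(range(N), n)}

def pi_first(c, n):
    """First-order inclusion probabilities: (N-n)/(N-1) * c_k + (n-1)/(N-1)."""
    N = len(c)
    return [Fraction(N - n, N - 1) * ck + Fraction(n - 1, N - 1) for ck in c]

def pi_second(pi, n, j, k):
    """Second-order inclusion probability expressed through pi_j and pi_k."""
    N = len(pi)
    return (Fraction(n - 1, N - 2) * (pi[j] + pi[k])
            - Fraction(n * (n - 1), (N - 1) * (N - 2)))

# Illustrative coefficients (hypothetical): they sum to 1, one is negative.
c = [Fraction(v, 20) for v in (-1, 2, 4, 6, 9)]
N, n = len(c), 3
p = linear_design(c, n)
pi = pi_first(c, n)

# Brute-force inclusion probabilities obtained by summing p over samples.
pi_brute = [sum(v for s, v in p.items() if k in s) for k in range(N)]
pi2_brute = {(j, k): sum(v for s, v in p.items() if j in s and k in s)
             for j, k in combinations(range(N), 2)}
```

A negative coefficient is allowed here because the sum of the n smallest coefficients is still nonnegative, which is the condition for all sample probabilities to be nonnegative.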
To sample from a linear sampling design, we may use the well-known acceptance-rejection (AR) technique.
A constant A such that $p(x) \le A\, p_0(x)$ for all $x \in S_n$, where $p_0(x) = 1/\binom{N}{n}$ is the SRSWOR probability function, is first found. Assuming that the $c_k$s are in increasing order, it suffices to put $A = (N/n) \sum_{k=N-n+1}^{N} c_k$.
Then we generate a tentative sample x according to SRSWOR of size n and a random number $U \sim U(0,1)$.
The sample is accepted as a sample from the linear design if $U \le p(x)/(A\, p_0(x)) = \sum_{k=1}^N c_k x_k \big/ \sum_{k=N-n+1}^{N} c_k$.
The full procedure is repeated until a sample is accepted. The acceptance rate equals $1/A$. As will be seen later on, other sampling techniques also exist.
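The acceptance-rejection steps can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the coefficients are assumed to sum to 1, and the dominating constant is taken as N/n times the sum of the n largest coefficients, which bounds the ratio of the linear-design probability to the SRSWOR probability.

```python
import random

def ar_sample_linear(c, n, rng):
    """Draw one sample (sorted unit indices) from the linear design with
    coefficients c by acceptance-rejection against SRSWOR."""
    N = len(c)
    bound = sum(sorted(c)[N - n:])          # sum of the n largest coefficients
    while True:
        s = rng.sample(range(N), n)         # tentative SRSWOR sample
        # Accept with probability sum_{k in s} c_k / bound.
        if rng.random() * bound < sum(c[k] for k in s):
            return sorted(s)

def acceptance_rate(c, n):
    """Theoretical acceptance rate 1/A with A = (N/n) * (sum of n largest c_k)."""
    N = len(c)
    return n / (N * sum(sorted(c)[N - n:]))

rng = random.Random(2014)                   # seeded only for reproducibility
c = [0.1, 0.2, 0.3, 0.4]                    # hypothetical coefficients, sum 1
sample = ar_sample_linear(c, 2, rng)
```

The strict inequality in the acceptance test ensures that a tentative sample with probability zero under the linear design is never accepted.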
There are two special linear sampling designs: the Midzuno design and the complementary Midzuno design.
The Midzuno design has the probability function $p(x) = \sum_{k=1}^N p_k x_k \big/ \binom{N-1}{n-1}$, $x \in S_n$, with $\sum_{k=1}^N p_k = 1$ and $p_k \ge 0$ for all k. Evidently it is a linear design. We may sample from the design by selecting one unit according to the probabilities $p_k$ and then $n-1$ further units according to SRSWOR from the remaining $N-1$ ones. The design was introduced in [1] (considering the $p_k$s as proportional to auxiliary variables). Tillé ([4], p. 117) generalizes it and gives many further references concerning the design. Brewer and Hanif ([5], p. 25) remark that the inequalities on the inclusion probabilities are restrictive.
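The complementary Midzuno design (see [2]) has the probability function
$$p(x) = \sum_{k=1}^N q_k (1 - x_k) \Big/ \binom{N-1}{n}, \quad x \in S_n,$$
with $\sum_{k=1}^N q_k = 1$ and $q_k \ge 0$ for all $k$. This design is considered in [6] with the $q_k$s proportional to auxiliary variables. We may sample from the design by removing one unit according to the probabilities $q_k$ and sample $n$ units by SRSWOR among the remaining $N-1$ ones. For the complementary Midzuno design, we then get
$$\pi_k = \frac{n}{N-1}\,(1 - q_k), \qquad \pi_{jk} = \frac{n(n-1)}{(N-1)(N-2)}\,(1 - q_j - q_k), \quad j \ne k.$$
The formula for the second-order inclusion probabilities coincides with (3). Since $\sum_{k=1}^N q_k (1 - x_k) = \sum_{k=1}^N (1/n - q_k)\, x_k$ for $x \in S_n$, we also see that the complementary Midzuno design is a linear design with coefficients $c_k = (1 - n q_k)/(N - n)$ with sum 1. If $q_k \le 1/n$ for all k, it is even a Midzuno design and hence the two designs overlap each other.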
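The two sampling procedures just described can be sketched in a few lines. This is a minimal illustration with hypothetical function names; the selection probabilities are assumed to be nonnegative with sum 1, and the conversion function uses the relation c_k = (1 - n q_k)/(N - n) that follows from rewriting the complementary Midzuno probability function on samples of fixed size n.

```python
import random

def midzuno_sample(p, n, rng):
    """Select one unit with probabilities p, then n-1 further units by SRSWOR
    among the remaining N-1 units."""
    N = len(p)
    first = rng.choices(range(N), weights=p, k=1)[0]
    rest = [k for k in range(N) if k != first]
    return sorted([first] + rng.sample(rest, n - 1))

def complementary_midzuno_sample(q, n, rng):
    """Remove one unit with probabilities q, then select n units by SRSWOR
    among the remaining N-1 units."""
    N = len(q)
    removed = rng.choices(range(N), weights=q, k=1)[0]
    rest = [k for k in range(N) if k != removed]
    return sorted(rng.sample(rest, n))

def complementary_to_linear(q, n):
    """Coefficients of the linear design that coincides with the
    complementary Midzuno design with removal probabilities q."""
    N = len(q)
    return [(1 - n * qk) / (N - n) for qk in q]

rng = random.Random(1952)                   # seeded only for reproducibility
p = [0.1, 0.2, 0.3, 0.4]                    # hypothetical selection probabilities
q = [0.4, 0.3, 0.2, 0.1]                    # hypothetical removal probabilities
s_mid = midzuno_sample(p, 2, rng)
s_comp = complementary_midzuno_sample(q, 2, rng)
c_from_q = complementary_to_linear(q, 2)
```

In this example every q_k is at most 1/n, so the resulting coefficients are nonnegative and the design is at the same time a Midzuno design.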
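Remark 1. Let $\pi_1, \pi_2, \dots, \pi_N$ have sum n, i.e. $\sum_{k=1}^N \pi_k = n$. Then, by (2), these numbers are inclusion probabilities for a Midzuno design iff (a) $\pi_k \ge (n-1)/(N-1)$ for all $k$. Obviously (a) implies the less restrictive inequality in Theorem 1. The $\pi_k$s are inclusion probabilities of a complementary Midzuno design iff (b) $\pi_k \le n/(N-1)$ for all $k$. Of course, also (b) implies the inequality in Theorem 1. In fact, assuming the contrary that $\sum_{k=1}^n \pi_k < n(n-1)/(N-1)$ (with the $\pi_k$s in increasing order), we get by (b) that $\sum_{k=1}^N \pi_k < n(n-1)/(N-1) + (N-n)\,n/(N-1) = n$, which is a contradiction as the sample size is n.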
The family of linear designs can be considered as closed with respect to sample complements. More precisely, the complement of any sample of size $n$ from a linear design is a sample of size $N-n$ from another linear design. This follows from the relation $\sum_{k=1}^N c_k x_k = \sum_{k=1}^N \left( \frac{1}{N-n} - c_k \right)(1 - x_k)$, $x \in S_n$.
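Another interesting property of the family of linear designs is that it is closed with respect to conditioning on the sampling outcomes for specific units. For instance, if we know that unit 1 is selected (i.e. $I_1 = 1$), then the probability function for the remaining size $n-1$ sampling among $N-1$ units is still linear. In fact, given that $I_1 = 1$ and that $\pi_1 > 0$, we have by (1) that $c_1 > -(n-1)/(N-n)$. The conditional probability function of $(I_2, \dots, I_N)$ is then given by
$$p(x_2, \dots, x_N \mid x_1 = 1) \propto c_1 + \sum_{k=2}^N c_k x_k = \sum_{k=2}^N \left( c_k + \frac{c_1}{n-1} \right) x_k, \quad \sum_{k=2}^N x_k = n-1.$$
The new coefficients have sum $1 + c_1 (N-n)/(n-1) > 0$ by the inequality for $c_1$ above. Normalizing them, we get coefficients $c_k^{(1)}$, $k = 2, \dots, N$, with sum 1. If it is known that unit 1 is not selected ($I_1 = 0$ and $\pi_1 < 1$), the new coefficients are simply given by $c_k^{(0)} = c_k/(1 - c_1)$, $k = 2, \dots, N$.
It follows that samples from a linear design can easily be generated list-sequentially. The first unit in the population is sampled with probability $\pi_1$. If it is selected, then the second unit is sampled with probability $\pi_2^{(1)} = \frac{N-n}{N-2}\, c_2^{(1)} + \frac{n-2}{N-2}$. Else the second unit is sampled with probability $\pi_2^{(0)} = \frac{N-n-1}{N-2}\, c_2^{(0)} + \frac{n-1}{N-2}$, etc.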
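The list-sequential procedure can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the coefficients are assumed to sum to 1 and to define a valid design, and the update rules are the conditioning rules for a linear design (add c_1/(r-1) to each remaining coefficient and renormalize if the unit is selected; divide by 1 - c_1 if it is not). The helper `path_probability` multiplies the conditional probabilities along a given outcome so that the procedure can be compared with the probability function itself.

```python
from fractions import Fraction
from itertools import combinations
from math import comb
import random

def step(c, r):
    """For the first listed unit of a linear design with coefficients c and
    r units still to select (0 < r < len(c)): return its inclusion
    probability and the updated coefficients if selected / not selected."""
    m, c1, rest = len(c), c[0], c[1:]
    pi1 = Fraction(m - r, m - 1) * c1 + Fraction(r - 1, m - 1)
    sel = None
    if pi1 > 0:
        if r == 1:
            sel = rest                       # nothing more will be selected
        else:
            total = 1 + c1 * Fraction(m - r, r - 1)
            sel = [(ck + c1 / (r - 1)) / total for ck in rest]
    not_sel = [ck / (1 - c1) for ck in rest] if pi1 < 1 else None
    return pi1, sel, not_sel

def list_sequential_linear(c, n, rng):
    """Return the 0/1 inclusion vector of one list-sequential sample."""
    c, r, x = list(c), n, []
    while c:
        if r == 0 or r == len(c):            # remaining outcomes are forced
            x += [1 if r else 0] * len(c)
            break
        pi1, sel, not_sel = step(c, r)
        if rng.random() < pi1:
            x.append(1); c = sel; r -= 1
        else:
            x.append(0); c = not_sel
    return x

def path_probability(c, n, x):
    """Product of the conditional probabilities that generate the vector x."""
    c, r, prob = list(c), n, Fraction(1)
    for xk in x:
        if prob == 0 or r == 0 or r == len(c):
            break
        pi1, sel, not_sel = step(c, r)
        if xk:
            prob *= pi1; c = sel; r -= 1
        else:
            prob *= 1 - pi1; c = not_sel
    return prob

c = [Fraction(v, 20) for v in (-1, 2, 4, 6, 9)]    # hypothetical coefficients
N, n = len(c), 3
x = list_sequential_linear(c, n, random.Random(1))
all_x = [[1 if k in s else 0 for k in range(N)] for s in combinations(range(N), n)]
```

For the example coefficients, the product of conditional probabilities along each possible outcome reproduces the linear probability function, which is what makes the list-sequential method exact.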
3. A Mixture Result
A linear design can also be called a mixed Midzuno design because of the following theorem.
Theorem 2. A linear sampling design $p(x) = \sum_{k=1}^N c_k x_k \big/ \binom{N-1}{n-1}$, $x \in S_n$, where $\sum_{k=1}^N c_k = 1$, is a mixture of a Midzuno design and a complementary Midzuno design. It is a Midzuno design if $c_k \ge 0$ for all k and it is a complementary one if $c_k \le 1/(N-n)$ for all k.
The components of the mixture are not unique. The very last statement in the theorem follows from the re-writing $\sum_{k=1}^N c_k x_k = \sum_{k=1}^N \left( \frac{1}{N-n} - c_k \right)(1 - x_k)$, $x \in S_n$. The full proof of the theorem is somewhat technical and is given in an appendix. The proof yields the following procedure for finding suitable components of the mixture. It is assumed that the design is not a pure Midzuno design and that the units in the population are ordered such that
where
(since
).
Procedure:
Let While
let
(As the final
is less than
)
Then put,
, and
Finally, let the parameters for the components be defined by
The first component, the Midzuno one, is used with probability and the second component, the complementary Midzuno one, is used with probability
Example 1. Let,
, and
Obviously
It is not possible to use
since
The procedure gives us
and
and
and
However, since
the design is also a pure complementary Midzuno design with
and hence
Theorem 2 implies another simple way of generating samples from the linear design. First it is decided by a random choice whether a Midzuno or a complementary Midzuno design should be used. Then one of these designs is applied. In Tillé’s ([4] , pp. 99-104) terminology, we can see this technique as a special splitting technique which after two steps ends with SRSWOR.
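The two-step technique can be sketched as follows. This sketch does not implement the procedure for finding the mixture components; instead the mixing probability `lam` and the component parameters `p` and `q` are hypothetical inputs supplied by the user. It only illustrates that a mixture of a Midzuno design and a complementary Midzuno design is a linear design, with coefficients lam * p_k + (1 - lam) * (1 - n q_k)/(N - n), and how to sample from it.

```python
from fractions import Fraction
from itertools import combinations
from math import comb
import random

def mixture_coefficients(lam, p, q, n):
    """Linear-design coefficients of the mixture: Midzuno(p) with probability
    lam, complementary Midzuno(q) with probability 1 - lam."""
    N = len(p)
    return [lam * pk + (1 - lam) * (1 - n * qk) / (N - n) for pk, qk in zip(p, q)]

def mixture_probability(lam, p, q, n, s):
    """Exact probability of the sample s (tuple of unit indices)."""
    N = len(p)
    midzuno = sum(p[k] for k in s) / comb(N - 1, n - 1)
    complementary = sum(q[k] for k in range(N) if k not in s) / comb(N - 1, n)
    return lam * midzuno + (1 - lam) * complementary

def sample_mixture(lam, p, q, n, rng):
    """First choose the component at random, then apply that design."""
    units = list(range(len(p)))
    if rng.random() < lam:                   # Midzuno component
        first = rng.choices(units, weights=[float(w) for w in p], k=1)[0]
        units.remove(first)
        return sorted([first] + rng.sample(units, n - 1))
    removed = rng.choices(units, weights=[float(w) for w in q], k=1)[0]
    units.remove(removed)                    # complementary Midzuno component
    return sorted(rng.sample(units, n))

# Hypothetical components for a population of size 4 and sample size 2.
lam = Fraction(1, 3)
p = [Fraction(v, 10) for v in (1, 2, 3, 4)]
q = [Fraction(v, 10) for v in (4, 3, 2, 1)]
n, N = 2, 4
c_mix = mixture_coefficients(lam, p, q, n)
s = sample_mixture(lam, p, q, n, random.Random(7))
```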
4. Quadratic Sampling Designs
There is a natural extension of the linear designs. In this section we look at fixed size designs with quadratic probability functions. We could use ordinary quadratic forms. However, it is more appropriate to sum over sets of size 2 than over pairs with ordered elements. Below $\sum_{i<j}$ means a sum over the $\binom{N}{2}$ sets $\{i, j\}$, $i \ne j$.
Definition 2. A sampling design of size $n \ge 2$ is called quadratic if there are real constants $d_{ij}$ with $d_{ij} = d_{ji}$ and $\sum_{i<j} d_{ij} = 1$ such that
$$p(x) \propto \sum_{i<j} d_{ij}\, x_i x_j, \quad x \in S_n.$$
In particular, if $n = 2$ then $d_{ij}$ is the probability to select the sample $\{i, j\}$. The normalization constant can be shown to be equal to the reciprocal of $\binom{N-2}{n-2}$. It can also be shown that the parameters $d_{ij}$ are uniquely determined by the design. To get $p(x) \ge 0$ for all $x \in S_n$ it is necessary and sufficient that $\sum_{i<j;\; i,j \in A} d_{ij} \ge 0$ for all subsets A of $\{1, 2, \dots, N\}$ of size n. This means that the possible parameters $d_{ij}$ form a convex set in $\mathbb{R}^M$ where $M = \binom{N}{2}$. However, this set is rather complicated if n is not very small.
For $n \ge 2$ a linear design can also be seen as a quadratic design, because $\sum_{k=1}^N c_k x_k = \frac{1}{n-1} \sum_{i<j} (c_i + c_j)\, x_i x_j$, $x \in S_n$. There are three natural quadratic designs with their basic parameters symmetric and nonnegative:
(a) $p(x) \propto \sum_{i<j} a_{ij}\, x_i x_j$,
(b) $p(x) \propto \sum_{i<j} b_{ij}\, (1 - x_i)(1 - x_j)$,
(c) $p(x) \propto \sum_{i<j} c_{ij}\, [x_i (1 - x_j) + (1 - x_i) x_j]$.
Assuming that $\sum_{i<j} a_{ij} = \sum_{i<j} b_{ij} = \sum_{i<j} c_{ij} = 1$ and $2 \le n \le N-2$, we readily find that for the three cases the normalization constants are the reciprocals of $\binom{N-2}{n-2}$, $\binom{N-2}{n}$, and $2\binom{N-2}{n-1}$, respectively. It is not clear that the designs (b) and (c) are quadratic according to Definition 2 but it will be obvious later on. It will also follow that the complement of a sample from a quadratic design of size n can be seen as a sample from a quadratic design of size $N-n$.
The design (a) corresponds to selecting one pair of units according to the probabilities $a_{ij}$ and then $n-2$ further ones by SRSWOR. The design (b) was considered by Sinha in [3]. Remove one pair of units according to the probabilities $b_{ij}$ with sum 1 and then choose n units by SRSWOR among the remaining ones. The design (c) corresponds to selecting one pair according to the probabilities $c_{ij}$ and then keeping one of the units for the sample and removing the other one. Then $n-1$ units are selected by SRSWOR among the $N-2$ non-selected units.
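The three sampling procedures can be sketched as follows. This is a minimal illustration with hypothetical function names. The pair probabilities are stored in a dictionary keyed by pairs (i, j) with i < j and are assumed to be nonnegative with sum 1. For design (c) the sketch assumes that the unit kept from the selected pair is chosen with probability 1/2 each, which corresponds to the symmetric case. The function `exact_probability` gives the resulting probability of a sample under each procedure so that the normalization can be checked by enumeration.

```python
import random
from fractions import Fraction
from itertools import combinations
from math import comb

def draw_pair(w, rng):
    """Draw one pair (i, j), i < j, according to the pair probabilities w."""
    pairs = list(w)
    return rng.choices(pairs, weights=[float(w[pr]) for pr in pairs], k=1)[0]

def sample_pair_design(kind, w, N, n, rng):
    """Sample from design (a), (b) or (c) via its pair-based procedure."""
    i, j = draw_pair(w, rng)
    others = [k for k in range(N) if k not in (i, j)]
    if kind == "a":                          # keep both, add n-2 by SRSWOR
        return sorted([i, j] + rng.sample(others, n - 2))
    if kind == "b":                          # remove both, select n by SRSWOR
        return sorted(rng.sample(others, n))
    keep = i if rng.random() < 0.5 else j    # (c): keep one, remove the other
    return sorted([keep] + rng.sample(others, n - 1))

def exact_probability(kind, w, N, n, s):
    """Probability of the sample s implied by the procedure above."""
    s = set(s)
    if kind == "a":
        return sum(v for (i, j), v in w.items() if i in s and j in s) / comb(N - 2, n - 2)
    if kind == "b":
        return sum(v for (i, j), v in w.items() if i not in s and j not in s) / comb(N - 2, n)
    return sum(v for (i, j), v in w.items() if (i in s) != (j in s)) / (2 * comb(N - 2, n - 1))

# Hypothetical pair probabilities for N = 5 units; they sum to 1.
N, n = 5, 3
w = {(i, j): Fraction(i + j + 1, 50) for i, j in combinations(range(N), 2)}
rng = random.Random(1973)                    # seeded only for reproducibility
samples = {kind: sample_pair_design(kind, w, N, n, rng) for kind in "abc"}
totals = {kind: sum(exact_probability(kind, w, N, n, s)
                    for s in combinations(range(N), n)) for kind in "abc"}
```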
Remark 2. An extension of the design (c) would be to let $p(x) \propto \sum_{i \ne j} c_{ij}\, x_i (1 - x_j)$ with $c_{ij} \ge 0$ and $\sum_{i \ne j} c_{ij} = 1$ but not necessarily $c_{ij} = c_{ji}$. In this case the units i and j have different roles. In the symmetric case the new parameters $c_{ij}$ equal the previous ones multiplied by 1/2. This extension is not considered further here.
Using the facts that for
and
and
it is not difficult to see that also the designs (b) and (c) are indeed quadratic designs according to Definition 2 with parameters that are not necessarily nonnegative. More explicitly we have, with
One can then realize that where
(4)
Further
Given parameters with sum 1 we may solve the equations
for possible nonnegative solutions
and the equations
for possible nonnegative solutions
In fact, we have
(5)
where and
This result is a consequence of formula (4) but with the sample of size
replaced by its complement of size
Somewhat more complicated calculations show that, at least if
(6)
This gives us the following theorem.
Theorem 3. A quadratic sampling design with
is a design of type (a) iff
It is a design of type (b) (Sinha design) iff
It is a design of type (c) iff
It can happen that a design is of all the three types simultaneously. For example, SRSWOR is such a design.
Example 2. Let and let the design be defined by
By (5) and (6), the corresponding b- and c-parameters are then given by
Thus in this case the design is not of type (a) or (c) but of type (b).
In the linear case we saw that a general linear design can be represented as a mixture of a Midzuno and a complementary Midzuno design. It would be desirable that a general quadratic design could be represented as a mixture of the three designs (a), (b), and (c) (or its extension). However, such a result is not likely to hold.
For the design (b) the second-order inclusion probabilities (which determine the first-order ones since
) are easy to find. We have
It is also easy to find the b-parameters that correspond to In fact,
This result is due to Sinha [3] but more generally also a special case of Lemma 1 below. Sinha used his result to find a method to sample with prescribed second-order inclusion probabilities. However, for the method to work with nonnegative strong restrictions on the
s are needed.
Lemma 1. Let be given numbers with
Then the equations
(7)
in the unknowns have the solution
(8)
We only have to put to get the earlier solution
from the lemma. The proof of Lemma 1 is given in the appendix.
For a design given by the formulas for the second-order inclusion probabilities
are somewhat complicated. However, we can first find
from formula (5) and then use the
formula above. The inverse procedure
can also be used to find d-parameters corresponding to given second-order inclusion probabilities. The “mixed case” (c) can be handled in the same way. A computer program makes these procedures simple to use.
To sample from the general quadratic design, we could use an AR-technique by dominating by a multiple of
However, it is probably simpler to use list-sequential sampling. In fact, given
(0 or 1), the probability function for the remaining sample of size
or
is again quadratic, cf. Section 2. But we must recalculate parameters (coefficients) and calculate first-order inclusion probabilities. The first-order inclusion probability
is given by (assuming that the d-sum is 1)
Given that with
the new d-coefficients (with sum 1) are just
If and
the new d-coefficients (with sum 1) are instead given by
For the conditional design is a linear design with parameters proportional to
5. Additional Comments
As has recently become more common in sampling articles, we have used sampling indicators and focused on their probability function. For this reason, the names linear design and quadratic design are very natural.
There are several advantages of the linear design: there are simple relations between the parameters and the inclusion probabilities of first and second order. It is also easy to sample. Thus the design is easy to use and for the Horvitz-Thompson estimate of a population total, we can also readily find a variance estimate. The drawback of the design is its lack of sufficient flexibility: although much more general than SRSWOR, it is not able to cover all possible first-order inclusion probabilities.
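As an illustration of how readily estimation works for the linear design, the sketch below computes the Horvitz-Thompson estimate and a variance estimate on a toy population. It is not taken from the paper: the function names, coefficients and study variable are hypothetical, and the variance estimator used is the standard Sen-Yates-Grundy estimator for fixed size designs, which is an assumption about which variance estimate one would choose. The inclusion probabilities are computed from the coefficients through the closed-form relations for the linear design.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def inclusion_probabilities(c, n):
    """First- and second-order inclusion probabilities of the linear design."""
    N = len(c)
    pi = [Fraction(N - n, N - 1) * ck + Fraction(n - 1, N - 1) for ck in c]
    pi2 = {(j, k): Fraction(n - 1, N - 2) * (pi[j] + pi[k])
                   - Fraction(n * (n - 1), (N - 1) * (N - 2))
           for j, k in combinations(range(N), 2)}
    return pi, pi2

def ht_estimate(y, s, pi):
    """Horvitz-Thompson estimate of the population total from sample s."""
    return sum(y[k] / pi[k] for k in s)

def syg_variance_estimate(y, s, pi, pi2):
    """Sen-Yates-Grundy variance estimate (requires all pi2 > 0)."""
    return sum((pi[j] * pi[k] - pi2[j, k]) / pi2[j, k]
               * (y[j] / pi[j] - y[k] / pi[k]) ** 2
               for j, k in combinations(sorted(s), 2))

# Hypothetical toy population: coefficients sum to 1, y is the study variable.
c = [Fraction(v, 15) for v in (1, 2, 3, 4, 5)]
y = [3, 1, 4, 1, 5]
N, n = len(c), 3
pi, pi2 = inclusion_probabilities(c, n)
design = {s: sum(c[k] for k in s) / comb(N - 1, n - 1)
          for s in combinations(range(N), n)}

# Exact expectations over all samples, for checking unbiasedness.
mean_ht = sum(p * ht_estimate(y, s, pi) for s, p in design.items())
var_ht = sum(p * (ht_estimate(y, s, pi) - sum(y)) ** 2 for s, p in design.items())
mean_var_est = sum(p * syg_variance_estimate(y, s, pi, pi2) for s, p in design.items())
```

Because rational arithmetic is used, the expectation of the estimate and of the variance estimate can be compared exactly with the population total and the true design variance.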
The quadratic design is more complicated than the linear one. By using a quadratic design, it is possible to sample with prescribed second-order inclusion probabilities. However, it cannot be used for all such possible inclusion probabilities and therefore this design is also not flexible enough.
To get more flexibility, the linear and quadratic functions need to be complemented by binomial factors. To sample with prescribed first-order inclusion probabilities, we may generally use a probability function of the form
where the parameters are suitably chosen. At least three well-known designs are of this form (with nonnegative): the conditional Poisson design, the Sampford design, and the Pareto design, cf. [7] . For these three cases, the linear function
is proportional to
with nonnegative parameters
To sample with prescribed second-order inclusion probabilities, we may use a design with
Letting (the desired first-order inclusion probabilities) and using Lemma 1, it is possible to calculate the suitable parameters
without much effort. Details will be presented in a planned forthcoming paper. This design is much more flexible than the quadratic design, but it does not have full flexibility. A fully flexible design uses a probability function of the form
but it is not easy to find the parameters [8] .
Acknowledgements
This research was supported by the Estonian Science Foundation grant 8789.
References
[1] Midzuno, H. (1952) On the Sampling System with Probability Proportional to Sum of Sizes. Annals of the Institute of Statistical Mathematics, 3, 99-107. http://dx.doi.org/10.1007/BF02949779
[2] Bondesson, L. and Traat, I. (2013) On Sampling Designs with Ordered Conditional Inclusion Probabilities. Scandinavian Journal of Statistics, 40, 724-733. http://dx.doi.org/10.1111/sjos.12024
[3] Sinha, B.K. (1973) On Sampling Schemes to Realize Preassigned Sets of Inclusion Probabilities of First Two Orders. Calcutta Statistical Association Bulletin, 22, 89-110.
[4] Tillé, Y. (2006) Sampling Algorithms. Springer, New York.
[5] Brewer, K.R.W. and Hanif, M. (1983) Sampling with Unequal Probabilities. Lecture Notes in Statistics, No. 15, Springer-Verlag, New York. http://dx.doi.org/10.1007/978-1-4684-9407-5
[6] Wywiał, J. (2000) On Precision of Horvitz-Thompson Strategies. Statistics in Transition, 4, 779-798.
[7] Bondesson, L., Traat, I. and Lundqvist, A. (2006) Pareto Sampling versus Conditional Poisson and Sampford Sampling. Scandinavian Journal of Statistics, 33, 699-720. http://dx.doi.org/10.1111/j.1467-9469.2006.00497.x
[8] Bondesson, L. (2012) On Sampling with Prescribed Second-Order Inclusion Probabilities. Scandinavian Journal of Statistics, 39, 813-829. http://dx.doi.org/10.1111/j.1467-9469.2012.00808.x
Appendix
Here proofs are given of Theorem 2 and Lemma 1.
Proof of Theorem 2. Excluding the case that for all
and then ordering the
s as
where
we shall show that
(9)
where and
and
and
and
,
. Re-writing then
as
(with
), we have the desired mixture representation
We shall show that (9) can be achieved by putting for some such that
and
for
and
and
for
To get
and
we must have
and
To get
and
we must have
(10)
(Then there is no problem to choose such that
) We start by trying
Since
and
the first inequality in (10) is certainly satisfied with strict inequality. If also the second one is satisfied, we can stop and use
Otherwise, the second inequality is not satisfied and then
Equivalently
(and
). If now
we can stop and put
Otherwise
Equivalently
If now
we can stop and put
The procedure may continue until
But
(as
for all
) and we can then put
and stop.
Proof of Lemma 1. Let and
Then (7) can be re-expressed as
It follows that
and hence Thus
Summing then over
and
we get
and hence
This shows that the solution must be of the form (8). It is not difficult to check that (8) also is a solution.