outliers are indicated by their associated α0 value.

find the value of k by using a predetermined p*, the probability of having at least r* good subsamples among the k subsamples. In view of the breakdown probability function, this amounts to selecting k by requiring

BP(k) = 1 − p*,

which is a condition imposed on BP at a single point. An alternative way of selecting k is to impose a stronger condition on BP over some interval of interest.

Note that, everything else being equal, we can get a more robust SUE by using a larger k. For practical applications, however, we caution that a very large k will compromise both the computational efficiency of the subsampling algorithm and the efficiency of the combined sample S_g as an estimator of the good data set. The latter point is due to the fact that, in practice, the r* subsamples forming S_g are not independent random samples from the good data set; in the extreme case where k goes to infinity, the subsample with the smallest γ-score will appear infinitely many times, and thus all subsamples in the union S_g are repeats of this same subsample. This leads to the lowest efficiency for S_g, with E(|S_g|) = n_s ≪ N as k → ∞. Thus when selecting the k value, it is necessary to balance the robustness and the efficiency of the SUE.
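The interplay among k, r* and p* can be explored numerically. The sketch below is our own illustration (the function names are ours, not the paper's): it computes the probability that a single random subsample of size n_s avoids all m assumed outliers, and then searches for the smallest k that yields at least r* good subsamples with probability at least p*, treating the k subsamples as independent draws.

```python
from math import comb

def prob_good_subsample(n, m, ns):
    """Probability that a size-ns subsample drawn at random from n points,
    m of which are outliers, contains no outliers."""
    return comb(n - m, ns) / comb(n, ns)

def required_k(n, m, ns, r_star, p_star):
    """Smallest k such that at least r_star of k independent subsamples
    are good with probability >= p_star (binomial tail)."""
    pg = prob_good_subsample(n, m, ns)
    k = r_star
    while True:
        # P(fewer than r_star good subsamples among k)
        lower_tail = sum(comb(k, i) * pg**i * (1 - pg)**(k - i)
                         for i in range(r_star))
        if 1.0 - lower_tail >= p_star:
            return k
        k += 1
```

For the stackloss setting used later (n = 21, n_s = 11, m = 4), prob_good_subsample is only about 0.035, which is why values of k in the hundreds are needed.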

To conclude Section 3, we note that although the selection bias problem associated with the combined sample S_g can make the asymptotic theory of the SUE difficult, it has little impact on the breakdown robustness of the SUE. This is because, to study the breakdown of the SUE, we are only concerned with whether S_g contains any outliers. As such, the size of S_g and the possible dependence structure of points within it are irrelevant. Whereas in Section 3.1 we had to make the strong assumption that S_g is a random sample from the good data in order to borrow the asymptotic theory from the classical method m, here we do not need this assumption. Indeed, as we pointed out after (14), the breakdown results in this section are valid under only a weak assumption: that the γ criterion employed is capable of separating good subsamples from subsamples containing one or more arbitrarily large outliers. Any reasonable γ should be able to do so.

4. Applications of the Subsampling Method

In this section, we apply the subsampling method to three

real examples through which we demonstrate its usefulness and discuss issues related to its implementation. For the last example, we also include a small simulation study on the finite sample behaviour of the SUE.

Copyright © 2012 SciRes. OJS
M. TSAO, X. LING

An important issue concerning the implementation of the subsampling method which we have not considered in Section 2 is the selection of the classical method m and the goodness-of-fit criterion γ. For linear regression and non-linear regression models, the least squares estimation (LSE) method and the mean squared error (MSE) are good choices for m and γ, respectively, as the LSE and MSE are very sensitive to outliers in the data.

Outliers will lead to a poor fit by the LSE, resulting in a

large MSE. Thus a small value of the MSE means a good

fit. For logistic regression and Poisson regression models, the maximum likelihood estimation (MLE) method and the deviance (DV) can be used as m and γ, respectively. The MLE and DV are also sensitive to outliers. A good fit should have a small DV; if the ratio DV/(n − p) is much larger than 1, then the fit is not good.
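For grouped binomial data, the deviance can be computed directly from fitted probabilities. The function below is our own sketch (the paper gives no code); comparing its result with the residual degrees of freedom gives the rough DV/(n − p) check just described.

```python
import numpy as np

def binomial_deviance(successes, totals, p_hat):
    """Residual deviance for grouped binomial data given fitted
    probabilities p_hat (each strictly between 0 and 1)."""
    y = np.asarray(successes, float)
    n = np.asarray(totals, float)
    mu = n * np.asarray(p_hat, float)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Zero counts contribute zero to the deviance by convention.
        t1 = np.where(y > 0, y * np.log(y / mu), 0.0)
        t2 = np.where(n - y > 0, (n - y) * np.log((n - y) / (n - mu)), 0.0)
    return 2.0 * float(np.sum(t1 + t2))
```

A saturated fit (p_hat equal to the observed proportions) gives deviance zero; a poor fit inflates it well above the degrees of freedom.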

Another important issue is the proper selection of the working proportion of outliers or, equivalently, the (estimated) number of outliers m in the sample. This is needed to determine r* and k and to run the subsampling algorithm. Ideally, the selected m value should be slightly above the true number of outliers, as this will lead to a robust and efficient SUE. If we have some information about the proportion of outliers in the sample, such as a tight upper bound, we can use this information to select m. In the absence of such information, we may use several values for m to compute the SUE and identify the most appropriate value for the data set in question. For m values above the true number of outliers, the SUE will give consistent estimates for the model parameters. Residual plots will also look consistent in terms of which points on the plots appear to be outliers. We now further illustrate these issues in the examples below.
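The preceding discussion can be made concrete with a minimal sketch of the basic subsampling algorithm for linear regression, using the LSE as the classical method m and the MSE as the criterion γ. This is our own illustration, not the authors' code; the parameters (n_s, r*, k) follow the paper's notation.

```python
import numpy as np

def sal_sue(X, y, ns, r_star, k, rng=None):
    """Sketch of the basic 5-step subsampling algorithm (SAL) with least
    squares as m and the MSE as gamma. Returns the SUE coefficients and
    the combined sample S_g as an index array."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])           # design with intercept
    scored = []
    for _ in range(k):
        idx = rng.choice(n, size=ns, replace=False)  # one random subsample
        beta, *_ = np.linalg.lstsq(Xd[idx], y[idx], rcond=None)
        resid = y[idx] - Xd[idx] @ beta
        mse = resid @ resid / (ns - p - 1)           # gamma-score
        scored.append((mse, idx))
    scored.sort(key=lambda t: t[0])                  # gamma-ordering
    good = set()
    for _, idx in scored[:r_star]:                   # union of r* best
        good.update(idx.tolist())
    sg = np.array(sorted(good))
    beta_sue, *_ = np.linalg.lstsq(Xd[sg], y[sg], rcond=None)
    return beta_sue, sg
```

The final least squares fit on the combined sample S_g is the SUE.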

Example 2: Linear model for stackloss data

The well-known stackloss data from Brownlee [7] has 21 observations on four variables concerning the operation of a plant for the oxidation of ammonia to nitric acid. The four variables are stackloss (y), air flow rate (x1), cooling water inlet temperature (x2) and acid concentration (x3). We wish to fit a multiple linear regression model,

y = β0 + β1·x1 + β2·x2 + β3·x3 + ε,

to this data. We use the LSE and MSE for m and γ, respectively, in the SUE. We also try three m values, m = 2, 4 and 6, which represent roughly 10%, 20% and 30% working proportions of outliers in the data. The subsample size is chosen to be the default size of n_s = 11. The corresponding values for r* and k in the SAL and the estimates for the regression parameters are given in Table 2. For comparison, Table 2 also includes the estimates given by the LSE and the MME, a robust estimator introduced in [6]. The residual versus fitted value plots for the LSE and SUE are in Figure 5. Since the regression parameter estimates for the SUE with m = 4 and m = 6

Figure 5. Residual versus fitted value plots for Example 2: (a) LSE; (b) SUE with m = 2; (c) SUE with m = 4; (d) SUE with m = 6. The dashed lines at ±3σ̂ are used to identify outliers.


Table 2. Regression parameter estimates with standard errors in brackets for Example 2.

parameter | LSE | MME | SUE (m = 2; r* = 6, k = 57) | SUE (m = 4; r* = 5, k = 327) | SUE (m = 6; r* = 4, k = 2593)
β0 | −39.92 (11.90) | −41.52 (5.30) | −38.59 (10.93) | −37.65 (4.73) | −36.72 (3.65)
β1 | 0.72 (0.13) | 0.94 (0.12) | 0.77 (0.13) | 0.80 (0.07) | 0.84 (0.05)
β2 | 1.30 (0.37) | 0.58 (0.26) | 1.09 (0.35) | 0.58 (0.17) | 0.45 (0.13)
β3 | −0.15 (0.16) | −0.11 (0.07) | −0.16 (0.14) | −0.07 (0.06) | −0.08 (0.05)
σ̂ | 3.24 | 1.91 | 2.97 | 1.25 | 0.93
sample size | n = 21 | n = 21 | n_e = 20 | n_e = 17 | n_e = 15

are consistent and the corresponding residual plots identify the same 4 outliers, m = 4 is the most reasonable choice. The effective sample size for S_g when m = 4 is n_e = 17, and hence this S_g includes all the good data points in the data and the SUE is the most efficient. It is clear from Table 2 and Figure 5 that the LSE and the SUE with m = 2 fail to identify any outliers and their estimates are influenced by the outliers.

The robust MME identifies two outliers, and its estimates for β1, β2 and β3 are slightly different from those given by the SUE with m = 4. Since the MME is usually biased in the estimation of the intercept β0, the estimate of β0 from the MME is quite different. This data set has been analysed by many statisticians, for example, Andrews [8], Rousseeuw and Leroy [3] and Montgomery et al. [9]. Most of these authors concluded that there are four outliers in the data (observations 1, 3, 4 and 21), which is consistent with the result of the SUE with m = 4.

Note that the SUE is based on the combined sample, which is a trimmed sample. A large m value assumes more outliers and leads to heavier trimming and hence a smaller combined sample. This is seen from the SUE with m = 6, where the effective sample size n_e is 15 instead of 17 for m = 4. Consequently, the resulting estimate of the error variance is lower than that for m = 4. However, the estimates for the regression parameters under m = 4 and m = 6 are comparable, reflecting the fact that under certain conditions the SUEs associated with different parameter settings of the SAL algorithm are all unbiased.

Example 3: Logistic regression for coal miners data

Ashford [10] gives a data set concerning incidents of

severe pneumoconiosis among 371 coal miners. The 371

miners are divided into 8 groups according to the years

of exposure at the coal mine. The values of three variables, "years" of exposure (denoted by x) for each group, "total number" of miners in each group, and the number of "severe cases" of pneumoconiosis in each

group, are given in the data set. The response variable of interest is the proportion of miners who have symptoms of severe pneumoconiosis (denoted by Y). The 8 group proportions of severe pneumoconiosis are plotted in Figure 6(a) with each circle representing one group. Since it is reasonable to assume that the corresponding number of severe cases for each group is a binomial random variable, on page 432 of [9] the authors considered a logistic regression model for Y, i.e.,

E(Y) = exp(β0 + β1·x) / [1 + exp(β0 + β1·x)].

To apply the subsampling method for logistic regression, we choose the MLE method and the deviance DV as m and γ, respectively. With N = 8 groups, we set m = 1 and 2 in the computation, and set the subsample size to n_s = 5. The corresponding values for r* and k are (r*, k) = (4, 23) and (r*, k) = (3, 76) for m = 1 and 2,

respectively. The original data set has no apparent out-

liers. In order to demonstrate the robustness of the SUE,

we created one outlier group by changing the number of

severe cases for the 27.5 years of exposure group from

original 8 to 18. Consequently, the sample proportion of

severe pneumoconiosis cases for this group has been

changed from the initial 8/48 to 18/48. Outliers such as

this can be caused, for example, by a typo in practice.

The sample proportions with this one outlier are plotted

in Figure 6(b). The regression parameter estimates from

various estimators are given in Table 3, where the M

method is the robust estimator from [11]. The fitted lines

given by the MLE, the SUE and the M method are also

plotted in Figure 6. For both data sets, the SUE with m = 1 and 2 gives the same result. The SUE does not

find any outliers for the original data set, while it

correctly identifies the one outlier for the modified data

set. For the original data set, the three methods give

almost the same estimates for the parameters, and this is

reflected by their fitted lines (of proportions) which are


Figure 6. Sample proportions and fitted proportions for Example 3: (a) The original data set with no outlier group; (b) Modified data with one outlier group. Note that in (a) the fitted lines are all the same for the MLE, SUE and M methods, while in (b) they are different.

Table 3. Regression parameter estimates with standard errors in brackets for Example 3.

parameter | Original data: MLE | With one outlier group: MLE | With one outlier group: M | With one outlier group: SUE
β0 | −4.80 (0.57) | −4.07 (0.46) | −4.80 (0.59) | −5.24 (0.70)
β1 | 0.09 (0.02) | 0.08 (0.01) | 0.09 (0.02) | 0.10 (0.02)

nearly the same as can be seen in Figure 6(a). For the

modified data set, the SUE and the M method are robust,

and their fitted lines are in Figure 6(b). The outlier has

little or no influence on these fitted lines.

Example 4: Non-linear model for enzymatic reaction data

To analyse an enzymatic reaction data set, one of the models that Bates and Watts [12] considered is the well-known Michaelis-Menten model, a non-linear model given by

y = β0·x/(β1 + x) + ε,   (16)

where β0 and β1 are regression parameters, and the error ε has variance σ². The data set has N = 12 observations of treated cases. The response variable y is the velocity of an enzymatic reaction and x is the substrate concentration in parts per million. To compute the SUE, we use the LSE method and the MSE for m and γ, respectively, and set m = 2 and n_s = 7, which lead to r* = 4 and k = 63. Figure 7(a) shows the scatter plot of y versus x and the fitted lines for the LSE (dotted) and SUE (solid), and Figure 7(b) shows the residual plot from the SUE fit. Since there is only one mild outlier in the data, the estimates from the LSE and SUE are similar, and they are reported in Table 4.

We also use model (16) to conduct a small simulation study to examine the finite sample distribution of the SUE. We generate 1000 samples of size N = 12 where, for each sample, 10 observations are generated from the model with β0 = 215, β1 = 0.07 and a normally distributed error with mean 0 and σ = 8. The other 2 observations are outliers generated from the same model but with a different error distribution: a normal error distribution with mean 30 and σ = 1. These two observations are thus outliers in the response, and Figure 7(c) shows a typical sample with different fitted lines for the LSE and SUE. The estimated parameter values are also reported in Table 4. For each sample, the SUE estimates are computed with n_s = 7, r* = 4 and k = 63. Figure 8 shows the histograms for β̂0, β̂1, σ̂ and the sample size of S_g. The dotted vertical lines are the true values of β0, β1 and σ, respectively. Table 5 shows the biases and standard errors for the LSE and SUE based on the simulation study. The distributions of the SUE estimators


Figure 7. Fitted regression lines for Example 4: (a) The LSE and SUE lines for the real data set; (b) SUE fit residual plot for the real data set; (c) The LSE and SUE lines for the simulated data with outliers; (d) SUE fit residual plot for the simulated data. Dotted lines in plots (b) and (d) are the ±3σ̂ lines.

Figure 8. Histograms for the SUE: (a) Estimator β̂0; (b) Estimator β̂1; (c) Estimator σ̂; (d) The sample size of S_g. The vertical dotted lines are the true values for β0, β1, σ and n.


Table 4. Regression estimates with the standard errors in brackets for Example 4.

parameter | Original data: LSE | Original data: SUE | Simulated data: LSE | Simulated data: SUE
β0 | 212.68 (6.95) | 216.62 (4.79) | 210.35 (9.03) | 217.30 (4.55)
β1 | 0.064 (0.008) | 0.072 (0.006) | 0.073 (0.014) | 0.069 (0.007)
σ̂ | 10.93 | 7.10 | 15.23 | 6.47

Table 5. Results for the simulation study.

parameter | LSE bias (s.e.) | SUE bias (s.e.)
β0 | −3.64 (8.84) | 1.64 (8.93)
β1 | 0.0059 (0.0238) | 0.0052 (0.0266)
σ | 5.70 (1.51) | −0.20 (2.71)

are approximately normal, and the biases are much smaller than those of the LSE. The distribution of the estimated variance also looks like a χ² distribution. The average effective sample size of S_g is 9.93, which is very close to the number of good data points, n = 10. There are a small number of cases where the effective sample size is 12. These are likely cases where the "outliers" generated are mild or benign outliers and are thus included in the combined sample.
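The simulation design just described is easy to reproduce. The sketch below is our own: the substrate range and the grid-search fit are illustrative assumptions (the paper does not list the design points or the optimizer), but the values β0 = 215, β1 = 0.07, σ = 8 and the N(30, 1) outlier errors follow the study above.

```python
import numpy as np

def michaelis_menten(x, b0, b1):
    """Model (16) without the error term."""
    return b0 * x / (b1 + x)

def simulate_sample(rng):
    """One sample of the study: 10 good points (sigma = 8) plus 2 outliers
    whose errors are N(30, 1). The substrate range is assumed."""
    xg = rng.uniform(0.02, 1.10, 10)
    yg = michaelis_menten(xg, 215.0, 0.07) + rng.normal(0, 8, 10)
    xo = rng.uniform(0.02, 1.10, 2)
    yo = michaelis_menten(xo, 215.0, 0.07) + rng.normal(30, 1, 2)
    return np.concatenate([xg, xo]), np.concatenate([yg, yo])

def lse_grid_fit(x, y):
    """Crude least squares fit of (16) by grid search, for illustration."""
    best = (np.inf, None, None)
    for b0 in np.linspace(150, 280, 131):
        for b1 in np.linspace(0.01, 0.20, 96):
            sse = float(np.sum((y - michaelis_menten(x, b0, b1)) ** 2))
            if sse < best[0]:
                best = (sse, b0, b1)
    return best[1], best[2]
```

In practice a Gauss-Newton or general-purpose nonlinear least squares routine would replace the grid search; the grid keeps the sketch self-contained.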

5. Secondary Criterion and Other Variations

of the Subsampling Algorithm

The 5-step subsampling algorithm SAL(n_s, r*, k) introduced in Section 2 is the basic version, which is straightforward to implement. In this section, we discuss modifications and variations which can improve its efficiency and reliability.

5.1. Alternative Stopping Criteria for Improving the Efficiency of the Combined Subsample S_g

In Step 5 of SAL(n_s, r*, k), the first r* subsamples in the γ-ordered sequence A_1, A_2, ..., A_k are identified as good subsamples and their union is taken to form S_g. However, it is clear from the discussion on parameter selection in Section 2.2 that there are likely more than r* good subsamples among the k generated by SAL(n_s, r*, k). When there are more than r* good subsamples, we want to use them all to form a larger and thus more efficient S_g. We now discuss two alternatives to the original Step 5 (referred to as Step 5a and Step 5b, respectively) that can take advantage of the additional good subsamples.

Step 5a: Suppose there is a cut-off point for the γ-scores, say γ_C, such that the jth subsample is good if and only if γ_j ≤ γ_C. Then we define the combined subsample as

S_g = ∪ { A_j : γ_j ≤ γ_C }.   (17)

Step 5b: Instead of a cut-off point, we can use the smallest score γ_1 as a reference point and take the union of all subsamples whose γ-scores are comparable to γ_1. That is, for a pre-determined constant c > 1, we define the combined subsample as

S_g = ∪ { A_j : γ_j ≤ c·γ_1 }.   (18)
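The two stopping rules can be sketched as follows (our own illustration; `scored` is assumed to be the list of (γ-score, subsample index set) pairs produced by the algorithm):

```python
def combine_by_threshold(scored, gamma_c):
    """Step 5a sketch: union of all subsamples whose gamma-score is at
    most the cut-off gamma_c."""
    good = set()
    for score, idx in scored:
        if score <= gamma_c:
            good.update(idx)
    return sorted(good)

def combine_by_ratio(scored, c):
    """Step 5b sketch: union of all subsamples whose gamma-score is
    within a factor c > 1 of the smallest score."""
    gamma1 = min(score for score, _ in scored)
    return combine_by_threshold(scored, c * gamma1)
```

Both rules reduce to the original Step 5 when the threshold happens to admit exactly the r* best subsamples.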

In both Steps 5a and 5b, the number of subsamples in the union is not fixed. The values of γ_C and c depend on n_s, n, γ and the underlying model. Selection of γ_C and c may be based on the distribution of the γ-scores of good subsamples.

If either Step 5a or 5b is used instead of the original Step 5, we need to ensure that the number of subsamples in the union in (17) or (18) is no less than r*. If it is less than r*, then the criterion based on γ_C or c may be too restrictive, and it is not having the desired effect of improving the efficiency of the original Step 5. It may also be that the number of good subsamples is less than r*, and in this case a re-run of SAL with a larger k is required.

Finally, noting that the r* subsamples making up S_g in Step 5 may not be distinct, another way to increase efficiency is to use k distinct subsamples. The number of good subsamples in a sequence of k distinct subsamples follows a hypergeometric (L_T, L_g, k) distribution, where L_g is the total number of good subsamples of size n_s and L_T the total number of subsamples of this size. Since L_T is usually much larger than k, the hypergeometric distribution is approximately a binomial distribution. Hence the required k for having r* good subsamples with probability p* is approximately the same as before.
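The binomial approximation to the hypergeometric count invoked here is easy to verify numerically; the sketch below (ours) compares the two probability mass functions when the population is much larger than the number drawn.

```python
from math import comb

def hyperg_pmf(x, N, K, n):
    """P(X = x) when x of n draws without replacement from a population
    of N containing K successes are successes."""
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)

def binom_pmf(x, n, p):
    """Binomial approximation, appropriate when N is much larger than n."""
    return comb(n, x) * p**x * (1 - p)**(n - x)
```

For a large population the two pmfs agree to several decimal places, which justifies reusing the earlier binomial calculation for k.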

5.2. Consistency of Subsamples and Secondary

Criterion for Improved Reliability of the

Subsampling Algorithm

Let β̂_i and β̂_j be the estimates given by (the method m applied to) the ith and jth subsamples, respectively, and let d(β̂_i, β̂_j) be a distance measure. We say that these two subsamples are inconsistent if d(β̂_i, β̂_j) > d_c, where d_c is a fixed positive constant. Conversely, we say that the two subsamples are consistent if d(β̂_i, β̂_j) ≤ d_c. Inconsistent subsamples may not be pooled into the combined sample S_g to estimate the unknown β.

Step 5 of SAL(n_s, r*, k) relies only on the γ-ordering of the subsamples to construct the combined sample S_g. In this and the two variations above, S_g is the union of (r* or a random number of) subsamples with the smallest γ-scores. However, an S_g constructed in this manner may fail to be a union containing only good subsamples, as a small γ-score is not a sufficient condition for a good subsample. One case of bad subsamples with small γ-scores we have mentioned previously is that of a bad subsample consisting entirely of outliers that also follow the same model as the good data but with a different β. In this case, the γ-score of this bad subsample may be very small. When it is pooled into S_g, outliers will be retained and the resulting SUE will not be robust. Another case is when a bad subsample consists of some good data points and some outliers but the model happens to fit this bad subsample well, resulting in a very small γ-score. In both cases, the γ criterion is unable to identify the bad subsamples, but the estimated β based on these bad subsamples can be expected to be inconsistent with that given by a good subsample.

To guard against such failures of the γ criterion, we use consistency as a secondary criterion to increase the reliability of SAL(n_s, r*, k). Specifically, we pool only subsamples that are consistent through a modified Step 5, Step 5c, given below.

Step 5c: Denote by β̂_i the estimated value of β based on A_i, where i = 1, 2, ..., k, and let d(β̂_i, β̂_j) be a distance measure between two estimated values. Take the union

S_g = ∪_{j=1}^{r*} A_{i_j},   (19)

where A_{i_1}, A_{i_2}, ..., A_{i_{r*}} (i_1 < i_2 < ... < i_{r*}) are the first r* consistent subsamples in A_1, A_2, ..., A_k, that is, subsamples satisfying

max_{1 ≤ j < l ≤ r*} d(β̂_{i_j}, β̂_{i_l}) ≤ d_c

for some predetermined constant d_c.

The γ criterion is the primary criterion of the subsampling algorithm as it provides the first ordering of the subsamples. A secondary criterion divides the γ-ordered sequence into consistent subsequences and performs a grouping action (instead of an ordering action) on the subsamples. In principle, we can switch the roles of the primary and secondary criteria, but this may substantially increase the computational difficulties of the algorithm. Additional criteria, such as the range of the elements of β̂, may also be added.

With the secondary criterion, the first r* subsamples whose union is taken in Step 5c have an increased chance of all being good subsamples. While it is possible that they are actually all bad subsamples of the same kind (consistent bad subsamples with small γ-scores), this is improbable in most applications. Empirical experience suggests that for a suitably chosen metric d(β̂_i, β̂_j), threshold d_c and a reasonably large subsample size n_s (say n_s = 0.5(N + 1)), the first r* subsamples are usually consistent, making the consistency check redundant. However, when the first r* subsamples are not consistent, and in particular when i_{r*} is much larger than r*, it is important to look for an explanation. Besides a poorly chosen metric or threshold, it is also possible that the data set actually comes from a mixture model. Apart from the original Step 5, a secondary criterion can also be incorporated into Step 5a or Step 5b.

Finally, there are other variations of the algorithm which may be computationally more efficient. One such variation, whose efficiency is difficult to measure but which is nevertheless worth mentioning here, requires only r* = 1 good subsample to start with. With r* = 1, the total number of subsamples required (k) is substantially reduced, leading to less computation. Once identified, the one good subsample is used to test points outside of it. All additional good points identified through the test are then combined with the good subsample to form a combined sample. The efficiency of such a combined sample is difficult to measure and it depends on the test. But computationally, this approach is in general more efficient since the testing step is computationally inexpensive. This approach is equivalent to a partial depth function based approach proposed in [13].
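For a linear model, the r* = 1 variant can be sketched as below. This is our own illustration: the paper leaves the choice of test open, so a simple ±3σ̂ residual test (hypothetical here) is used to grow the combined sample from the one good subsample.

```python
import numpy as np

def grow_from_one_subsample(Xd, y, good_idx, c=3.0):
    """Fit the model to one good subsample (design matrix Xd includes the
    intercept column), then add every point whose residual is within
    c * sigma_hat of the fitted line."""
    good_idx = np.asarray(good_idx)
    beta, *_ = np.linalg.lstsq(Xd[good_idx], y[good_idx], rcond=None)
    resid_sub = y[good_idx] - Xd[good_idx] @ beta
    sigma = np.sqrt(resid_sub @ resid_sub / (len(good_idx) - Xd.shape[1]))
    resid_all = y - Xd @ beta
    keep = np.abs(resid_all) <= c * sigma    # the inexpensive testing step
    keep[good_idx] = True                    # starting subsample always kept
    return np.flatnonzero(keep)
```

The testing step is a single matrix-vector product, which is why this variant is computationally cheap compared with drawing hundreds of subsamples.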

6. Concluding Remarks

It is of interest to note the connections among our sub-

sampling method, the bootstrap method and the method

of trimming outliers. With the subsampling method, we

substitute analytical treatment of the outliers (such as the

use of the ψ functions in the M-estimator) with computing power through the elimination of outliers by repeated fitting of the model to the subsamples. From this

point of view, our subsampling method and the bootstrap

method share a common spirit of trading analytical

treatment for intensive computing. Nevertheless, our sub-

sampling method is not a bootstrap method as our objec-

tive is to identify and then make use of a single combined

sample instead of making inference based on all boots-

trap samples. The subsampling based elimination of out-

liers is also a generalization of the method of trimming

outliers. Instead of relying on some measure of outlying-

ness of the data to decide which data points to trim, the subsampling method uses a model based trimming by identifying subsamples containing outliers through fitting the model to the subsamples. The outliers are in this case outliers with respect to the underlying regression model instead of some measure of central location, and they are only implicitly identified as being part of the data points not in the final combined sample.

The subsampling method has two main advantages: it is easy to use and it can be used for any regression model to produce robust estimation for the underlying ideal model. There are, however, important theoretical issues that remain to be investigated. These include the characterization of the dependence structure among observations in the combined sample, the finite sample distribution and the asymptotic distribution of the SUE. Unfortunately, there does not seem to be a unified approach which can be used to tackle these issues for SUEs for all regression models; the dependence structure and the distributions of the SUE will depend on the underlying regression model as well as the method m and criterion γ chosen. This is yet another similarity to the bootstrap method in that although the basic ideas of such computation intensive methods have wide applications, there is no unified theory covering all applications and one needs to investigate the validity of the methods on a case by case basis. We have made some progress on the simple case of a linear model. We continue to examine these issues for this and other models, and hope to report our findings once they become available.

7. Acknowledgements

The authors would like to thank Dr. Julie Zhou for her generous help and support throughout the development of this work. This work was supported by a grant from the National Science and Engineering Research Council of Canada.

REFERENCES

[1] P. J. Huber, "Robust Statistics," Wiley, New York, 1981.

[2] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw and W. A. Stahel, "Robust Statistics: The Approach Based on Influence Functions," Wiley, New York, 1986.

[3] P. J. Rousseeuw and A. M. Leroy, "Robust Regression and Outlier Detection," Wiley, New York, 1987.

[4] R. A. Maronna, R. D. Martin and V. J. Yohai, "Robust Statistics: Theory and Methods," Wiley, New York, 2006.

[5] D. G. Simpson, D. Ruppert and R. J. Carroll, "On One-Step GM-Estimates and Stability of Inferences in Linear Regression," Journal of the American Statistical Association, Vol. 87, No. 418, 1992, pp. 439-450.

[6] V. J. Yohai, "High Breakdown-Point and High Efficiency Estimates for Regression," Annals of Statistics, Vol. 15, No. 2, 1987, pp. 642-656. doi:10.1214/aos/1176350366

[7] K. A. Brownlee, "Statistical Theory and Methodology in Science and Engineering," 2nd Edition, Wiley, New York, 1965.

[8] D. F. Andrews, "A Robust Method for Multiple Linear Regression," Technometrics, Vol. 16, No. 4, 1974, pp. 523-531.

[9] D. C. Montgomery, E. A. Peck and G. G. Vining, "Introduction to Linear Regression Analysis," 4th Edition, Wiley, New York, 2006.

[10] J. R. Ashford, "An Approach to the Analysis of Data for Semi-Quantal Responses in Biological Assay," Biometrics, Vol. 15, No. 4, 1959, pp. 573-581. doi:10.2307/2527655

[11] E. Cantoni and E. Ronchetti, "Robust Inference for Generalized Linear Models," Journal of the American Statistical Association, Vol. 96, No. 455, 2001, pp. 1022-1030. doi:10.1198/016214501753209004

[12] D. M. Bates and D. G. Watts, "Nonlinear Regression Analysis and Its Applications," Wiley, New York, 1988.

[13] M. Tsao, "Partial Depth Functions for Multivariate Data," Manuscript in Preparation, 2012.


Appendix: Proofs of Theorems 1 and 2

Proof of Theorem 1:

Let p_g be the probability that a random subsample of size n_s from S_N is a good subsample containing no outliers. Since n_s is no larger than the number of good points, we have p_g > 0. With probability 1, the subsequence of good subsamples A*_1, A*_2, ... is an infinite sequence. This is so because the event that this subsequence contains only finitely many subsamples is equivalent to the event that there exists a finite l such that A_{l+1}, A_{l+2}, ... contains no good subsamples. Since p_g > 0, the probability of this latter event, and hence that of the former event, are both zero. Further, conditioning on the event that it contains no outliers, A*_j may be viewed as a random sample of size n_s taken directly without replacement from the good data set S_n. Hence with probability 1, A*_1, A*_2, ... is an infinite sequence of random samples from the finite population S_n. It follows that for any z ∈ S_n, the probability that z is in at least one of the A*_i is 1. Hence P(S_n ⊆ B) = 1, where B = ∪_{i ≥ 1} A*_i. This and the fact that B ⊆ S_n imply the theorem.

Proof of Theorem 2:

We prove (4) by induction.

For j = 1, since B_1 = A*_1, which contains exactly n_s points, W_1 is a constant and

E[F(B_1)] = E(W_1)/n = n_s/n.

Hence (4) is true for j = 1.

To find E[F(B_2)], denote by Ā*_1 the complement of A*_1 with respect to S_n. Then Ā*_1 contains n − n_s points. Since A*_2 is a random sample of size n_s taken without replacement from S_n, we may assume it contains U_1 data points from Ā*_1 and n_s − U_1 data points from A*_1. It follows that U_1 has a hypergeometric distribution,

U_1 ~ Hyperg(n, n − n_s, n_s),   (20)

with expected value

E(U_1) = n_s(n − n_s)/n.

Since B_2 = A*_1 ∪ A*_2, its number of data points is W_2 = n_s + U_1. Hence

E(W_2) = n_s + E(U_1) = n_s + n_s(n − n_s)/n = n[1 − ((n − n_s)/n)²].

It follows that

E[F(B_2)] = E(W_2)/n = 1 − ((n − n_s)/n)²,

and formula (4) is also true for j = 2.

Now assume (4) holds for some j ≥ 2. We show that it also holds for j + 1. Denote by #(A) the number of points in a set A. Then W_{j+1} = #(B_{j+1}) = #(B_j ∪ A*_{j+1}) = W_j + U_j, where U_j is the number of good data points in B_{j+1} but not in B_j. The distribution of U_j conditioning on W_j = w_j is

U_j | (W_j = w_j) ~ Hyperg(n, n − w_j, n_s).   (21)

By (21) and the assumption that (4) holds for j, we have

E(W_{j+1}) = E[E(W_j + U_j | W_j)]
           = E[W_j + n_s(n − W_j)/n]
           = n_s + (1 − n_s/n)·E(W_j)
           = n_s + (1 − n_s/n)·n[1 − ((n − n_s)/n)^j]
           = n[1 − ((n − n_s)/n)^{j+1}].

It follows that

E[F(B_{j+1})] = E(W_{j+1})/n = 1 − ((n − n_s)/n)^{j+1}.

Thus (4) also holds for j + 1, which proves Theorem 2.
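Formula (4) can also be checked by simulation. The sketch below (ours, not from the paper) estimates the expected covered fraction E[F(B_j)] by Monte Carlo and compares it with 1 − ((n − n_s)/n)^j.

```python
import numpy as np

def coverage_fraction(n, ns, j, reps=5000, rng=None):
    """Monte Carlo estimate of E|A*_1 U ... U A*_j| / n when each A*_i
    is a size-ns sample drawn without replacement from n points."""
    rng = np.random.default_rng(rng)
    total = 0
    for _ in range(reps):
        covered = set()
        for _ in range(j):
            covered.update(rng.choice(n, size=ns, replace=False).tolist())
        total += len(covered)
    return total / (reps * n)

def coverage_formula(n, ns, j):
    """Closed form (4): expected fraction of points covered by j unions."""
    return 1 - ((n - ns) / n) ** j
```

For n = 21 and n_s = 11 (the stackloss setting), three unions already cover about 89% of the points in expectation, matching the formula.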
