Open Journal of Statistics (OJS), ISSN 2161-718X, Scientific Research Publishing

doi:10.4236/ojs.2013.31002 (OJS-27914)

Articles; Physics & Mathematics

Detecting Global Influential Observations in Liu Regression Model

Aboobacker Jahufer*

Department of Mathematical Sciences, Faculty of Applied Sciences, South Eastern University of Sri Lanka, Sammanthurai, Sri Lanka

* E-mail: jahufer@yahoo.com

Published February 20, 2013 (Open Journal of Statistics, Vol. 3, No. 1, pp. 5-11)

Received May 14, 2012; revised June 30, 2012; accepted July 10, 2012

2014

This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/

In linear regression analysis, detecting anomalous observations is an important step in the model-building process. Various influence measures, based on different motivational arguments and designed to measure the influence of observations on different aspects of the regression results, are elucidated and critiqued. The detection of influential observations is complicated by the presence of multicollinearity. When the Liu estimator is used to mitigate the effects of multicollinearity, the influence of some observations can be drastically modified. In this paper, approximate deletion formulas for the detection of influential points are proposed for the Liu estimator. Two real macroeconomic data sets are used to illustrate the proposed methodology.

Liu Estimator; Global Influential Observations; Diagnostics; Multicollinearity; Case Deletion; Approximate Deletion Formulas

1. Introduction

The presence of multicollinearity among the regressors seriously affects parameter estimation and prediction. Therefore, mixed estimation and ridge-type regression have been suggested to mitigate this effect. In addition, the influence of individual observations is seriously affected by the use of such estimators, because these estimators are biased [1,2].

In the literature, several authors [1-3] have noted that the observations that are influential for ridge-type estimators differ from those that are influential for the corresponding least squares estimate, and that multicollinearity can even disguise anomalous data. Belsley [3] investigated leverage in ridge regression. Walker and Birch [2] studied the influence of observations on the ordinary ridge regression estimator (ORRE) using the case deletion method. Shi [4] proposed a local influence approach for principal component analysis by defining a generalized Cook statistic and showed that his method is equivalent to Cook’s approach under the likelihood framework. Shi and Wang [25] analyzed influential cases in the ORRE using the local influence method. Jahufer and Chen [5] studied global influential observations in the modified ridge regression estimator (MRRE), and Jahufer and Chen [6] studied local influential observations in the MRRE. In addition, Jahufer and Chen [7] analyzed local influential observations in the Liu estimator.

The main aim of this paper is therefore to assess the global influence of observations on the Liu estimator using the method of case deletion. This method has been studied extensively and is very powerful for detecting influential cases because of its intuitive appeal and its direct connection to sample influence curves. It is also widely accepted as the foundation of many other statistical approaches. The methodology proposed in this paper is illustrated using two real macroeconomic data sets. The first data set concerns the macroeconomic impact of foreign direct investment in Sri Lanka; it contains four regressors and a response variable, with 27 observations. The second is the Longley [8] data set, which consists of six regressors and a response variable, with 16 observations. This paper is composed of six sections: Section 2 gives the background and the definition of influence measures in least squares; Section 3 derives the influence measures for the Liu estimator; Section 4 describes approximate deletion formulas for the Liu estimator; Section 5 reports examples using the two real data sets; and Section 6 gives a discussion.

2. Background and Definition

2.1. Background

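The multiple linear regression model can be written in matrix form as

$$y = X\beta + e \tag{1}$$

where $y$ is an $n \times 1$ response vector, $X$ is an $n \times p$ centered and standardized known matrix (i.e., each column of $X$ is standardized to length one), $\beta$ is a $p \times 1$ vector of unknown parameters, and $e$ is an $n \times 1$ error vector with $E(e) = 0$ and $\mathrm{Var}(e) = \sigma^2 I_n$, where $I_n$ is the identity matrix of order $n$. The ordinary least squares estimator (OLSE) of $\beta$ is $\hat{\beta} = (X'X)^{-1}X'y$. The estimator of $\sigma^2$ is $s^2 = \hat{e}'\hat{e}/(n-p)$, where the residual vector is $\hat{e} = y - X\hat{\beta}$.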
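The setup of this subsection can be sketched numerically as follows. This is a minimal illustration rather than the author's code: the function names `standardize` and `olse`, the synthetic data, the centering of the response, and the divisor n - p in the variance estimate are all assumptions made for the example.

```python
import numpy as np

def standardize(X):
    """Center each column of X and scale it to unit length."""
    Xc = X - X.mean(axis=0)
    return Xc / np.linalg.norm(Xc, axis=0)

def olse(X, y):
    """Return (beta_hat, residuals, s2) for the model y = X beta + e."""
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p)                   # assumed divisor n - p
    return beta_hat, resid, s2

# Synthetic data for illustration only (not the paper's data sets).
rng = np.random.default_rng(0)
X = standardize(rng.normal(size=(20, 3)))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)
y = y - y.mean()                                   # centered response (assumption)
beta_hat, resid, s2 = olse(X, y)
```

After standardization, X'X has unit diagonal, so it is in correlation form, and the residual vector is orthogonal to the columns of X.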

This article assumes that the reader is acquainted with the basic ideas of leverage and influence in regression analysis, as presented, for instance, in the works of [1] and [9].

2.2. Definition of Influential Measures in Least Squares

The general purpose of influential analysis is to measure the changes induced in a given aspect of the analysis when the data set is perturbed. A particularly appealing perturbation scheme is case deletion. Note that this scheme is used throughout this article.

In general, the influence of a case can be viewed as the product of two factors: the first a function of the residual and the second a function of the position of the point in the X space. The position or leverage of the i-th point is measured by h_i, which is the i-th diagonal element of the “hat” matrix.

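Among the most popular single-case influence measures is the standardized difference in fit, DFFITS [1], which for the i-th case is given by

$$\mathrm{DFFITS}_i = \frac{x_i'(\hat{\beta} - \hat{\beta}_{(i)})}{SE(x_i'\hat{\beta})} = \left[\frac{h_i}{1-h_i}\right]^{1/2} \frac{\hat{e}_i}{s_{(i)}(1-h_i)^{1/2}} \tag{2}$$

where $x_i'$ is the i-th row of $X$, $\hat{e}_i$ is the i-th least squares residual, $\hat{\beta}_{(i)}$ is the least squares estimator of $\beta$ without the i-th case, and $SE(x_i'\hat{\beta}) = s_{(i)} h_i^{1/2}$ is an estimator of the standard error (SE) of the fitted value, with $s_{(i)}$ the estimate of $\sigma$ computed without the i-th case. DFFITS is the standardized change in the fitted value of a case when it is deleted. Thus it can be considered a measure of influence on individual fitted values.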

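Another useful measure of influence is Cook’s D [9], which for the i-th case is given by

$$D_i = \frac{(\hat{\beta} - \hat{\beta}_{(i)})'X'X(\hat{\beta} - \hat{\beta}_{(i)})}{p s^2} = \frac{1}{p}\,\frac{h_i}{1-h_i}\,\frac{\hat{e}_i^2}{s^2(1-h_i)} \tag{3}$$

where $D_i$ is a measure of the change in all of the fitted values when a case is deleted. Even though $D_i$ is based on a different theoretical consideration, it is closely related to DFFITS.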

Points with large values of $D_i$ have considerable influence on the least squares estimate. In general, points for which $D_i > 1$ are considered to be influential. For DFFITS, any observation for which

$$|\mathrm{DFFITS}_i| > 2\sqrt{p/n}$$

warrants attention.

It is important to mention that these measures are useful for detecting single cases with an unduly high influence. For generalizations of Equations (2) and (3) to the detection of influential sets, see [1,9].
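The point that both least squares measures are products of a leverage factor and a residual factor can be checked numerically. The sketch below computes DFFITS and Cook's D from the hat diagonals and residuals, and compares them with values obtained by actually deleting each case and refitting. The function names and the synthetic data are hypothetical, and the standard closed forms from the regression-diagnostics literature (with divisor n - p for the variance estimate) are assumed.

```python
import numpy as np

def ols_influence(X, y):
    """DFFITS_i and Cook's D_i from leverages and residuals (no refitting)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)      # hat diagonals h_i
    e = y - X @ (XtX_inv @ (X.T @ y))                # OLS residuals
    s2 = e @ e / (n - p)
    s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)   # s_(i)^2
    dffits = np.sqrt(h) * e / (np.sqrt(s2_del) * (1 - h))
    cooks_d = h * e**2 / (p * s2 * (1 - h) ** 2)
    return dffits, cooks_d

def ols_influence_refit(X, y):
    """Same measures, computed by deleting each case and refitting."""
    n, p = X.shape
    XtX = X.T @ X
    XtX_inv = np.linalg.inv(XtX)
    beta = XtX_inv @ (X.T @ y)
    e = y - X @ beta
    s2 = e @ e / (n - p)
    dffits, cooks_d = np.empty(n), np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        e_i = y[keep] - X[keep] @ beta_i
        s_i = np.sqrt(e_i @ e_i / (n - 1 - p))       # scale without case i
        h_i = X[i] @ XtX_inv @ X[i]
        diff = beta - beta_i
        dffits[i] = X[i] @ diff / (s_i * np.sqrt(h_i))
        cooks_d[i] = diff @ XtX @ diff / (p * s2)
    return dffits, cooks_d

rng = np.random.default_rng(1)                       # synthetic data only
X = rng.normal(size=(15, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=15)
fast = ols_influence(X, y)
slow = ols_influence_refit(X, y)
```

The two routes agree to numerical precision, which is what makes single-run deletion diagnostics possible in least squares.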

3. Influence Measures in Liu Estimator

3.1. Liu Estimator

The Liu estimator was introduced by Liu [10] and is defined as

$$\hat{\beta}_d = (X'X + I)^{-1}(X'y + d\hat{\beta}) = (X'X + I)^{-1}(X'X + dI)\hat{\beta} \tag{4}$$

where $I$ is the $p \times p$ identity matrix, $\hat{\beta}$ is the OLSE and $d$ is the Liu estimator biasing parameter, with $0 < d < 1$. The Liu estimator combines the ORRE [11] with the Stein estimator [12]. The ORRE is effective in practice, but it is a complicated function of its biasing parameter k. Thus complicated equations often arise when popular methods, such as the method of McDonald and Galarneau [13], the C_k criterion [14] and the GCV criterion [15], are used to choose the ridge regression biasing parameter k. The advantage of the Liu estimator over the ORRE is that the Liu estimator is a linear function of its biasing parameter d. Therefore, it is convenient to choose the Liu estimator biasing parameter d.
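A small numerical sketch of the estimator may help here. It assumes the form (X'X + I)^{-1}(X'y + d β̂) for the Liu estimator, uses a hypothetical function name `liu_estimator`, and runs on synthetic near-collinear data rather than the paper's data sets.

```python
import numpy as np

def liu_estimator(X, y, d):
    """Liu estimator (X'X + I)^{-1} (X'y + d * beta_ols)."""
    p = X.shape[1]
    XtX, Xty = X.T @ X, X.T @ y
    beta_ols = np.linalg.solve(XtX, Xty)
    return np.linalg.solve(XtX + np.eye(p), Xty + d * beta_ols)

# Synthetic, nearly collinear regressors (illustration only).
rng = np.random.default_rng(2)
Z = rng.normal(size=(30, 3))
Z[:, 2] = Z[:, 0] + 0.01 * rng.normal(size=30)
Zc = Z - Z.mean(axis=0)
X = Zc / np.linalg.norm(Zc, axis=0)                  # unit-length columns
y = rng.normal(size=30)
y = y - y.mean()

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
b0 = liu_estimator(X, y, 0.0)
b1 = liu_estimator(X, y, 1.0)
b_half = liu_estimator(X, y, 0.5)
```

Under this form the estimate at d = 1 coincides with the OLSE, the estimate at d = 0.5 lies exactly halfway between the estimates at d = 0 and d = 1 (linearity in d), and the coefficient vector is shrunk relative to the OLSE.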

The Liu estimator is very useful for mitigating the effect of near multicollinearity. In the recent literature, particularly in econometrics, engineering and other areas of statistics, the Liu estimator has given rise to a number of new techniques and ideas; see, for example, [5-7,16-20].

3.2. Leverage and Residual Measures in Liu Estimator

Using Equation (4), the vector of fitted values of the Liu estimator is

$$\hat{y}_d = X\hat{\beta}_d = X(X'X + I)^{-1}(X'X + dI)(X'X)^{-1}X'y = H_d\,y$$

where $H_d = X(X'X + I)^{-1}(X'X + dI)(X'X)^{-1}X'$ is the Liu estimator “hat matrix” (see [2,10]) and plays the same role as the hat matrix in OLSE. It is important to note that the matrix $H_d$ is not a projection matrix, because it is not idempotent; $H_d$ is called a quasi-projection matrix (see [2]). The i-th fitted value can be written in terms of the elements $h_{dij}$ of $H_d$ as $\hat{y}_{di} = \sum_{j=1}^{n} h_{dij}\,y_j$; consequently, $\partial \hat{y}_{di} / \partial y_i = h_{dii} \equiv h_{di}$.

The “Liu hat diagonals” $h_{di}$ can be interpreted as leverage in the same sense as the hat diagonals in OLSE.

The singular value decomposition (SVD) (see [21]) allows $X$ to be decomposed as $X = UDV'$, where $D$ is a $p \times p$ diagonal matrix with i-th diagonal element $\lambda_i^{1/2}$ ($\lambda_i$ is the i-th eigenvalue of $X'X$), and the columns of $V$ are the eigenvectors of $X'X$. The (i,j)-th element of the $n \times p$ matrix $U$ is $u_{ij}$, such that $\lambda_j^{1/2} u_{ij}$ is the projection of the i-th row, $x_i$, onto the j-th principal axis (eigenvector) of $X$. Using the SVD, the Liu estimator leverage of the i-th point can be written as (see [22])

$$h_{di} = \sum_{j=1}^{p} \frac{\lambda_j + d}{\lambda_j + 1}\,u_{ij}^2.$$

Several important facts can be deduced from the preceding expression. First, for $0 < d < 1$, $(\lambda_j + d)/(\lambda_j + 1) < 1$ for $j = 1, \dots, p$; that is, for every observation the Liu estimator leverage is smaller than the corresponding OLSE leverage. It can be confirmed from the above equation that

$$h_{di} = \sum_{j=1}^{p} \frac{\lambda_j + d}{\lambda_j + 1}\,u_{ij}^2 < \sum_{j=1}^{p} u_{ij}^2 = h_i.$$

Second, the leverage decreases monotonically as d decreases from 1 toward 0, because each weight $(\lambda_j + d)/(\lambda_j + 1)$ is increasing in d. Finally, the rate of decrease of the leverage depends on the position of the particular row of $X$ along the principal axes. More specifically, the leverage of a row that lies in the direction of principal axes associated with large eigenvalues will be reduced less than the leverage of a row that lies in the direction of principal axes associated with small eigenvalues (see [2]).
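The leverage properties discussed above can be verified with a short numerical sketch. It assumes the hat matrix H_d = X(X'X + I)^{-1}(X'X + dI)(X'X)^{-1}X' implied by the Liu estimator, so that the weights on the squared elements of U are (λ_j + d)/(λ_j + 1); the function names and synthetic data are hypothetical.

```python
import numpy as np

def liu_leverage(X, d):
    """Liu hat diagonals h_di = sum_j (lam_j + d) / (lam_j + 1) * u_ij^2."""
    U, sing, _ = np.linalg.svd(X, full_matrices=False)
    lam = sing**2                                   # eigenvalues of X'X
    return (U**2) @ ((lam + d) / (lam + 1.0))

def liu_hat_matrix(X, d):
    """H_d = X (X'X + I)^{-1} (X'X + dI) (X'X)^{-1} X', computed directly."""
    p = X.shape[1]
    XtX = X.T @ X
    A_inv = np.linalg.inv(XtX + np.eye(p))
    return X @ A_inv @ (XtX + d * np.eye(p)) @ np.linalg.solve(XtX, X.T)

rng = np.random.default_rng(3)                      # synthetic data only
Xc = rng.normal(size=(12, 3))
Xc = Xc - Xc.mean(axis=0)
X = Xc / np.linalg.norm(Xc, axis=0)

h_ols = liu_leverage(X, 1.0)                        # d = 1 gives OLS leverages
h_02 = liu_leverage(X, 0.2)
h_08 = liu_leverage(X, 0.8)
H_02 = liu_hat_matrix(X, 0.2)
```

Under this assumed form, the SVD expression reproduces the diagonal of H_d, every Liu leverage is no larger than its OLSE counterpart, the leverages at d = 0.2 are no larger than those at d = 0.8, and H_d fails the idempotency check, consistent with its description as a quasi-projection matrix.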

The influence of a case can be affected differently as d changes. Recall that influence is a function not only of leverage but also of the residual. Although the leverage of every point changes monotonically with d, the effect of d on the residuals is far less clear.

The i-th Liu estimator residual is defined as

$$e_{di} = y_i - \hat{y}_{di} = y_i - x_i'\hat{\beta}_d.$$

3.3. DFFITS and Cook’s Measures in Liu Estimator

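The DFFITS for the Liu estimator can be written as

$$\mathrm{DFFITS}_{di} = \frac{x_i'(\hat{\beta}_d - \hat{\beta}_{d(i)})}{SE(x_i'\hat{\beta}_d)} \tag{5}$$

where $\hat{\beta}_{d(i)}$ is the Liu estimator in (4) computed with the i-th case deleted and the denominator is an estimator of the standard error of the Liu estimator fitted value. If the Liu estimator biasing parameter d is assumed non-stochastic, then

$$\mathrm{Var}(x_i'\hat{\beta}_d) = \sigma^2\,x_i'(X'X + I)^{-1}(X'X + dI)(X'X)^{-1}(X'X + dI)(X'X + I)^{-1}x_i.$$

Since the mean squared error is a function of the fitted values and the response, neither of which depends on the individual eigenvalues of $X'X$, it is not affected by multicollinearity. For this reason, the OLSE-based estimate of $\sigma$ will be used as the measure of scale.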

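At least two versions of Cook’s D_i can be constructed for the Liu estimator:

$$D_i^{*} = \frac{1}{p s^2}\,(\hat{\beta}_d - \hat{\beta}_{d(i)})'\,X'X\,(\hat{\beta}_d - \hat{\beta}_{d(i)}) \tag{6}$$

and

$$D_i^{**} = \frac{1}{p s^2}\,(\hat{\beta}_d - \hat{\beta}_{d(i)})'\,(X'X + I)(X'X + dI)^{-1}X'X(X'X + dI)^{-1}(X'X + I)\,(\hat{\beta}_d - \hat{\beta}_{d(i)}) \tag{7}$$

where $D_i^{*}$ is the direct generalization of Cook’s D in (3) and $D_i^{**}$ is based on the fact that

$$\mathrm{Var}(\hat{\beta}_d) = \sigma^2 (X'X + I)^{-1}(X'X + dI)(X'X)^{-1}(X'X + dI)(X'X + I)^{-1}.$$

Note that both $D_i^{*}$ and $D_i^{**}$ simplify to D_i in (3) when d = 1.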

It would be desirable to be able to write these measures as functions of leverage and residual, as was done in (2) and (3). This is not possible, however, because of the scale dependency of the Liu estimator. Since the Liu estimator is not scale invariant, $X_{(i)}$ (the $X$ matrix with the i-th row deleted) has to be rescaled to unit column length before computing $\hat{\beta}_{d(i)}$. In the following section, some approximate deletion formulas are proposed.
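The scale dependency mentioned above can be illustrated with a small experiment: changing the units of the columns of X leaves the least squares fitted values unchanged but alters the Liu fitted values whenever d differs from 1. The sketch assumes the Liu estimator form (X'X + I)^{-1}(X'y + d β̂); the function names and synthetic data are hypothetical.

```python
import numpy as np

def fitted_ols(X, y):
    """Least squares fitted values."""
    return X @ np.linalg.solve(X.T @ X, X.T @ y)

def fitted_liu(X, y, d):
    """Liu fitted values for a given biasing parameter d."""
    p = X.shape[1]
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    beta_d = np.linalg.solve(X.T @ X + np.eye(p), X.T @ y + d * beta_ols)
    return X @ beta_d

rng = np.random.default_rng(4)                      # synthetic data only
X = rng.normal(size=(25, 3))
y = rng.normal(size=25)
X_rescaled = X * np.array([1.0, 10.0, 0.1])        # change of column units

ols_same = np.allclose(fitted_ols(X, y), fitted_ols(X_rescaled, y))
liu_same = np.allclose(fitted_liu(X, y, 0.5), fitted_liu(X_rescaled, y, 0.5))
liu_same_d1 = np.allclose(fitted_liu(X, y, 1.0), fitted_liu(X_rescaled, y, 1.0))
```

This is why deleting a row, which changes the column means and lengths, requires the remaining rows to be standardized again before the Liu estimator is recomputed.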

4. Deletion Formulas for Liu Estimator

In the analysis of influential observations to quantify the impact of the i-th case, the most common approach is to compute single-case diagnostics with the i-th case deleted. In Liu regression, it is impossible to derive an exact formula using case deletion because of the scale dependency of the Liu estimator. Hence, approximate deletion formulas are derived for influential measures DFFITS and two versions of Cook’s statistics.

The scale dependence of the Liu estimator precludes the development of deletion formulas of the types (2) and (3). The main problem resides in the computation of $\hat{\beta}_{d(i)}$, because the matrix $X_{(i)}$ has to be standardized in the same way as in Section 2.1, for small values of d and/or cases with low leverage. However, approximate deletion formulas can be obtained using the Sherman-Morrison-Woodbury (SMW) theorem (see [1]).

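When the i-th row is deleted from $X$, $\hat{\beta}_{d(i)}$ can be written as

$$\hat{\beta}_{d(i)} = (X_{(i)}'X_{(i)} + I)^{-1}(X_{(i)}'y_{(i)} + d\,\hat{\beta}_{(i)})$$

where $X_{(i)}$ is the matrix $X$ without the i-th row, $y_{(i)}$ is the vector of responses without the i-th entry, and $\hat{\beta}_{(i)}$ is the OLSE computed from $X_{(i)}$ and $y_{(i)}$. We assume that $X_{(i)}$ is centered and scaled so that $X_{(i)}'X_{(i)}$ is in correlation form.

If, after deletion of the i-th row, $X$ is not recentered and rescaled, then $X_{(i)}'X_{(i)}$ will not be in exact correlation form. Thus,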

which uses the SMW theorem with (see the Appendix), can be approximated as:

Therefore,.

Based on the above result, approximate versions of (5), (6) and (7) can be written as

(8)

(9)

(10)
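The general idea of this section, avoiding a refit for every deleted case by not rescaling $X_{(i)}$ and using rank-one SMW updates, can be sketched as follows. This is an illustration under stated assumptions and is not necessarily identical to formulas (8)-(10): it assumes the Liu estimator form (X'X + I)^{-1}(X'y + d β̂), and the function names and data are hypothetical.

```python
import numpy as np

def liu_deleted_no_rescale(X, y, d):
    """Row i holds the Liu estimate without case i, with X_(i) NOT rescaled."""
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    B_inv = np.linalg.inv(XtX)                  # (X'X)^{-1}
    A_inv = np.linalg.inv(XtX + np.eye(p))      # (X'X + I)^{-1}
    beta = B_inv @ Xty
    e = y - X @ beta
    h = np.einsum("ij,jk,ik->i", X, B_inv, X)   # x_i'(X'X)^{-1} x_i
    g = np.einsum("ij,jk,ik->i", X, A_inv, X)   # x_i'(X'X + I)^{-1} x_i
    out = np.empty((n, p))
    for i in range(n):
        x = X[i]
        beta_i = beta - (B_inv @ x) * e[i] / (1.0 - h[i])   # OLS without case i
        rhs = Xty - x * y[i] + d * beta_i
        Ar, Ax = A_inv @ rhs, A_inv @ x
        out[i] = Ar + Ax * (x @ Ar) / (1.0 - g[i])          # SMW rank-one update
    return out

def liu_refit_no_rescale(X, y, d, i):
    """Reference value: drop row i and recompute everything from scratch."""
    Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
    XtX = Xi.T @ Xi
    beta_i = np.linalg.solve(XtX, Xi.T @ yi)
    return np.linalg.solve(XtX + np.eye(X.shape[1]), Xi.T @ yi + d * beta_i)

rng = np.random.default_rng(5)                  # synthetic data only
Xc = rng.normal(size=(14, 3))
Xc = Xc - Xc.mean(axis=0)
X = Xc / np.linalg.norm(Xc, axis=0)
y = rng.normal(size=14)
y = y - y.mean()
fast = liu_deleted_no_rescale(X, y, 0.7)
slow = np.array([liu_refit_no_rescale(X, y, 0.7, i) for i in range(14)])
```

The update is exact for the unrescaled problem and only approximate relative to a deletion in which $X_{(i)}$ is recentered and rescaled; all quantities it needs come from a single fit on the full data.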

5. Examples

5.1. Example 1: Macroeconomic Impact of Foreign Direct Investment (MIFDI) Data

Sun [23] studied MIFDI in China for 1979-1996. Based on his theory, the MIFDI data were collected in Sri Lanka from 1978 to 2004 to illustrate the methodologies proposed in this paper. The data set consists of four regressors (Foreign Direct Investment, Gross Domestic Product Per Capita, Exchange Rate and Interest Rate) and one response variable (Total Domestic Investment), with 27 observations. The selected variables were tested for two statistical conditions: integration and multicollinearity. The test results showed that the variables are integrated of the same order, I(1), at the 1% level of significance. The scaled condition number of this data set is 31,244; this large value suggests the presence of an unusually high level of multicollinearity among the regressors (the proposed cutoff is 30; see [1]). The Liu estimator biasing parameter for this data set is estimated as d = 0.692.
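A scaled condition number of the kind quoted above can be computed along the following lines. This sketch assumes one common convention from the collinearity-diagnostics literature: an intercept column is appended, every column is scaled to unit length without centering, and the condition number is the ratio of the largest to the smallest singular value. Whether the reported value was obtained in exactly this way is an assumption, and the function name and data are hypothetical.

```python
import numpy as np

def scaled_condition_number(X, add_intercept=True):
    """Largest / smallest singular value after scaling columns to unit length."""
    X = np.asarray(X, dtype=float)
    if add_intercept:
        X = np.column_stack([np.ones(len(X)), X])
    Xs = X / np.linalg.norm(X, axis=0)              # unit-length columns
    sing = np.linalg.svd(Xs, compute_uv=False)      # sorted in descending order
    return sing[0] / sing[-1]

rng = np.random.default_rng(6)                      # synthetic data only
x1 = rng.normal(size=50)
x2 = x1 + 1e-3 * rng.normal(size=50)                # almost a copy of x1
kappa_bad = scaled_condition_number(np.column_stack([x1, x2]))
kappa_ok = scaled_condition_number(np.eye(4)[:, :2], add_intercept=False)
```

Orthonormal columns give a condition number of one, while two nearly identical regressors push it far beyond the cutoff of 30.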

The global influence measures (leverage, residual, DFFITS and the two versions of Cook’s D_i) were computed, and the results for the seven most influential cases are given in Table 1. The influential cases detected by these methods are the same, except for case 26 in the leverage measure and case 14 in the residual measure; only their order by magnitude differs.

Using the approximate case deletion formulas in Equations (8)-(10), the influential cases were estimated, and the results are given in Table 2. From this table, it can be seen that the influential cases detected by the approximate case deletion formulas for DFFITS and the two versions of Cook’s D_i are the same; only their order by magnitude differs. Moreover, the influential cases detected by the one-by-one case deletion formulas in Equations (5)-(7) and by the approximate deletion formulas in (8)-(10) are exactly the same; again, only their order by magnitude differs.

Table 1. The seven most influential observations using leverage, residual, DFFITS and the two versions of Cook’s D_i.
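The comparison described here, the same set of cases but possibly in a different order, can be made explicit with a small helper. The sketch below ranks cases by the absolute value of a measure and compares two rankings; the function names are hypothetical and the numbers are made up for illustration, not taken from Tables 1 or 2.

```python
import numpy as np

def top_cases(values, k):
    """1-based case numbers of the k largest |values|, most influential first."""
    order = np.argsort(-np.abs(np.asarray(values, dtype=float)), kind="stable")
    return [int(i) + 1 for i in order[:k]]

def compare_rankings(rank_a, rank_b):
    """Return (same set of cases?, same order?)."""
    return set(rank_a) == set(rank_b), list(rank_a) == list(rank_b)

# Made-up influence values for five cases (illustration only).
exact = np.array([0.9, 0.1, -1.4, 0.3, 0.7])
approx = np.array([1.0, 0.1, -1.2, 0.2, 1.1])
r_exact = top_cases(exact, 3)
r_approx = top_cases(approx, 3)
same_set, same_order = compare_rankings(r_exact, r_approx)
```

In this toy example both measures flag cases 1, 3 and 5, but rank them differently, which is the situation reported for the MIFDI data.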

To verify these results, Table 3 presents the Liu estimates for the full data and for the data without some of the influential cases. In this table, the value in parentheses indicates the percentage change in the parameter value. The results reveal that case 3 is the most influential case, while case 22 is the seventh most influential among the detected cases. It is also clear from this table that omitting any one of the influential cases 3, 23, 1, 27, 2, 18 and 22 changes the Liu estimates. Among these, cases 3, 23, 1, 2 and 18 have a remarkable influence, while cases 27 and 22 have little influence.

5.2. Example 2: Longley Data

The Longley data set [8] has also been used to explain the effect of extreme multicollinearity on the OLSE. The scaled condition number (see [1]) of this data set is 43,275. This large value suggests the presence of a high level of multicollinearity among the regressors. Cook [24] used these data to identify the influential observations in OLSE using Cook’s D_i and found that cases 5, 16, 4, 10 and 15 (in this order) were the most influential observations (see Table 4). Walker and Birch [2] analyzed the same data to detect anomalous observations in the ORRE using the case deletion method. They found that cases 16, 10, 4, 15 and 5 (in this order) were the most influential observations (see Table 4). Shi and Wang [25] also analyzed the same data to detect influential observations on the ridge regression estimator using the local influence method. They found that cases 10, 4, 15, 16 and 1 (in this order) were the five most anomalous observations.

In this paper, the same data set is used to assess the influential observations in the Liu estimator using the global influence measures: Cook’s D_i, DFFITS, leverage and residual. The five most influential cases and the corresponding values are given in Table 4.

The approximate deletion formulas in Equations (8)-(10) are used to detect influential cases for the Longley data, and the detected influential cases are given in Table 5. According to this table, the influential cases identified using the approximate deletion formulas for DFFITS and the two versions of Cook’s D_i are exactly the same. In addition, the influential cases identified by the one-by-one case deletion formulas in Equations (5)-(7) and by the approximate deletion formulas in (8)-(10) are precisely the same for the Longley data.

The influential cases in the MIFDI and Longley data were identified using the one-by-one deletion formulas in Equations (5)-(7) and the approximate deletion formulas in Equations (8)-(10), respectively. The identified influential cases for the MIFDI and Longley data are the same for both sets of formulas; only their order by magnitude differs for the MIFDI data. Hence, instead of the one-by-one case deletion formulas, the approximate case deletion formulas are suitable for identifying influential cases in the Liu estimator.

Table 2. The seven most influential observations according to the approximate case deletion formulas.

Table 3. Impact of influential cases on the Liu estimator parameters.

Table 4. The five most influential observations in the Longley data.

Table 5. The five most influential cases for the Longley data using the deletion formulas.

6. Discussion

This article shows that users of the Liu estimator should not rely on influence measures obtained for the OLSE. Once the value of d is determined, influence measures should be computed for that d. If, after analyzing these indexes, it is decided to delete one or more cases from the analysis, the whole situation has to be reassessed in terms of both influence and multicollinearity.

In this study, the Liu estimator shrinkage parameter d is estimated first, and the Liu estimator coefficients are then estimated for that value of d. Using these quantities, the influential observations are identified. However, the value of the shrinkage parameter d depends on every observation. Hence, d should be re-estimated each time an influential case is deleted. This is a very difficult task, so this issue will be studied in future research.

The main advantage of the deletion formulas in Section 4 is that, as in least squares, the estimator does not have to be recomputed every time a case is deleted. For a given value of d, all of the elements in (8), (9) and (10) are readily available from a single run of the Liu estimator. Moreover, these measures, based on deletion formulas, are particularly helpful for large data sets. Furthermore, the deletion formulas provide computationally inexpensive approximate influence measures for the Liu estimator.

Although no conventional cutoff points have been introduced or developed for the Liu estimator global influence diagnostics (Cook’s measures, DFFITS, leverage and the Liu estimator standardized residual), an index plot appears to be a promising and conventional procedure for disclosing influential cases. The lack of cutoff values remains a bottleneck for these influence methods. These are additional active issues for future research.

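Appendix

Sherman-Morrison-Woodbury (SMW) Theorem

Consider the $p \times p$ matrix $X'X$ and let $x_i'$ be the i-th row of $X$. Note that $X_{(i)}'X_{(i)} = X'X - x_i x_i'$, where $X_{(i)}$ is the matrix $X$ with the i-th row removed. The inverse matrix of $X_{(i)}'X_{(i)}$ is:

$$(X'X - x_i x_i')^{-1} = (X'X)^{-1} + \frac{(X'X)^{-1}x_i x_i'(X'X)^{-1}}{1 - x_i'(X'X)^{-1}x_i}.$$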
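The rank-one form of the SMW identity can be checked numerically. The sketch below assumes the identity is applied to X'X with a single row removed; the helper name `smw_downdate` and the synthetic data are hypothetical.

```python
import numpy as np

def smw_downdate(M_inv, x):
    """Inverse of (M - x x') given M_inv = M^{-1}, for symmetric M."""
    Mx = M_inv @ x
    return M_inv + np.outer(Mx, Mx) / (1.0 - x @ Mx)

rng = np.random.default_rng(7)                      # synthetic data only
X = rng.normal(size=(10, 3))
i = 4                                               # row to delete (0-based)
XtX_inv = np.linalg.inv(X.T @ X)

X_del = np.delete(X, i, axis=0)
direct = np.linalg.inv(X_del.T @ X_del)             # recomputed from scratch
updated = smw_downdate(XtX_inv, X[i])               # rank-one update only
```

The update needs only the inverse already computed for the full data, which is the source of the computational savings in deletion diagnostics.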

References

[1] D. A. Belsley, E. Kuh and R. E. Welsch, “Regression Diagnostics: Identifying Influential Data and Sources of Collinearity,” Wiley, New York, 1980. doi:10.1002/0471725153

[2] E. Walker and J. B. Birch, “Influence Measures in Ridge Regression,” Technometrics, Vol. 30, No. 2, 1988, pp. 221-227. doi:10.1080/00401706.1988.10488370

[3] D. A. Belsley, “Conditioning Diagnostics: Collinearity and Weak Data in Regression,” Wiley, New York, 1991.

[4] L. Shi, “Local Influence in Principal Component Analysis,” Biometrika, Vol. 84, No. 1, 1997, pp. 175-186. doi:10.1093/biomet/84.1.175

[5] A. Jahufer and J. B. Chen, “Assessing Global Influential Observations in Modified Ridge Regression,” Statistics and Probability Letters, Vol. 79, No. 4, 2009, pp. 513-518. doi:10.1016/j.spl.2008.09.019

[6] A. Jahufer and J. Chen, “Measuring Local Influential Observations in Modified Ridge Regression,” Journal of Data Science, Vol. 9, No. 3, 2011, pp. 359-372.

[7] A. Jahufer and J. B. Chen, “Identifying Local Influential Observations in Liu Estimator,” Journal of Metrika, Vol. 75, No. 3, 2012, pp. 425-438. doi:10.1007/s00184-010-0334-4

[8] J. W. Longley, “An Appraisal of Least Squares Programs for the Electronic Computer from the Point of View of the User,” Journal of American Statistical Association, Vol. 62, No. 319, 1967, pp. 819-841. doi:10.1080/01621459.1967.10500896

[9] R. D. Cook and S. Weisberg, “Residuals and Influence in Regression,” Chapman & Hall, London, 1982.

[10] K. Liu, “A New Class of Biased Estimate in Linear Regression,” Communications in Statistics: Theory and Methods, Vol. 22, No. 2, 1993, pp. 393-402.

[11] A. E. Hoerl and R. W. Kennard, “Ridge Regression: Biased Estimation for Nonorthogonal Problems,” Technometrics, Vol. 12, No. 1, 1970, pp. 55-67. doi:10.1080/00401706.1970.10488634

[12] C. Stein, “Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution,” Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, December 1954 and July-August 1955, pp. 197-206.

[13] G. C. McDonald and D. I. Galarneau, “A Monte Carlo Evaluation of Some Ridge-Type Estimators,” Journal of American Statistical Association, Vol. 70, No. 350, 1975, pp. 407-416. doi:10.1080/01621459.1975.10479882

[14] C. L. Mallows, “Some Comments on Cp,” Technometrics, Vol. 15, No. 4, 1973, pp. 661-675.

[15] G. H. Golub, M. Heath and G. Wahba, “Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter,” Technometrics, Vol. 21, No. 2, 1979, pp. 215-223. doi:10.1080/00401706.1979.10489751

[16] F. Akdeniz and S. Kaçıranlar, “More on the New Biased Estimator in Linear Regression,” The Indian Journal of Statistics, Vol. 63, No. 3, 2001, pp. 321-325.

[17] S. Kaçıranlar and S. Sakallıoğlu, “Combining the Liu Estimator and the Principal Component Regression Estimator,” Communications in Statistics: Theory and Methods, Vol. 30, No. 12, 2001, pp. 2699-2705.

[18] S. Kaçıranlar, G. P. H. Styan and H. J. Werner, “A New Biased Estimator in Linear Regression and a Detailed Analysis of the Widely Analyzed Dataset on Portland Cement,” The Indian Journal of Statistics, Vol. 61, No. B3, 1999, pp. 443-459.

[19] M. H. Hubert and P. Wijekoon, “Improvement of the Liu Estimator in Linear Regression Model,” Journal of Statistical Papers, Vol. 47, No. 3, 2006, pp. 471-479. doi:10.1007/s00362-006-0300-4

[20] N. Torigoe and K. Ujiie, “On the Restricted Liu Estimator in the Gauss-Markov Model,” Communications in Statistics: Theory and Methods, Vol. 35, No. 9, 2006, pp. 1713-1722.

[21] J. Mandel, “Use of the Singular Value Decomposition in Regression Analysis,” The American Statistician, Vol. 36, No. 1, 1982, pp. 15-24.

[22] A. S. Topçubaşı and N. Billor, “A Class of Biased Estimators and Their Diagnostic Measures,” 2001. http://idari.cu.edu.tr/sempozyum/bil26.htm

[23] H. Sun, “Macroeconomic Impact of Direct Foreign Investment in China: 1979-1996,” Blackwell Publishers Ltd., Oxford, 1998.

[24] R. D. Cook, “Detection of Influential Observations in Linear Regression,” Technometrics, Vol. 19, No. 1, 1977, pp. 15-18. doi:10.2307/1268249

[25] L. Shi and X. Wang, “Local Influence in Ridge Regression,” Computational Statistics & Data Analysis, Vol. 31, No. 3, 1999, pp. 341-353. doi:10.1016/S0167-9473(99)00019-5