In linear regression analysis, detecting anomalous observations is an important step in the model-building process. Various influence measures, based on different motivational arguments and designed to gauge the impact of observations on different aspects of regression results, are elucidated and critiqued. The detection of influential observations is complicated by the presence of multicollinearity. This paper shows that when the Liu estimator is used to mitigate the effects of multicollinearity, the influence of some observations can change drastically. Approximate deletion formulas for the detection of influential points are proposed for the Liu estimator. Two real macroeconomic data sets are used to illustrate the proposed methodologies.

The presence of multicollinearity among the regressors seriously affects parameter estimation and prediction. Mixed estimation and ridge-type regression have therefore been suggested to mitigate this effect. In addition to multicollinearity, such estimators are seriously affected by the presence of influential observations in the observed data, since these estimators are not unbiased [1,2].

In the literature, several authors [1-3] have noted that the influential observations for ridge-type estimators differ from those for the corresponding least squares estimator, and that multicollinearity can even disguise anomalous data. Belsley [

The main aim of this paper is therefore to assess the global influence of observations on the Liu estimator using the method of case deletion. This method has been extensively studied and is very powerful for detecting influential cases because of its intuitive appeal and its direct connection to the sample influence curve. It is also widely accepted as the foundation of many other statistical approaches. The methodology proposed in this paper is illustrated using two real macroeconomic data sets. The first data set concerns the macro impact of foreign direct investment (MIFDI) in Sri Lanka; it contains four regressors and a response variable with 27 observations. The second data set is Longley [

A multiple linear regression model in matrix form can be written as

y = Xβ + e,      (1)

where y is an n × 1 response vector, X is an n × p centered and standardized known matrix (i.e. the length of each column of X is standardized to one), β is a p × 1 vector of unknown parameters, and e is an n × 1 error vector with E(e) = 0 and Cov(e) = σ²I_{n}, where I_{n} is an identity matrix of order n. The ordinary least squares estimator (OLSE) of β is β̂ = (X'X)^{-1}X'y. The estimator of σ² is σ̂² = ê'ê/(n − p), where the residual vector is ê = y − Xβ̂.
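As a concrete illustration, the quantities above are a few lines of linear algebra. The following sketch uses synthetic data (all names and values are illustrative, not taken from the paper's data sets):

```python
import numpy as np

# Illustrative sketch: fit model (1) by OLS on synthetic data, with X
# centered and scaled to unit column length as the paper assumes.
rng = np.random.default_rng(0)
n, p = 27, 4
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                # center each column
X = X / np.linalg.norm(X, axis=0)     # scale each column to unit length
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLSE (X'X)^{-1} X'y
e_hat = y - X @ beta_hat                       # residual vector
sigma2_hat = e_hat @ e_hat / (n - p)           # sigma^2 estimate e'e/(n-p)
```

The residuals are orthogonal to the columns of X, which is a quick internal check on the fit.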

This article assumes that the reader is acquainted with the basic ideas of leverage and influence in regression analysis, as presented, for instance, in the works of [

The general purpose of influence analysis is to measure the changes induced in a given aspect of the analysis when the data set is perturbed. A particularly appealing perturbation scheme is case deletion, and this scheme is used throughout this article.

In general, the influence of a case can be viewed as the product of two factors: the first a function of the residual and the second a function of the position of the point in the X space. The position or leverage of the i-th point is measured by h_{i}, which is the i-th diagonal element of the “hat” matrix.
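The leverage values and the standard properties of the hat matrix are easy to verify numerically; a short sketch with synthetic data:

```python
import numpy as np

# The "hat" matrix H = X (X'X)^{-1} X' maps y to the fitted values; its
# i-th diagonal element h_i is the leverage of the i-th point.
rng = np.random.default_rng(1)
n, p = 20, 3
X = rng.normal(size=(n, p))
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
# H is a projection matrix: idempotent, each h_i in [0, 1], sum(h_i) = p.
```

The idempotence of H is worth keeping in mind here, because the Liu analogue introduced later is not idempotent.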

Among the most popular single-case influence measures is the difference in fits, standardized (DFFITS) [

DFFITS_{i} = (ŷ_{i} − ŷ_{i(i)}) / (σ̂_{(i)} √h_{i}),      (2)

where ŷ_{i(i)} = x_{i}'β̂_{(i)}, β̂_{(i)} is the least squares estimator of β computed without the i-th case, and the denominator is an estimator of the standard error (SE) of the fitted value. DFFITS_{i} is the standardized change in the fitted value of a case when that case is deleted. Thus it can be considered a measure of influence on individual fitted values.

Another useful measure of influence is Cook's D [

D_{i} = (β̂ − β̂_{(i)})' X'X (β̂ − β̂_{(i)}) / (p σ̂²),      (3)

which is a measure of the change in all of the fitted values when a case is deleted. Even though D_{i} is based on different theoretical considerations, it is closely related to DFFITS.

Points with large values of D_{i} have considerable influence on the least squares estimate. In general, points for which D_{i} > 1 are considered influential. For DFFITS, any observation for which

|DFFITS_{i}| > 2√(p/n)

warrants attention.
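A short numerical sketch of these single-case diagnostics, using the standard OLS closed forms and rule-of-thumb cutoffs (synthetic data, not the paper's):

```python
import numpy as np

# Single-case OLS diagnostics via the standard closed forms:
# DFFITS_i = t_i * sqrt(h_i/(1-h_i)),  D_i = e_i^2 h_i / (p s^2 (1-h_i)^2).
rng = np.random.default_rng(2)
n, p = 25, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)
s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)  # s^2 without case i
t = e / np.sqrt(s2_del * (1 - h))                       # studentized residual
dffits = t * np.sqrt(h / (1 - h))
cooks_d = e**2 * h / (p * s2 * (1 - h)**2)

# Rule-of-thumb flags: |DFFITS_i| > 2*sqrt(p/n) and D_i > 1.
flagged = np.abs(dffits) > 2 * np.sqrt(p / n)
```

The closed forms agree exactly with brute-force refitting after deleting a case, which is what makes OLS diagnostics so cheap to compute.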

It is important to mention that these measures are useful for detecting single cases with unduly high influence. For generalizations of Equations (2) and (3) to the detection of influential sets, see [1,9].

The Liu estimator was introduced by Liu [

β̂_{d} = (X'X + I)^{-1}(X'X + dI)β̂,      (4)

where I is an identity matrix, β̂ is the OLSE, and d is the Liu biasing parameter with 0 < d < 1. The Liu estimator combines the advantages of the Stein estimator and the ordinary ridge regression estimator (ORRE) [11,12]. The ORRE is effective in practice, but it is a complicated function of its biasing parameter. Thus we often meet complicated equations when using popular selection methods such as the C_{k} criterion [
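The estimator in (4) is a one-line computation. The following sketch (synthetic, deliberately near-collinear data) checks that it reduces to OLS at d = 1 and shrinks the coefficient vector for 0 < d < 1:

```python
import numpy as np

# Liu estimator (4): beta_d = (X'X + I)^{-1} (X'X + d I) beta_ols, 0 < d < 1.
def liu(X, y, d):
    G = X.T @ X
    b_ols = np.linalg.solve(G, X.T @ y)
    I = np.eye(X.shape[1])
    return np.linalg.solve(G + I, (G + d * I) @ b_ols)

rng = np.random.default_rng(3)
n, p = 30, 4
X = rng.normal(size=(n, p))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=n)   # induce near multicollinearity
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)               # unit column lengths
y = X @ np.ones(p) + 0.1 * rng.normal(size=n)

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_liu = liu(X, y, 0.5)
```

In the eigenbasis of X'X each coefficient is scaled by (λ_{j} + d)/(λ_{j} + 1) < 1, so the Liu estimate always has smaller norm than the OLSE for 0 < d < 1.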

The Liu estimator is very useful for mitigating the effects of near multicollinearity. Moreover, in the recent literature, particularly in econometrics, engineering and other statistical areas, the Liu estimator has generated a number of new techniques and ideas; see, for example, [5-7,16-20].

Using Equation (4), the vector of fitted values from the Liu estimator is

ŷ_{d} = Xβ̂_{d} = X(X'X + I)^{-1}(X'X + dI)(X'X)^{-1}X'y = H_{d}y,

where H_{d} = X(X'X + I)^{-1}(X'X + dI)(X'X)^{-1}X' is the Liu estimator "hat matrix" (see [2,10]) and plays the same role as the hat matrix in OLSE. Denote the i-th diagonal element of H_{d} by h_{di}.

The "Liu hat diagonals" h_{di} can be interpreted as leverages in the same sense as the hat diagonals in OLSE. It is important to note that H_{d} is not idempotent; hence it is not a projection matrix and is called a quasi-projection matrix (see [
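The quasi-projection property is easy to verify numerically; a sketch with synthetic data:

```python
import numpy as np

# Liu "hat" matrix H_d = X (X'X + I)^{-1} (X'X + d I) (X'X)^{-1} X'.
rng = np.random.default_rng(4)
n, p, d = 15, 3, 0.4
X = rng.normal(size=(n, p))
G = X.T @ X
I = np.eye(p)
Hd = X @ np.linalg.solve(G + I, (G + d * I) @ np.linalg.solve(G, X.T))
h_d = np.diag(Hd)                      # Liu hat diagonals h_di
H = X @ np.linalg.solve(G, X.T)        # ordinary OLS hat matrix
```

Unlike H, the matrix H_d fails the idempotence check, and its diagonals never exceed the corresponding OLS leverages for 0 < d < 1.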

The singular value decomposition (SVD) (see [

X = UΛ^{1/2}V',

where U is an n × p matrix with orthonormal columns, Λ^{1/2} is the p × p diagonal matrix with i-th diagonal element λ_{i}^{1/2} (λ_{i} is the i-th eigenvalue of X'X), and the columns of V are the eigenvectors of X'X. The (i,j)-th element of the n × p matrix U is u_{ij}, such that λ_{j}^{1/2}u_{ij} is the projection of the i-th row, x_{i}, onto the j-th principal axis (eigenvector) of X. Using the SVD, the Liu estimator leverage of the i-th point can be written as (see [

h_{di} = Σ_{j=1}^{p} u_{ij}² (λ_{j} + d)/(λ_{j} + 1).

Several important facts can be deduced from the preceding expression. First, for 0 < d < 1, h_{di} < h_{i} for i = 1, ..., n; that is, for every observation the Liu estimator leverage is smaller than the corresponding OLSE leverage. This can be confirmed from the above equation, since h_{i} = Σ_{j} u_{ij}² is recovered at d = 1.

Second, the leverage decreases monotonically as d decreases toward zero. Finally, the rate of decrease depends on the position of the particular row of X along the principal axes. More specifically, the leverage of a row that lies in the direction of principal axes associated with large eigenvalues is reduced less than the leverage of a row that lies in the direction of principal axes associated with small eigenvalues (see [
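Numerically, the leverage expression h_{di} = Σ_{j} u_{ij}²(λ_{j} + d)/(λ_{j} + 1) can be checked against the diagonal of H_{d}, along with its monotone behavior in d (a sketch with synthetic data):

```python
import numpy as np

# Check h_di = sum_j u_ij^2 (lam_j + d)/(lam_j + 1) against diag(H_d).
rng = np.random.default_rng(5)
n, p, d = 12, 3, 0.3
X = rng.normal(size=(n, p))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
lam = s**2                                    # eigenvalues of X'X

h_svd = (U**2) @ ((lam + d) / (lam + 1))      # leverage via the SVD formula

G = X.T @ X
I = np.eye(p)
Hd = X @ np.linalg.solve(G + I, (G + d * I) @ np.linalg.solve(G, X.T))
```

Because each factor (λ_{j} + d)/(λ_{j} + 1) is increasing in d, every h_{di} shrinks as d moves from 1 toward 0, and directions with small λ_{j} shrink the most.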

Influence can be affected differentially as d varies. Remember that influence is a function not only of the leverage but also of the residual. Although the leverage of every point changes monotonically with d, the effect on the residuals is far less clear.

The i-th Liu estimator residual is defined as

e_{di} = y_{i} − x_{i}'β̂_{d},

so that the residual vector is e_{d} = (I − H_{d})y.

The DFFITS for the Liu estimator can be written as

DFFITS_{di} = (x_{i}'β̂_{d} − x_{i}'β̂_{d(i)}) / (σ̂_{(i)} √((H_{d}H_{d}')_{ii})),      (5)

where β̂_{d(i)} is the Liu estimator in (4) computed with the i-th case deleted and the denominator is an estimator of the standard error of the Liu estimator fitted value. If the Liu biasing parameter d is assumed non-stochastic, then Var(ŷ_{d}) = σ²H_{d}H_{d}'.

Hence the mean squared error σ̂² = ê'ê/(n − p) is a function of the fitted values and the response, neither of which depends on the individual eigenvalues of X'X, so it is not affected by multicollinearity. For this reason, the OLSE scale estimates σ̂ and σ̂_{(i)} will be used as measures of scale.

At least two versions of Cook's D_{i} can be constructed for the Liu estimator:

D_{d1,i} = (β̂_{d} − β̂_{d(i)})' X'X (β̂_{d} − β̂_{d(i)}) / (p σ̂²)      (6)

and

D_{d2,i} = (β̂_{d} − β̂_{d(i)})' [Var(β̂_{d})/σ²]^{-1} (β̂_{d} − β̂_{d(i)}) / (p σ̂²),      (7)

where D_{d1,i} is the direct generalization of Cook's D in (3) and D_{d2,i} is based on the fact that, for non-stochastic d, Var(β̂_{d}) = σ²(X'X + I)^{-1}(X'X + dI)(X'X)^{-1}(X'X + dI)(X'X + I)^{-1}.

Note that both D_{d1,i} and D_{d2,i} simplify to D_{i} in (3) when d = 1.
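The reduction to ordinary Cook's D at d = 1 can be verified by brute-force refitting. In this sketch (synthetic data) the second version is taken to weight the coefficient change by the inverse of Var(β̂_{d})/σ², which is one plausible reading of the covariance-based version; all names here are illustrative:

```python
import numpy as np

def liu(X, y, d):
    G = X.T @ X
    I = np.eye(X.shape[1])
    return np.linalg.solve(G + I, (G + d * I) @ np.linalg.solve(G, X.T @ y))

rng = np.random.default_rng(6)
n, p, d = 20, 3, 1.0          # d = 1: both versions should equal Cook's D
X = rng.normal(size=(n, p))
y = X @ np.ones(p) + rng.normal(size=n)

G = X.T @ X
I = np.eye(p)
W = np.linalg.solve(G + I, G + d * I)    # beta_d = W @ beta_ols
b_d = liu(X, y, d)
H = X @ np.linalg.solve(G, X.T)
e = y - H @ y
s2 = e @ e / (n - p)

D1 = np.empty(n)   # direct generalization of Cook's D
D2 = np.empty(n)   # covariance-weighted version (assumed form)
for i in range(n):
    b_di = liu(np.delete(X, i, axis=0), np.delete(y, i), d)
    diff = b_d - b_di
    D1[i] = diff @ G @ diff / (p * s2)
    v = np.linalg.solve(W, diff)         # undo the Liu shrinkage factor W
    D2[i] = v @ G @ v / (p * s2)
```

At d = 1 the shrinkage matrix W is the identity, so the two versions coincide with the ordinary Cook's distance; for 0 < d < 1 they generally differ.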

It would be desirable to write these measures as explicit functions of leverage and residual, as was done in (2) and (3). This is not possible, however, because of the scale dependency of the Liu estimator. Since the Liu estimator is not scale invariant, X_{(i)} (the X matrix with the i-th row deleted) has to be rescaled to unit column length before computing β̂_{d(i)}. In the following section some approximate deletion formulas are proposed.

In the analysis of influential observations, the most common approach to quantifying the impact of the i-th case is to compute single-case diagnostics with the i-th case deleted. In Liu regression, it is impossible to derive an exact deletion formula because of the scale dependency of the Liu estimator. Hence, approximate deletion formulas are derived for the influence measures DFFITS and the two versions of Cook's statistic.

The scale dependence of the Liu estimator precludes the development of exact deletion formulas of the types (2) and (3). The main problem resides in the computation of β̂_{d(i)}, because the matrix X_{(i)} has to be standardized in the same way as in Section 2.1. However, approximate deletion formulas, which work well for small values of d and/or for cases with low leverage, can be obtained using the Sherman–Morrison–Woodbury (SMW) theorem (see [

When the i-th row is deleted from X and y, the cross-product terms can be written as

X_{(i)}'X_{(i)} = X'X − x_{i}x_{i}' and X_{(i)}'y_{(i)} = X'y − x_{i}y_{i},

where X_{(i)} is the matrix X without the i-th row and y_{(i)} is the response vector without the i-th entry. We assume that X is centered and scaled so that X'X is in correlation form.

If, after deletion of the i-th row, X_{(i)} is not recentered and rescaled, then X_{(i)}'X_{(i)} will not be in exact correlation form. Thus,

(X_{(i)}'X_{(i)} + I)^{-1} = (X'X + I − x_{i}x_{i}')^{-1},

which, using the SMW theorem with A = X'X + I (see the Appendix), can be written as

(X'X + I)^{-1} + (X'X + I)^{-1}x_{i}x_{i}'(X'X + I)^{-1} / (1 − x_{i}'(X'X + I)^{-1}x_{i}).

Therefore, β̂_{d(i)} can be approximated as

β̂_{d(i)} ≈ (X'X + I − x_{i}x_{i}')^{-1}(X'y − x_{i}y_{i} + dβ̂_{(i)}).
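The rank-one update means no refit is needed for each deleted case. The sketch below (synthetic data) checks that the SMW-updated inverse reproduces the directly computed deleted-case inverse and assembles the approximate deleted Liu coefficients:

```python
import numpy as np

# SMW rank-one downdate of (X'X + I)^{-1}, then approximate beta_d(i)
# (X_(i) is not recentered or rescaled, as in the approximation above).
rng = np.random.default_rng(7)
n, p, d = 25, 4, 0.6
X = rng.normal(size=(n, p))
y = X @ np.ones(p) + rng.normal(size=n)

G = X.T @ X
A_inv = np.linalg.inv(G + np.eye(p))     # (X'X + I)^{-1}, computed once

i = 3                                     # case to delete (illustrative)
xi, yi = X[i], y[i]
u = A_inv @ xi
Ai_inv = A_inv + np.outer(u, u) / (1 - xi @ u)        # SMW update

b_ols_i = np.linalg.solve(G - np.outer(xi, xi), X.T @ y - xi * yi)
b_d_i = Ai_inv @ (X.T @ y - xi * yi + d * b_ols_i)    # approx. beta_d(i)
```

The denominator 1 − x_{i}'(X'X + I)^{-1}x_{i} is always positive because X_{(i)}'X_{(i)} + I is positive definite, so the update is numerically safe.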

Based on the above result, approximate versions of (5), (6) and (7), denoted (8), (9) and (10) respectively, are obtained by replacing β̂_{d(i)} with this approximation.

Sun [

The global influence measures leverage, residual, DFFITS and the two versions of Cook's D_{i} were computed, and the results for the seven most influential cases are given in

Using the approximate case-deletion formulas in Equations (8)-(10), the influential cases were estimated. The influential cases identified by the two versions of Cook's D_{i} are the same; only the order of magnitude changes. Moreover, the influential cases detected by the one-by-one case-deletion formulas in Equations (5)-(7) and by the approximate deletion formulas in (8)-(10) are exactly the same; again, only the order of magnitude changes.

These results can also be verified from the corresponding index plots.

The Longley data set [ ] has previously been analyzed using D_{i}; cases 5, 16, 4, 10 and 15 (in this order) were found to be the most influential observations (see

In this paper, I use the same data set to assess influential observations for the Liu estimator using the global influence measures Cook's D_{i}, DFFITS, leverage and residual. The estimated results, the five most influential cases and the corresponding values are given in

The approximate deletion formulas in Equations (8)-(10) were used to detect influential cases for the Longley data. The influential cases identified by the two versions of Cook's D_{i} are exactly the same. Moreover, the influential cases identified by the one-by-one case-deletion formulas in Equations (5)-(7) and by the approximate deletion formulas in (8)-(10) are precisely the same for the Longley data.

The influential cases in the MIFDI and Longley data were identified using the one-by-one deletion formulas in Equations (5)-(7) and the approximate deletion formulas in Equations (8)-(10), respectively. The influential cases identified by the two approaches are the same for both data sets; only the order of magnitude changes for the MIFDI data. Hence, instead of the one-by-one case-deletion formulas, the approximate case-deletion formulas are a suitable and appropriate way to identify influential cases for the Liu estimator.

In this article, I show that the Liu estimator user should not rely on influence measures obtained for the OLSE. Once the value of d is determined, influence measures should be computed for that d. If, after analyzing these indices, it is decided to delete one or more cases from the analysis, the whole situation has to be reassessed in terms of both influence and multicollinearity.

In this study, the Liu shrinkage parameter d is estimated first, and the Liu coefficients are then estimated for that value of d. Using these estimated quantities, the influential observations are identified. However, the value of the shrinkage parameter d depends on every observation, so in principle d should be re-estimated for every deleted case. This is a very difficult task, and the issue will be addressed in future research.

The main advantage of the deletion formulas in Section 4 is that, as in least squares, the estimator does not have to be recomputed every time a case is deleted. For a given value of d, all of the elements in (8), (9) and (10) are readily available from a single run of the Liu estimator. These measures are therefore particularly helpful for large data sets, since the deletion formulas provide computationally inexpensive approximate influence measures for the Liu estimator.

Although no conventional cutoff points have been introduced or developed for the Liu estimator global influence diagnostics (Cook's measures, DFFITS, leverage and the Liu estimator standardized residual), the index plot appears to be a practical and conventional procedure for disclosing influential cases. The lack of cutoff values for these influence measures remains a bottleneck. These are additional open issues for future research.

Consider the p × p matrix A = X'X + I and let x_{i}' be the i-th row of X. Note that A_{(i)} = A − x_{i}x_{i}' is the corresponding matrix with the i-th row of X removed. The inverse of A_{(i)} is

A_{(i)}^{-1} = A^{-1} + A^{-1}x_{i}x_{i}'A^{-1} / (1 − x_{i}'A^{-1}x_{i}).