
A frequent problem in estimating logistic regression models is failure of the likelihood maximization algorithm to converge. Although bias correction for maximum likelihood estimates of the logistic regression parameters is popular and extremely well established, the behaviour and properties of the maximum likelihood method itself are less investigated. The main aim of this paper is to examine the behaviour and properties of parameter estimates obtained with a bias-reduction technique. We focus on a method that uses a modified score function to reduce the bias of the maximum likelihood estimates. We also present new and illustrative simulation examples covering different sample sizes and different probabilities of the outcome variable.

Logistic regression methods are often used in the statistical analysis of dichotomous outcome variables. Logistic regression is a commonly applied procedure for describing the relationship between a binary outcome variable and a set of covariates. The general method of estimating the logistic regression parameters is maximum likelihood (ML). In a very general sense, the ML method yields values of the unknown parameters that maximize the probability of the observed data. A common problem with the ML method is failure to converge, which occurs when the maximum likelihood estimates (MLE) do not exist. Assessing the behaviour of the MLE for the logistic regression model is important, as the logistic model is widely used in medical statistics. Much of the work on the logistic regression model addresses this convergence problem, e.g. [

The goal of a logistic regression analysis is to find the best-fitting model describing the relationship between a dichotomous outcome and a set of covariates. [

The Model

Suppose now y_i ∼ binomial(m_i, π_i), where y_i, i = 1, …, n, is a response variable. Suppose that y_i ∈ {0, 1, …, m_i} and that π_i is related to a collection of covariates (x_{i1}, x_{i2}, …, x_{ip}) according to the equation

$$\log\left(\frac{\pi_i}{1-\pi_i}\right) = \sum_{j=1}^{p} \beta_j x_{ij} = \beta^T x_i \quad (1)$$

We consider the special case m_i = 1, so y_i ∼ binomial(1, π_i), where π_i is the probability of success for each i = 1, …, n. We also define η_i = β^T x_i so that

$$g(\pi_i) = \log\left(\frac{\pi_i}{1-\pi_i}\right) = \eta_i \quad (2)$$

and

$$\pi_i = \frac{\exp(\eta_i)}{1+\exp(\eta_i)} \quad (3)$$

Here g(·) is called the logit link function and η_i = Σ_{j=1}^{p} β_j x_{ij} = β^T x_i is the linear predictor.

Other link functions can also be used instead of the logit link, such as the probit link function

$$\eta = \Phi^{-1}(\pi) \iff \pi = \Phi(\eta)$$

and the complementary log-log link function

$$\eta = \log\left(-\log(1-\pi)\right) \iff \pi = 1 - \exp\left(-e^{\eta}\right)$$
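The three inverse links above can be sketched numerically as follows; the probit case uses the identity Φ(η) = (1 + erf(η/√2))/2, and the η values are just illustrative:

```python
# Sketch of the three inverse link functions mentioned above
# (logit, probit, complementary log-log), mapping eta to a probability pi.
import math

def logit_inv(eta):
    # pi = exp(eta) / (1 + exp(eta))
    return math.exp(eta) / (1.0 + math.exp(eta))

def probit_inv(eta):
    # pi = Phi(eta), the standard normal CDF, via the error function
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def cloglog_inv(eta):
    # pi = 1 - exp(-exp(eta))
    return 1.0 - math.exp(-math.exp(eta))

for eta in (-2.0, 0.0, 2.0):
    print(eta, logit_inv(eta), probit_inv(eta), cloglog_inv(eta))
```

All three map the real line to (0, 1); they differ mainly in how fast their tails approach 0 and 1.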

Fitting The Model

The logistic model with y_i ∼ binomial(m_i, π_i) and m_i = 1 can be fitted using the method of maximum likelihood. The first step is to construct the likelihood function, which is a function of the unknown parameters; we then choose the values of the parameters that maximize this function. The probability function of the model is

$$f(y_i) = \pi_i^{y_i}(1-\pi_i)^{1-y_i} \quad (4)$$

and the likelihood function is

$$L(\pi_i \mid y_i) = \prod_{i=1}^{n} f(y_i \mid \pi_i) \quad (5)$$

Since the observations are independent, the likelihood function is as follows:

$$L(\pi_i \mid y_i) = \prod_{i=1}^{n} \pi_i^{y_i}(1-\pi_i)^{1-y_i} \quad (6)$$

The maximum likelihood estimate of β is the value which maximizes the likelihood function. In general the log-likelihood function is easier to work with mathematically, and is

$$l = \sum_{i=1}^{n}\left[ y_i \log(\pi_i) + (1-y_i)\log(1-\pi_i) \right] \quad (7)$$
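As a quick check of this step, the following sketch (with illustrative y and π values of our own choosing, not from the paper) verifies numerically that the log of the product in (6) equals the sum in (7):

```python
# Numerical check that log of the likelihood product, eq. (6),
# equals the log-likelihood sum, eq. (7), on a tiny illustrative data set.
import math

y  = [1, 0, 1, 1, 0]
pi = [0.8, 0.3, 0.6, 0.9, 0.2]      # illustrative probabilities

# Likelihood as a product, eq. (6)
L = 1.0
for yi, p in zip(y, pi):
    L *= p**yi * (1 - p)**(1 - yi)

# Log-likelihood as a sum, eq. (7)
l = sum(yi * math.log(p) + (1 - yi) * math.log(1 - p) for yi, p in zip(y, pi))

print(L, math.log(L), l)            # log(L) and l agree
```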

Here we consider the logistic regression model with p = 2 coefficients, one of which is the general mean (intercept). So we have β_0 and β_1 such that

$$g(\pi_i) = \eta_i = \beta_0 + \beta_1 x_i \quad (8)$$

where x i is now a scalar covariate and

$$\pi_i = \frac{\exp(\beta_0 + \beta_1 x_i)}{1+\exp(\beta_0 + \beta_1 x_i)} \quad (9)$$

Therefore we can write the log-likelihood function as:

$$l(\beta_0, \beta_1) = \sum_{i=1}^{n}\left\{ y_i(\beta_0 + \beta_1 x_i) - \log\left[1 + \exp(\beta_0 + \beta_1 x_i)\right] \right\} \quad (10)$$

To estimate β_0 and β_1 we differentiate l(β_0, β_1) with respect to β_0 and β_1 respectively:

$$\frac{\partial l}{\partial \beta_0} = \sum_{i=1}^{n}\left[ y_i - \frac{\exp(\beta_0+\beta_1 x_i)}{1+\exp(\beta_0+\beta_1 x_i)} \right] = \sum_{i=1}^{n}(y_i - \pi_i) \quad (11)$$

$$\frac{\partial l}{\partial \beta_1} = \sum_{i=1}^{n}\left[ y_i x_i - \frac{x_i\exp(\beta_0+\beta_1 x_i)}{1+\exp(\beta_0+\beta_1 x_i)} \right] = \sum_{i=1}^{n}(y_i - \pi_i)x_i \quad (12)$$

Now we set ∂l/∂β_0 = 0 and ∂l/∂β_1 = 0, so the maximum likelihood estimates of β_0 and β_1 are the solutions of the following equations

$$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \pi_i \quad (13)$$

and

$$\sum_{i=1}^{n} y_i x_i = \sum_{i=1}^{n} \pi_i x_i \quad (14)$$

and will be denoted by β̂_0 and β̂_1. For logistic regression these two equations are nonlinear in β_0 and β_1, so a numerical method such as the Newton-Raphson method is needed for their solution.

The estimated parameters β̂ = (β̂_0, β̂_1)′ have an asymptotic distribution given by β̂ ∼ N(β, I(β)^{-1}), where I(β) is Fisher's information matrix, defined as

$$I(\beta) = -E\begin{pmatrix} \dfrac{\partial^2 l}{\partial \beta_0^2} & \dfrac{\partial^2 l}{\partial \beta_0 \partial \beta_1} \\ \dfrac{\partial^2 l}{\partial \beta_1 \partial \beta_0} & \dfrac{\partial^2 l}{\partial \beta_1^2} \end{pmatrix} \quad (15)$$

where the matrix is evaluated at the MLE. For logistic regression the estimated Fisher information matrix can be written as

$$I(\hat\beta) = \begin{pmatrix} \sum_{i=1}^{n} \hat\pi_i(1-\hat\pi_i) & \sum_{i=1}^{n} x_i\hat\pi_i(1-\hat\pi_i) \\ \sum_{i=1}^{n} x_i\hat\pi_i(1-\hat\pi_i) & \sum_{i=1}^{n} x_i^2\hat\pi_i(1-\hat\pi_i) \end{pmatrix} \quad (16)$$

where π̂_i = exp(η̂_i)/(1 + exp(η̂_i)) and η̂_i = β̂_0 + β̂_1 x_i. The variance of β̂ is approximated by Var(β̂) ≃ I(β̂)^{-1}.
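The Newton-Raphson scheme mentioned above, together with the variance approximation Var(β̂) ≃ I(β̂)^{-1}, can be sketched as follows; the simulated data, iteration count and zero starting values are our own illustrative choices, not specified in the paper:

```python
# Minimal Newton-Raphson sketch for the two-parameter model (8)-(9):
# iterate beta <- beta + I(beta)^{-1} U(beta), with the score vector from
# eqs. (11)-(12) and the Fisher information from eq. (16).
import numpy as np

def fit_logistic(x, y, n_iter=25):
    X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
    beta = np.zeros(2)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        U = X.T @ (y - pi)                      # score vector, eqs. (11)-(12)
        W = pi * (1.0 - pi)
        I = X.T @ (W[:, None] * X)              # Fisher information, eq. (16)
        beta = beta + np.linalg.solve(I, U)     # Newton-Raphson step
    cov = np.linalg.inv(I)                      # Var(beta_hat) ~ I(beta_hat)^{-1}
    return beta, cov

# Illustrative simulated data (true beta chosen by us for the demo).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
true_beta = np.array([-0.5, 1.0])
pi = 1.0 / (1.0 + np.exp(-(true_beta[0] + true_beta[1] * x)))
y = (rng.uniform(size=200) < pi).astype(float)

beta_hat, cov = fit_logistic(x, y)
print(beta_hat, np.sqrt(np.diag(cov)))          # estimates and standard errors
```

At convergence the score U(β̂) is numerically zero, which is exactly the condition in Equations (13) and (14).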

A problem occurs in estimating logistic regression models when the maximum likelihood estimates do not exist and one or more components of β̂ are infinite. One case where this problem occurs is when all of the observations have the same response. For example, suppose that m_i = 1 and all of the response variables equal zero, i.e., Σ_{i=1}^{n} y_i = 0. In this case the log-likelihood function is

$$l(\beta_0, \beta_1) = -\sum_{i=1}^{n} \log\left[1 + \exp(\beta_0 + \beta_1 x_i)\right] \quad (17)$$

Now, differentiating l(β_0, β_1) with respect to β_0 and β_1 and setting the derivatives equal to zero gives

$$\sum_{i=1}^{n} \pi_i = \sum_{i=1}^{n} \frac{\exp(\hat\beta_0+\hat\beta_1 x_i)}{1+\exp(\hat\beta_0+\hat\beta_1 x_i)} = 0 \quad (18)$$

and

$$\sum_{i=1}^{n} \pi_i x_i = \sum_{i=1}^{n} \frac{x_i\exp(\hat\beta_0+\hat\beta_1 x_i)}{1+\exp(\hat\beta_0+\hat\beta_1 x_i)} = 0 \quad (19)$$

The first equation has no solution, because it sets a sum of positive quantities equal to zero. To drive this sum toward zero we would need β_0 to become large and negative, i.e. to tend to −∞. However, if precisely one of the response variables equals 1, the resulting maximum likelihood equations become

$$\sum_{i=1}^{n} \pi_i = 1 \quad (20)$$

$$\sum_{i=1}^{n} \pi_i x_i = x_1 \quad (21)$$

where we have numbered the observations so that y_1 = 1. Here the maximum likelihood estimates exist and convergence of the MLE is achieved, because the two previous equations set sums of positive quantities equal to positive values. In the first equation, if the intercept is large and positive the sum exceeds one, and if it is large and negative the sum falls below one; hence finite parameter estimates exist that satisfy the equation.
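The divergence in the all-zero case is easy to see numerically. The sketch below (an illustrative intercept-only model of our own, not an example from the paper) applies Newton-Raphson when every response is zero; each step pushes β̂_0 further toward −∞ instead of converging:

```python
# Illustration of non-existence of the MLE when every response is zero:
# for an intercept-only logistic model, the Newton-Raphson update
# keeps decreasing beta_0 without bound.
import math

n = 20
y = [0.0] * n                           # every response equals zero
beta0 = 0.0
for step in range(30):
    pi = math.exp(beta0) / (1.0 + math.exp(beta0))
    score = sum(yi - pi for yi in y)    # score: sum(y_i) - sum(pi_i), cf. eq. (13)
    info = n * pi * (1.0 - pi)          # Fisher information for the intercept
    beta0 += score / info               # Newton-Raphson step
print(beta0)                            # large and negative, still decreasing
```

Each update subtracts roughly 1/(1 − π) from β̂_0, so the iteration never settles: the algorithm "converges" only in the sense of drifting toward −∞.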

Firth [

b(β)/n. As discussed above, in the case of small data sets it is not uncommon for β̂ to be infinite in some samples for logistic regression models [

$$\frac{\partial l(\beta)}{\partial \beta} = U(\beta) = 0 \quad (22)$$

[

$$U^*(\beta) = U(\beta) - I(\beta)\, b(\beta) \quad (23)$$

and the expected value of β ^ proposed by [

$$E(\hat\beta) = \beta + b(\hat\beta) + O(n^{-1}) \quad (24)$$

where

$$b(\beta) = \frac{2E\left(\dfrac{\partial l}{\partial \beta}\dfrac{\partial^2 l}{\partial \beta^2}\right) + E\left(\dfrac{\partial^3 l}{\partial \beta^3}\right)}{2\left\{E\left(\dfrac{\partial^2 l}{\partial \beta^2}\right)\right\}^2}$$

The variance of β̂ is again approximated by Var(β̂) ≃ I(β̂)^{-1}.

In this part we apply the modified score function to the simple logistic regression model. The O(n^{-1}) bias vector has the form b = (X^T W X)^{-1} X^T W ξ, as proposed by [

ξ_i = h_i(π_i − 1/2), and h_i is the ith diagonal element of the hat matrix

$$H = W^{1/2} X (X^T W X)^{-1} X^T W^{1/2},$$

where W = diag(π_i(1 − π_i)) and X is the design matrix. Then the modified score function is written as

$$U^* = U - X^T W \xi \quad (25)$$

In this case, the modified score function U* = (U_0*, U_1*) gives two equations

$$U_0^* = \sum_{i=1}^{n}\left[\left(y_i + \frac{h_i}{2}\right) - (1+h_i)\pi_i\right] = 0 \quad (26)$$

and

$$U_1^* = \sum_{i=1}^{n}\left[\left(y_i + \frac{h_i}{2}\right) - (1+h_i)\pi_i\right] x_i = 0 \quad (27)$$

These are used to estimate the parameters.

For further evaluation, we discuss the behaviour of the adjusted score function when all the observations have the same response, i.e. Σ_{i=1}^{n} y_i = 0. As a special case, suppose we have one explanatory variable x_i taking values 0 or 1. Before calculating the adjusted score function, we first calculate the form of h_i, the ith diagonal element of the hat matrix H:

$$h_i = \frac{\pi_i(1-\pi_i)\left(X_2 - 2x_i X_1 + x_i^2 X_0\right)}{\Delta} \quad (28)$$

where Δ = X_0 X_2 − X_1², X_0 = n_0 π_0(1 − π_0) + n_1 π_1(1 − π_1), X_1 = n_1 π_1(1 − π_1) and X_2 = n_1 π_1(1 − π_1), with n_0 and n_1 the numbers of observations having x equal to 0 and 1 respectively. Hence

$$h_0 = \frac{\pi_0(1-\pi_0)\left[n_1\pi_1(1-\pi_1)\right]}{\Delta} \quad (29)$$

and

$$h_1 = \frac{\pi_1(1-\pi_1)\left[n_0\pi_0(1-\pi_0)\right]}{\Delta} \quad (30)$$

Therefore, when we set the adjusted score function (U_0*, U_1*) = 0 with Σ_{i=1}^{n} y_i = 0, we have

$$U_1^* = \sum_{i=1}^{n}\left[\frac{h_i}{2} - (1+h_i)\pi_i\right] x_i = 0 \quad (31)$$

This gives

$$\frac{h_1}{2} - (1+h_1)\pi_1 = 0 \quad (32)$$

and

$$\pi_1 = \frac{h_1}{2(1+h_1)} \quad (33)$$

Now,

$$U_0^* = \sum_{i=1}^{n}\left[\frac{h_i}{2} - (1+h_i)\pi_i\right] = 0 \quad (34)$$

and so

$$\left[\frac{h_1 n_1}{2} - n_1(1+h_1)\pi_1\right] + \left[\frac{h_0 n_0}{2} - n_0(1+h_0)\pi_0\right] = 0 \quad (35)$$

Since the first bracket vanishes by (33), we get

$$\pi_0 = \frac{h_0}{2(1+h_0)} \quad (36)$$

Before calculating π_0 and π_1, we can compute h_0 and h_1 as follows. Let A = n_1 π_1(1 − π_1) and B = n_0 π_0(1 − π_0). Then X_0 = A + B, X_1 = A and X_2 = A, so we can write Δ as

$$\Delta = X_0 X_2 - X_1^2 = A^2 + AB - A^2 = AB = n_1\pi_1(1-\pi_1)\, n_0\pi_0(1-\pi_0) \quad (37)$$

Therefore, h 0 and h 1 can be written as

$$h_0 = \frac{\pi_0(1-\pi_0)\left[n_1\pi_1(1-\pi_1)\right]}{\left[n_1\pi_1(1-\pi_1)\right] n_0\pi_0(1-\pi_0)} = \frac{1}{n_0} \quad (38)$$

and

$$h_1 = \frac{\pi_1(1-\pi_1)\left[n_0\pi_0(1-\pi_0)\right]}{\left[n_1\pi_1(1-\pi_1)\right] n_0\pi_0(1-\pi_0)} = \frac{1}{n_1} \quad (39)$$

Then, we obtain

$$\pi_0 = \frac{h_0}{2(1+h_0)} = \frac{1/n_0}{2(1+1/n_0)} = \frac{1}{2(n_0+1)} \quad (40)$$

and

$$\pi_1 = \frac{h_1}{2(1+h_1)} = \frac{1/n_1}{2(1+1/n_1)} = \frac{1}{2(n_1+1)} \quad (41)$$

As this example with x = 0, 1 and Σ_{i=1}^{n} y_i = 0 shows, the parameter estimates are finite: the modified score function works well and the convergence problem does not arise.
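This worked example can be checked numerically. The sketch below implements the modified score equations (26)-(27); the update rule β ← β + (X^T W X)^{-1} U* is our assumption of a natural Newton-type scheme, not spelled out in the text. With all responses zero it recovers π_0 = 1/(2(n_0 + 1)):

```python
# Sketch of a bias-reduced (Firth-type) fit via the modified score
# equations (26)-(27), with h_i taken from the hat matrix H in the text.
import numpy as np

def fit_firth(x, y, n_iter=50):
    X = np.column_stack([np.ones_like(x), x])   # intercept plus one covariate
    beta = np.zeros(2)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = pi * (1.0 - pi)
        XtWX = X.T @ (W[:, None] * X)
        # h_i: diagonal of H = W^{1/2} X (X^T W X)^{-1} X^T W^{1/2}
        h = W * np.einsum('ij,jk,ik->i', X, np.linalg.inv(XtWX), X)
        # Modified score, eqs. (26)-(27): residual (y_i + h_i/2) - (1+h_i) pi_i
        U_star = X.T @ (y + h / 2.0 - (1.0 + h) * pi)
        beta = beta + np.linalg.solve(XtWX, U_star)   # Newton-type update (our choice)
    return beta

# All responses zero, x taking values 0 and 1 with n0 = n1 = 10.
x = np.array([0.0] * 10 + [1.0] * 10)
y = np.zeros(20)
beta = fit_firth(x, y)
pi0 = 1.0 / (1.0 + np.exp(-beta[0]))   # fitted probability at x = 0
print(beta, pi0)                        # pi0 = 1/(2(n0 + 1)) = 1/22
```

The fitted probability at x = 0 matches Equation (40) exactly, confirming that the modified score yields finite estimates where the ordinary MLE does not exist.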

The following discussion presents the simulation plan and the designs used to generate the data, in order to identify the effect of sample size and the proportion of events (the percentage of y = 1 or y = 0) on parameter estimation. We examine the precision of the estimation by calculating the variance of the parameters obtained by simulation for the two approaches, MLE and Firth's method, and compare these with I(β)^{-1} evaluated at the known values of β. The simulation study is designed as follows:

1) Three sample sizes have been used: n = 40, n = 120 and n = 500.

2) For each sample size we choose the x_i as draws from N(0, 1). The x values are fixed throughout the simulation.

3) We choose β_0 and β_1 to give three cases: set β_1 = 0.2 and adjust β_0 so that, averaged over the covariates, pr(y = 1) is approximately (a) 0.5, (b) 0.1, (c) 0.05.

4) For each sample size and set of parameter values we perform 100,000 simulations.

5) Two approaches are used to estimate the parameters: MLE and Firth's bias-reduced estimator.
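A heavily scaled-down sketch of this plan for the case n = 40 with pr(y = 1) ≈ 0.5: 1,000 replications instead of 100,000, MLE only, and an ad hoc cutoff |β̂| < 20 to discard non-converged fits (all of these simplifications are our assumptions, not the paper's protocol):

```python
# Scaled-down version of the simulation plan: compare the simulated
# Var(beta_hat) under MLE with I(beta)^{-1} evaluated at the true beta.
import numpy as np

def fit_mle(X, y, n_iter=25):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = pi * (1.0 - pi)
        beta = beta + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - pi))
    return beta

rng = np.random.default_rng(1)
n, beta_true = 40, np.array([0.0, 0.2])     # pr(y = 1) ~ 0.5 case
x = rng.normal(size=n)                      # covariates fixed across replications
X = np.column_stack([np.ones(n), x])
pi_true = 1.0 / (1.0 + np.exp(-(X @ beta_true)))

estimates = []
for _ in range(1000):
    y = (rng.uniform(size=n) < pi_true).astype(float)
    try:
        b = fit_mle(X, y)
    except np.linalg.LinAlgError:
        continue                            # separated sample: MLE infinite
    if np.all(np.abs(b) < 20):              # crude cutoff for non-converged fits
        estimates.append(b)
var_sim = np.var(np.array(estimates), axis=0)

W = pi_true * (1.0 - pi_true)
var_inf = np.diag(np.linalg.inv(X.T @ (W[:, None] * X)))
print(var_sim, var_inf, var_sim / var_inf)  # ratios should be near 1
```

With far fewer replications the ratios are noisier than the figures reported in the tables, but the qualitative pattern (ratios near one for pr(y = 1) ≈ 0.5) already appears.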

The simulation reports the accuracy of the estimation of Var(β̂) using the information matrix. We calculate Var(β̂_0) and Var(β̂_1) from the simulated values of β̂_0 and β̂_1, and also by evaluating I(β) at the known values of β. The results in the

As can be seen in

The variances of the parameters calculated by Firth's method were smaller than those calculated by MLE, and the ratio was in general close to 1. Moreover, the bias (β̂_F − β) was smaller.

In this part we use the same analysis as in the previous case with n = 500. The

| pr(y = 1) | parameter | Var (MLE) | Var (Firth) | Var (Inf) | Ratio L | Ratio F |
|---|---|---|---|---|---|---|
| 0.5 | β̂_0 | 0.00804 | 0.00813 | 0.00809 | 0.99 | 1.005 |
|  | β̂_1 | 0.00869 | 0.00862 | 0.00858 | 1.01 | 1.04 |
| 0.1 | β̂_0 | 0.02407 | 0.02359 | 0.02312 | 1.04 | 1.02 |
|  | β̂_1 | 0.02455 | 0.02439 | 0.02390 | 1.03 | 1.02 |
| 0.05 | β̂_0 | 0.04938 | 0.04560 | 0.04411 | 1.12 | 1.03 |
|  | β̂_1 | 0.04725 | 0.04656 | 0.04525 | 1.04 | 1.03 |

| pr(y = 1) | parameter | Bias (MLE) | Bias (Firth) |
|---|---|---|---|
| 0.5 | β̂_0 | −0.0005 | 0.0003 |
|  | β̂_1 | 0.0007 | −0.0001 |
| 0.1 | β̂_0 | −0.0210 | 0.0003 |
|  | β̂_1 | 0.0004 | 0.0003 |
| 0.05 | β̂_0 | −0.0400 | −0.0011 |
|  | β̂_1 | 0.0026 | 0.0007 |

results of the simulation are shown in

Here, for only 99,806 (99%) of the data sets was it possible to obtain finite, converged estimates of β_0 and β_1. Moreover, the variance of the parameters β̂_0 and β̂_1 is large. This is because, even though convergence is achieved when Σ_{i=1}^{n} y_i = 1, there are some very large negative values of β̂. In the other two cases, pr(y = 1) = 0.5 and 0.1, ML convergence was achieved in every simulation. We note that the ratio is nearly one, but somewhat higher than in the case n = 500. Firth's approach showed reasonable results: all cases achieved convergence. Moreover, the ratio was better than for the MLE approach, as was the bias β̂_F − β.

We used the same analysis as in the previous cases, now with n = 40. As can be seen in

| pr(y = 1) | parameter | Var (MLE) | Var (Firth) | Var (Inf) | Ratio L | Ratio F |
|---|---|---|---|---|---|---|
| 0.5 | β̂_0 | 0.034 | 0.0335 | 0.0337 | 1.04 | 0.99 |
|  | β̂_1 | 0.035 | 0.0331 | 0.0326 | 1.07 | 1.01 |
| 0.1 | β̂_0 | 0.125 | 0.1037 | 0.0951 | 1.32 | 1.09 |
|  | β̂_1 | 0.105 | 0.0942 | 0.0885 | 1.27 | 1.06 |
| 0.05 | β̂_0 | 230.94 | 0.2377 | 0.1811 | - | 1.31 |
|  | β̂_1 | 37.96 | 0.2003 | 0.1669 | - | 1.20 |

| pr(y = 1) | parameter | Bias (MLE) | Bias (Firth) |
|---|---|---|---|
| 0.5 | β̂_0 | −0.0005 | −0.0004 |
|  | β̂_1 | 0.0056 | −0.0002 |
| 0.1 | β̂_0 | −0.0845 | −0.0021 |
|  | β̂_1 | 0.0055 | 0.0008 |
| 0.05 | β̂_0 | −0.4480 | −0.0120 |
|  | β̂_1 | 0.090 | −0.0027 |

| pr(y = 1) | parameter | Var (MLE) | Var (Firth) | Var (Inf) | Ratio L | Ratio F |
|---|---|---|---|---|---|---|
| 0.5 | β̂_0 | 0.12184 | 0.1076 | 0.10780 | 1.13 | 0.99 |
|  | β̂_1 | 0.14889 | 0.1222 | 0.11766 | 1.22 | 1.04 |
| 0.1 | β̂_0 | 215.38 | 0.3702 | 0.29139 | - | 1.27 |
|  | β̂_1 | 81.22 | 0.3962 | 0.34304 | - | 1.16 |
| 0.05 | β̂_0 | 820.47 | 0.5080 | 0.55184 | - | 0.92 |
|  | β̂_1 | 295.64 | 0.5936 | 0.65309 | - | 0.91 |

| pr(y = 1) | parameter | Bias (MLE) | Bias (Firth) |
|---|---|---|---|
| 0.5 | β̂_0 | 0.0029 | −0.0001 |
|  | β̂_1 | 0.0220 | −0.0010 |
| 0.1 | β̂_0 | −1.2000 | −0.0150 |
|  | β̂_1 | 0.3830 | −0.014 |
| 0.05 | β̂_0 | −3.9900 | 0.0440 |
|  | β̂_1 | 1.288 | −0.0940 |

98,273 (98%) and 85,967 (86%) of the data sets achieved ML convergence when pr(y = 1) was 0.1 and 0.05, respectively. Convergence was achieved in every simulation only in the case pr(y = 1) = 0.5, where the ratio was close to one, though somewhat higher than in the previous cases. Moreover, we found the same problem as discussed for n = 120: the variance of the parameters β̂_0 and β̂_1 is large. However, with Firth's approach all data sets achieved convergence. Moreover, the ratio was better than for the MLE approach, and the bias β̂_F − β was smaller.

Attention has been directed in this work to the behaviour of the asymptotic estimation of the parameters by two methods, MLE and the bias-reduction technique, compared with the results from the information matrix. In the usual convergence problem, the modified score function showed appropriate behaviour, indicating that the bias can be removed from the MLE by the bias-reduction term. The asymptotic variance of the MLE can behave strangely: the results show that the variances of the parameters were large in some cases even though convergence was achieved, owing to some very large negative values of β̂, as shown in the results section. We can report that small sample size and the value of pr(y = 1) affect the behaviour of parameter estimation when using MLE; indeed, we found convergence problems for some combinations of sample size and pr(y = 1). Firth's approach gave good results, in that the data sets for all combinations of sample size and pr(y = 1) achieved convergence. Overall, the bias-reduction technique worked well and behaved reasonably in almost all cases investigated. Moreover, the convergence problem is not the only difficulty with the MLE: even when convergence is achieved, the variances of the parameter estimates can be large.

Badi, N.H.S. (2017) Properties of the Maximum Likelihood Estimates and Bias Reduction for Logistic Regression Model. Open Access Library Journal, 4: e3625. https://doi.org/10.4236/oalib.1103625