Open Journal of Statistics
Vol. 07, No. 01 (2017), Article ID: 74501, 21 pages
DOI: 10.4236/ojs.2017.71011

An Approximation Method for a Maximum Likelihood Equation System and Application to the Analysis of Accidents Data

Assi N’Guessan1, Issa Cherif Geraldo2, Bezza Hafidi3

1Laboratoire Paul Painlevé (UMR CNRS 8524), Université de Lille 1, Villeneuve d’Ascq, France

2Department of Mathematics and Computer Science, Université Catholique de l’Afrique de l’Ouest-Unité Universitaire du Togo, Lomé, Togo

3Department of Mathematics, Faculty of Science, University Ibn Zohr, Agadir, Morocco

Copyright © 2017 by authors and Scientific Research Publishing Inc.

This work is licensed under the Creative Commons Attribution International License (CC BY 4.0).

http://creativecommons.org/licenses/by/4.0/

Received: August 8, 2016; Accepted: February 25, 2017; Published: February 28, 2017

ABSTRACT

There exist many iterative methods for computing the maximum likelihood estimator but most of them suffer from one or several drawbacks such as the need to invert a Hessian matrix and the need to find good initial approximations of the parameters that are unknown in practice. In this paper, we present an estimation method without matrix inversion based on a linear approximation of the likelihood equations in a neighborhood of the constrained maximum likelihood estimator. We obtain closed-form approximations of solutions and standard errors. Then, we propose an iterative algorithm which cycles through the components of the vector parameter and updates one component at a time. The initial solution, which is necessary to start the iterative procedure, is automated. The proposed algorithm is compared to some of the best iterative optimization algorithms available in R and MATLAB software through a simulation study and applied to the statistical analysis of a road safety measure.

Keywords:

Constrained Maximum Likelihood, Partial Linear Approximation, Schur’s Complement, Iterative Algorithms, Road Safety Measure, Multinomial Model

1. Introduction

Approximation methods for Maximum Likelihood (ML) systems of equations are of interest and are motivated in this paper by the need to find estimation methods that are simple and easy to implement in the specific field of the statistical evaluation of the impact of a road safety measure. In practice, the estimation methods dedicated to this evaluation depend both on the nature of the measure and on the available data. Methods based on the combination of frequencies have received considerable attention [1] [2] [3] and, for most of them, one is faced with the estimation of unknown parameters that are often functionally dependent.

Many approximation methods for maximum likelihood estimation need to solve systems of linear or non-linear equations, with or without constraints [4]-[9]. Newton-Raphson's method and Fisher scoring are certainly the most commonly used approximation methods. They consist in updating the whole parameter vector using the iterative scheme:

$$\phi^{(k+1)} = \phi^{(k)} + \left[M\left(\phi^{(k)}\right)\right]^{-1} \nabla l\left(\phi^{(k)}\right) \quad (1)$$

where $\phi^{(k)}$ is the estimate of the parameter vector at step $k$, $l$ is the log-likelihood function, $\nabla l(\phi^{(k)})$ is the gradient of $l$ and $M(\phi^{(k)})$ is the observed or the expected information matrix. Both methods require the computation of second-order partial derivatives and a matrix inversion at each iteration, which can be very costly. Some authors such as Wang [10] have proposed quadratic approximations and extended Fisher scoring. The main point is to use quadratic approximations to the log-likelihood function and optimize these approximations within the constrained parameter space.
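To fix ideas, here is a minimal sketch of scheme (1) in Python; `grad` (score vector) and `info` (information matrix) are hypothetical user-supplied callables, not functions from the paper. The linear solve plays the role of the matrix inversion in (1) and is the costly step that the method proposed below avoids.

```python
import numpy as np

def newton_raphson(phi0, grad, info, tol=1e-8, max_iter=100):
    """Generic Newton-Raphson / Fisher-scoring iteration, Equation (1).

    grad(phi): score vector; info(phi): observed or expected information
    matrix. Both are assumed, user-supplied callables.
    """
    phi = np.asarray(phi0, dtype=float)
    for _ in range(max_iter):
        # Solving info(phi) * step = grad(phi) is the implicit matrix
        # inversion of Equation (1).
        step = np.linalg.solve(info(phi), grad(phi))
        phi = phi + step
        if np.linalg.norm(step) < tol:
            break
    return phi
```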

Within the framework of crash data analysis, different iterative estimation methods have been proposed [2] [11] . For example, Mkhadri et al. [11] propose a Minorization-Maximization (MM) algorithm for the maximum likelihood estimation of the parameter vector of a multinomial distribution modelling crash data. Their proposed MM algorithm cycles through the components of the parameter vector and updates one component at a time which leads to closed-form expressions of the parameters. They claim that their MM algorithm is simple to implement without any matrix inversion and constraints are integrated easily.

Despite the above advantages, the choice of the starting value $\phi^{(0)}$ remains a major issue since a value of $\phi^{(0)}$ relatively far from the true unknown value of the parameter vector can lead to erroneous solutions or to non-convergence. In addition, it must be noted that obtaining explicit expressions of standard errors is generally not easy.

In this paper, we present an estimation method without matrix inversion based on a linear approximation of the likelihood equations in a neighborhood of the constrained maximum likelihood estimator. We obtain closed-form approximations of solutions and standard errors. Then, we propose a partial linear approximation (PLA) algorithm which cycles through the components of the vector parameter and updates one component at a time. The initial solution is automated and standard errors are obtained in closed form. The PLA is compared to some of the best algorithms available in R and MATLAB software through a simulation study and applied to real crash data.

The remainder of the paper is organized as follows. Section 2 is devoted to the statistical model and the main assumptions used to get closed-form approximations and standard errors. The proposed estimation method and the method for computing standard errors are presented in Section 3. The general framework of the proposed algorithm is also described. In Section 4, we give an illustration of our results using a crash data model. The numerical performance of the proposed algorithm is examined in Section 5 through a simulation study while a real-data application is given in Section 6. The appendix is devoted to the technical details of the main results.

2. Statistical Model and Main Assumptions

2.1. Statistical Model

Let $Y = (Y_{11}, \ldots, Y_{1r}, Y_{21}, \ldots, Y_{2r})$ be a random vector with $2r$ $(r > 1)$ components and $\pi(\phi) = (\pi_{11}(\phi), \ldots, \pi_{1r}(\phi), \pi_{21}(\phi), \ldots, \pi_{2r}(\phi))$ be a vector of probabilities such that

$$\sum_{i=1}^{2} \sum_{j=1}^{r} \pi_{ij}(\phi) = 1$$

where $\phi$ is a parameter vector. It is assumed that the vector $Y$ has the multinomial distribution $\mathcal{M}(n; \pi(\phi))$ where $n > 0$ is a known integer. The basic principle of the multinomial distribution $\mathcal{M}(n; \pi(\phi))$ consists in distributing $n$ items in $2r$ categories or classes ($2r$ being the number of components of the vector $\pi(\phi)$). The probability for an item to fall in a class is called a class probability, and the class probabilities sum to 1. Here, the class probabilities $\pi_{ij}(\phi)$ depend on the unknown parameter vector $\phi$.
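As a quick illustration (with made-up class probabilities, not values from the paper), drawing from $\mathcal{M}(n; \pi)$ distributes $n$ items among the classes:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.1, 0.2, 0.3, 0.4])  # 2r = 4 illustrative class probabilities
y = rng.multinomial(100, pi)         # distributes n = 100 items into the classes
assert y.sum() == 100                # class counts always sum to n
```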

Given a vector of integers $y = (y_{11}, \ldots, y_{1r}, y_{21}, \ldots, y_{2r})$ such that

$$\sum_{i=1}^{2} \sum_{j=1}^{r} y_{ij} = n,$$

the probability function related to $\mathcal{M}(n; \pi(\phi))$ is defined by

$$f(y) = \frac{n!}{\prod_{i=1}^{2} \prod_{j=1}^{r} y_{ij}!} \prod_{i=1}^{2} \prod_{j=1}^{r} \left(\pi_{ij}(\phi)\right)^{y_{ij}}. \quad (2)$$

2.2. ML Estimation and Main Assumptions

Assumption 1. The parameter vector $\phi$ is partitioned as $\phi = (\theta, \beta)$ where $\theta > 0$ is a real parameter and $\beta = (\beta_1, \ldots, \beta_r)^T$ is a vector with $\beta_j > 0$ for all $j = 1, \ldots, r$.

Assumption 2. The unknown vector $\phi$ is subject to a linear constraint $C(\phi) = 0$ where $C$ is a continuously differentiable function from $\mathbb{R}^{r+1}$ to $\mathbb{R}$.

Let $y = (y_{11}, \ldots, y_{1r}, y_{21}, \ldots, y_{2r}) \in \mathbb{N}^{2r}$ be a vector of observed data and $l(\phi)$ be the logarithm of the probability density function $f(y)$ defined by Equation (2). The constrained maximum likelihood estimator $\hat\phi = (\hat\theta, \hat\beta)$ is the solution to the optimization problem

$$\text{Maximize } l(\phi) \text{ subject to } C(\phi) = 0. \quad (3)$$

Problem (3) is equivalent to the maximization of the function

$$L(\phi, \lambda) = l(\phi) - \lambda C(\phi) \quad (4)$$

where $\lambda$ is the Lagrange multiplier. The information matrix linked to the constrained maximum likelihood estimator $\hat\phi$ is

$$\Gamma_\phi = \begin{bmatrix} J_\phi & C_\phi \\ C_\phi^T & 0 \end{bmatrix}, \qquad J_\phi = \begin{bmatrix} \tau_\phi & U_\phi^T \\ U_\phi & B_\phi \end{bmatrix} \quad (5)$$

where $C_\phi = (\partial C/\partial\theta, \partial C/\partial\beta_1, \ldots, \partial C/\partial\beta_r)^T \in \mathbb{R}^{1+r}$, $J_\phi \in \mathbb{R}^{(1+r) \times (1+r)}$,

$$\tau_\phi = -E\left(\frac{\partial^2 l}{\partial\theta^2}\right), \qquad U_\phi = -E\left(\frac{\partial^2 l}{\partial\theta\,\partial\beta_1}, \ldots, \frac{\partial^2 l}{\partial\theta\,\partial\beta_r}\right)^T$$

and $B_\phi$ is the $r \times r$ matrix whose entries are $-E(\partial^2 l/\partial\beta_m\,\partial\beta_j)$, $m, j = 1, \ldots, r$. We also assume that the following conditions are verified:

Assumption 3. $\partial C/\partial\theta = 0$ and $\langle C_\beta, \beta \rangle = \kappa \neq 0$ where $\langle \cdot, \cdot \rangle$ is the inner product, $\kappa$ is a constant and $C_\beta = (\partial C/\partial\beta_1, \ldots, \partial C/\partial\beta_r)^T \in \mathbb{R}^r$;

Assumption 4. For any $\theta > 0$, there exists a non-singular $r \times r$ matrix $\Omega_{\hat\theta,y}$ such that $C_{\hat\beta}^T \Omega_{\hat\theta,y}^{-1} C_{\hat\beta} > 0$ and the non-linear system

$$\left(\frac{\partial l}{\partial\beta}\right)_{\hat\phi} - \frac{\left\langle \hat\beta, (\partial l/\partial\beta)_{\hat\phi} \right\rangle}{\left\langle \hat\beta, C_{\hat\beta} \right\rangle}\, C_{\hat\beta} = 0_r$$

is approximated by the linear system

$$\begin{bmatrix} \Omega_{\hat\theta,y} & C_{\hat\beta} \\ C_{\hat\beta}^T & 0 \end{bmatrix} \begin{bmatrix} \hat\beta \\ 0 \end{bmatrix} = \begin{bmatrix} D_y \\ \kappa \end{bmatrix}$$

where $0_r = (0, \ldots, 0)^T \in \mathbb{R}^r$ and $D_y$ is an $r \times 1$ vector whose components are obtained from $y$.

Assumption 5. There exists a function $g: \mathbb{R}^r \to \mathbb{R}$ such that the equation $(\partial l/\partial\theta)_{\hat\phi} = 0$ is equivalent to $\hat\theta = g(\hat\beta)$.

Assumption 6. There exist two strictly positive real numbers $a_{n,\phi}$ and $b_{n,\phi}$, a non-singular $r \times r$ diagonal matrix $\Sigma_\phi$ and a vector $V_\phi \in \mathbb{R}^r$ such that $\tau_\phi = a_{n,\phi}^{-1} b_{n,\phi}^2$, $B_\phi = a_{n,\phi}(\Sigma_\phi + V_\phi V_\phi^T)$ and $U_\phi = b_{n,\phi} V_\phi$.

Assumption 3 specifies the form of the function $C$; in particular, $C$ is a function of the sub-vector $\beta$ only. Assumptions 4 and 5 enable us to get $\hat\beta$ from $\hat\theta$ and conversely. Assumption 6 enables us to transform the Fisher information matrix in order to use classical results on matrix inversion with the Schur complement [17].

3. The Estimation Method

3.1. Partial Linear Approximation Principle

The general problem of finding the constrained maximum likelihood estimator has been discussed by many authors [12] [13] [14]. The classical approach is based on a Newton-type algorithm and computes all the components of $\hat\phi$ at once. Except for a few simple cases, it is generally not possible to get explicit expressions of the components of $\hat\phi$. One can show the following lemma (we refer the reader to the appendix for a proof).

Lemma 1. The constrained maximum likelihood estimator $\hat\phi$, provided it exists, is a solution to the non-linear system

$$\left(\frac{\partial l}{\partial\theta}\right)_{\hat\phi} = 0 \quad \text{and} \quad \left(\frac{\partial l}{\partial\beta}\right)_{\hat\phi} - \frac{\left\langle \hat\beta, (\partial l/\partial\beta)_{\hat\phi} \right\rangle}{\left\langle \hat\beta, C_{\hat\beta} \right\rangle}\, C_{\hat\beta} = 0_r \quad (6)$$

where $0_r = (0, \ldots, 0)^T \in \mathbb{R}^r$.

Our approach consists in splitting Equation (6) into two parts: one concerning the first component of $\hat\phi$ and the other concerning the sub-vector $\hat\beta$.

Theorem 1. Under Assumptions 3-5, the constrained MLE $\hat\phi = (\hat\theta, \hat\beta)$ is given by:

$$\hat\theta = g(\hat\beta)$$

$$\hat\beta = \Omega_{\hat\theta,y}^{-1} D_y.$$

The non-obvious part of the proof consists in the determination of $\hat\beta$ by inverting the $(r+1) \times (r+1)$ matrix linked to the linear system in Assumption 4. This result, based on the inversion of partitioned matrices, will not be demonstrated in this paper. We refer the reader to classical papers on the Schur complement [15] [16].

From Theorem 1, it is seen that the MLE $\hat\phi = (\hat\theta, \hat\beta)$ is a fixed point of the map from $\mathbb{R}^{r+1}$ to itself defined by $(\theta, \beta) \mapsto (g(\beta), \Omega_{\theta,y}^{-1} D_y)$. We can then build an iterative algorithm to estimate $\phi$. The classical fixed-point method, which consists in simultaneously updating $\hat\theta$ and $\hat\beta$, may be hard to implement because of the link between $\hat\theta$ and $\hat\beta$. We propose instead to alternate between updating $\hat\theta$ holding $\hat\beta$ fixed and vice versa. Starting from a given $\theta^{(0)}$, we compute $\beta^{(0)}$, then $\theta^{(1)}$ and $\beta^{(1)}$, and so on. The process is repeated until a stopping criterion is satisfied. For example, we can stop the iterations when successive values of the log-likelihood satisfy the condition $|l(\phi^{(k+1)}) - l(\phi^{(k)})| < \epsilon$ where $\epsilon > 0$.
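The following sketch makes the alternating scheme explicit; `g`, `solve_beta` and `loglik` are placeholders for the model-specific quantities of Theorem 1, not functions defined in the paper:

```python
import numpy as np

def alternating_fixed_point(theta0, g, solve_beta, loglik, eps=1e-5, max_iter=500):
    """Alternate theta <- g(beta) and beta <- Omega(theta)^{-1} D_y."""
    theta = theta0
    beta = solve_beta(theta)            # beta^(0)
    old = loglik(theta, beta)
    for _ in range(max_iter):
        theta = g(beta)                 # update theta holding beta fixed
        beta = solve_beta(theta)        # update beta holding theta fixed
        new = loglik(theta, beta)
        if abs(new - old) < eps:        # stopping criterion on the log-likelihood
            break
        old = new
    return theta, beta
```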

The estimation process may be completed by the computation of standard errors with Theorem 2 below.

Theorem 2. Under Assumptions 3-6, the asymptotic variances of the components of $\hat\phi$ are

$$\hat\sigma^2(\hat\theta) = a_{n,\hat\phi}\, b_{n,\hat\phi}^{-2} \left(1 + \|V_{\hat\phi}\|^2_{\Sigma_{\hat\phi}^{-1}} - \|C_{\hat\beta}\|^{-2}_{\Sigma_{\hat\phi}^{-1}}\, \xi_{\hat\phi}^2 \right) \quad (7)$$

and

$$\hat\sigma^2(\hat\beta_j) = a_{n,\hat\phi}^{-1} \left( \Sigma^{-1}_{\hat\phi,j} - \|C_{\hat\beta}\|^{-2}_{\Sigma_{\hat\phi}^{-1}} \times \left( \Sigma^{-1}_{\hat\phi,j} \times C_{\hat\beta,j} \right)^2 \right), \quad j = 1, \ldots, r \quad (8)$$

where $\|x\|^2_{\Sigma_\phi^{-1}} = x^T \Sigma_\phi^{-1} x$, $\xi_\phi = V_\phi^T \Sigma_\phi^{-1} C_\beta > 0$ and the real values $\Sigma^{-1}_{\phi,j}$ (resp. $C_{\beta,j}$), $j = 1, \ldots, r$, are the diagonal elements (resp. components) of the matrix $\Sigma_\phi^{-1}$ (resp. of the vector $C_\beta$).

The proof is given in the appendix. It stems from the results of N'Guessan and Langrand [17] .

3.2. General Framework of the Partial Linear Approximation Algorithm

Algorithm 1 (The partial linear approximation algorithm).

Step 0 (Initialization). Given $\theta^{(0)}$, $\epsilon > 0$ and $D_y$, compute $\beta^{(0)} = \Omega_{\theta^{(0)},y}^{-1} D_y$.

Step 1 (Loop for computing $\hat\phi$). For a given $k \geq 0$,

a) Compute $\theta^{(k+1)} = g(\beta^{(k)})$ and $\beta^{(k+1)} = \Omega_{\theta^{(k+1)},y}^{-1} D_y$.

b) If $|l(\phi^{(k+1)}) - l(\phi^{(k)})| \geq \epsilon$, then replace $k$ by $k+1$ and return to Step 1.

Else, set $\hat\theta = \theta^{(k+1)}$, $\hat\beta = \beta^{(k+1)}$ and go to Step 2.

Step 2 (Computation of standard errors)

a) Compute $\hat\sigma^2(\hat\theta) = a_{n,\hat\phi}\, b_{n,\hat\phi}^{-2} \left(1 + \|V_{\hat\phi}\|^2_{\Sigma_{\hat\phi}^{-1}} - \|C_{\hat\beta}\|^{-2}_{\Sigma_{\hat\phi}^{-1}}\, \xi_{\hat\phi}^2\right)$.

b) For $j = 1, \ldots, r$, compute $\hat\sigma^2(\hat\beta_j) = a_{n,\hat\phi}^{-1}\left(\Sigma^{-1}_{\hat\phi,j} - \|C_{\hat\beta}\|^{-2}_{\Sigma_{\hat\phi}^{-1}}\left(\Sigma^{-1}_{\hat\phi,j}\, C_{\hat\beta,j}\right)^2\right)$.

The aim of this paper is not to conduct a theoretical study of the convergence of the proposed algorithm. We rather focus on the numerical aspect of this convergence through an application model. Nevertheless, we note that the estimation of $\hat\phi$ using our algorithm does not require any matrix inversion. One can therefore expect the computation of the constrained maximum likelihood estimator $\hat\phi$ to be faster.

4. Application to the Combination of Crash Data

4.1. Statistical Model

We apply the above algorithm to estimate the parameters of a statistical model used to assess the effect of a road safety measure applied to an experimental site presenting $r > 0$ mutually exclusive accident types (fatal accidents, seriously injured people, slightly injured people, material damage, etc.) over a fixed period. Let us consider the random vector $Y = (Y_{11}, \ldots, Y_{1r}, Y_{21}, \ldots, Y_{2r})$ where $Y_{1j}$ and $Y_{2j}$ ($j = 1, \ldots, r$) respectively represent the number of accidents of type $j$ registered on the experimental site before and after the application of the road safety measure. In order to take into account some external factors (such as traffic flow, speed limit variation, weather conditions, regression-to-the-mean effect, etc.), the experimental site is linked to a control area where the safety measure was not directly applied. The accident data for the control area over the same period are given by the non-random vector $Z = (z_1, \ldots, z_r)^T$ where $z_j$ denotes the ratio of the number of accidents of type $j$ registered in the period after to the number of accidents of type $j$ registered in the period before. Following N'Guessan et al. [18], we assume that the vector $Y$ has the multinomial distribution

$$Y \sim \mathcal{M}(n, \pi(\phi))$$

where $n$ is the total number of crashes recorded at the experimental site and $\pi = (\pi_{11}, \ldots, \pi_{1r}, \pi_{21}, \ldots, \pi_{2r})$ is the vector of class probabilities whose components are

$$\pi_{ij}(\phi) = \begin{cases} \dfrac{\beta_j}{1 + \theta \sum_{m=1}^{r} z_m \beta_m} & \text{if } i = 1;\; j = 1, \ldots, r \\[2ex] \dfrac{\theta \beta_j \sum_{m=1}^{r} z_m \beta_m}{1 + \theta \sum_{m=1}^{r} z_m \beta_m} & \text{if } i = 2;\; j = 1, \ldots, r. \end{cases} \quad (9)$$

By construction, the parameter vector $\phi = (\theta, \beta)$ of this model satisfies the conditions $\theta > 0$, $\beta_j > 0$ and the linear constraint

$$C(\phi) = 0, \quad \text{with } C(\phi) = \langle 1_r, \beta \rangle - 1 \quad (10)$$

where $1_r = (1, \ldots, 1)^T \in \mathbb{R}^r$. The scalar $\theta$ denotes the unknown average effect of the road safety measure while each $\beta_j$ ($j = 1, \ldots, r$) denotes the risk of having an accident of type $j$ on a site having the same characteristics as the experimental site. This model is a special case of the multinomial model proposed by N'Guessan et al. [18], which was applied simultaneously to several sites.

4.2. Cyclic Estimation of the Average Effect and the Different Accident Risks

The log-likelihood is specified, up to an additive constant, by

$$l(\phi) = \sum_{m=1}^{r} \left[ y_{\cdot m} \log(\beta_m) + y_{2m} \log(\theta) - y_{\cdot m} \log\left(1 + \theta \sum_{j=1}^{r} z_j \beta_j\right) + y_{2m} \log\left(\sum_{j=1}^{r} z_j \beta_j\right) \right] \quad (11)$$

where $y_{\cdot m} = y_{1m} + y_{2m}$. Different iterative methods can be used to compute the constrained MLE $\hat\phi$. Most of them look for stationary points of the Lagrangian

$$L(\phi, \lambda) = l(\phi) - \lambda \left( \sum_{m=1}^{r} \beta_m - 1 \right).$$

N'Guessan et al. [18] showed that a stationary point of $L(\phi, \lambda)$ must be a solution to the following system of non-linear equations:

$$\begin{cases} \displaystyle\sum_{m=1}^{r} \left( y_{2m} - y_{\cdot m}\, \frac{\hat\theta \hat E(Z)}{1 + \hat\theta \hat E(Z)} \right) = 0 \\[2ex] y_{\cdot j} - \dfrac{n \hat\beta_j (1 + \hat\theta z_j)}{1 + \hat\theta \hat E(Z)} - \dfrac{y_{2\cdot}\, \hat\beta_j (\hat E(Z) - z_j)}{\hat E(Z)} = 0, \quad j = 1, \ldots, r \\[1ex] \hat\theta > 0, \quad \hat\beta_j > 0, \quad j = 1, \ldots, r \end{cases} \quad (12)$$

where $y_{2\cdot} = \sum_{m=1}^{r} y_{2m}$ and $\hat E(Z) = \sum_{m=1}^{r} \hat\beta_m z_m$.

The main idea proposed in this paper consists in neglecting the term $[y_{2\cdot}\, \hat\beta_j (\hat E(Z) - z_j)]/\hat E(Z)$ so that we can write the remaining equations

$$y_{\cdot j} - \frac{n \hat\beta_j (1 + \hat\theta z_j)}{1 + \hat\theta \hat E(Z)} = 0, \quad j = 1, \ldots, r$$

as the linear system of equations

$$\Omega_{\hat\theta,y}\, \hat\beta = D_y \quad (13)$$

where $\Omega_{\hat\theta,y} = M_{\hat\theta} - \hat\theta D_y Z^T$, $M_{\hat\theta} = \mathrm{diag}(1 + \hat\theta z_1, \ldots, 1 + \hat\theta z_r)$ is a diagonal $r \times r$ matrix and

$$D_y = \left( \frac{y_{11} + y_{21}}{n}, \ldots, \frac{y_{1r} + y_{2r}}{n} \right)^T \in \mathbb{R}^r.$$

Drawing our inspiration from the work of N'Guessan [19], we show that the linear system (13) has the vector $\hat\beta = \left(1 - \hat\theta Z^T M_{\hat\theta}^{-1} D_y\right)^{-1} M_{\hat\theta}^{-1} D_y$ as its unique solution. One shows that

$$1 - \hat\theta Z^T M_{\hat\theta}^{-1} D_y = 1 - \frac{1}{n} \sum_{m=1}^{r} \frac{\hat\theta z_m y_{\cdot m}}{1 + \hat\theta z_m}.$$

The components of the constrained MLE $\hat\phi$ can then be computed as follows.

Corollary 1. The components of $\hat\phi = (\hat\theta, \hat\beta)$ are given by

$$\hat\theta = \frac{\sum_{m=1}^{r} y_{2m}}{\left( \sum_{m=1}^{r} \hat\beta_m z_m \right) \times \left( \sum_{m=1}^{r} y_{1m} \right)} \quad (14)$$

$$\hat\beta_j = \frac{1}{\Delta_n(\hat\theta)} \times \frac{1}{1 + \hat\theta z_j} \times \frac{y_{\cdot j}}{n}, \quad j = 1, \ldots, r \quad (15)$$

where $\Delta_n(\hat\theta) = 1 - \dfrac{1}{n} \displaystyle\sum_{m=1}^{r} \dfrac{\hat\theta z_m y_{\cdot m}}{1 + \hat\theta z_m}$.
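As a quick numerical check (with made-up counts, not data from the paper), the closed form (15) can be verified against a direct solve of the linear system (13); the Sherman-Morrison identity behind Theorem 1 guarantees the two coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
r, n, theta = 3, 200, 0.6
z = rng.uniform(0.5, 2.5, r)
y_dot = rng.multinomial(n, np.ones(r) / r)      # illustrative counts y_{.j}
D = y_dot / n
M = np.diag(1.0 + theta * z)
Omega = M - theta * np.outer(D, z)              # Omega = M - theta * D_y Z^T
beta_direct = np.linalg.solve(Omega, D)
Delta = 1.0 - (theta * z * y_dot / (1.0 + theta * z)).sum() / n
beta_closed = D / (Delta * (1.0 + theta * z))   # Equation (15)
assert np.allclose(beta_direct, beta_closed)
assert np.isclose(beta_closed.sum(), 1.0)       # the constraint holds automatically
```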

Applying Theorem 2, we can give the asymptotic variances of the components of the constrained MLE $\hat\phi$.

Corollary 2. The asymptotic variances of the components of the constrained maximum likelihood estimator $\hat\phi$ are given by

$$\hat\sigma^2(\hat\theta) = \frac{\hat\theta}{n \hat E(Z)} + \frac{1}{n} \left( 1 + \frac{\hat E(Z^2)}{(\hat E(Z))^2} \right) \hat\theta^2 + \frac{\hat E(Z)}{n}\, \hat\theta^3 \quad (16)$$

$$\hat\sigma^2(\hat\beta_j) = \frac{1}{n}\, \hat\beta_j (1 - \hat\beta_j), \quad j = 1, \ldots, r \quad (17)$$

where $\hat E(Z) = \sum_{m=1}^{r} \hat\beta_m z_m$ and $\hat E(Z^2) = \sum_{m=1}^{r} \hat\beta_m z_m^2$.

The technical proof uses the Schur complement approach and stems from [17]. One shows (see the appendix) that the elements of the asymptotic information matrix $\Gamma_\phi$ linked to $\hat\phi$ are

$$C_\phi = (0, 1, \ldots, 1)^T \in \mathbb{R}^{1+r}, \quad \tau_\phi = \frac{\gamma^2 E(Z)}{n\theta}, \quad U_\phi = \frac{\gamma^2}{n} Z, \quad B_\phi = \gamma \left( \Sigma_\phi + V_\phi V_\phi^T \right) \quad (18)$$

where

$$V_\phi = \left( \frac{\theta\gamma}{n E(Z)} \right)^{1/2} Z, \quad \Sigma_\phi = \frac{n}{\gamma} \times \mathrm{diag}\left( \frac{1}{\beta_1}, \ldots, \frac{1}{\beta_r} \right), \quad \gamma = \frac{n}{1 + \theta E(Z)}. \quad (19)$$

Setting $a_{n,\phi} = \gamma$ and $b_{n,\phi}^2 = (\gamma^3 E(Z))/(n\theta)$, we show that Assumption 6 is satisfied. We then apply Theorem 2 to get the results of Corollary 2.

4.3. Practical Aspect of the Partial Linear Approximation Algorithm

Algorithm 2.

Step 0 (Initialization). Given $\epsilon > 0$ and $D_y$, set $\theta^{(0)} = 0$ and compute $\beta^{(0)} = D_y$.

Step 1 (Loop for computing $\hat\phi$). For a given $k \geq 0$,

a) Compute $\theta^{(k+1)} = \dfrac{\sum_{m=1}^{r} y_{2m}}{\left( \sum_{m=1}^{r} \beta_m^{(k)} z_m \right) \times \left( \sum_{m=1}^{r} y_{1m} \right)}$

b) For $j = 1, \ldots, r$, compute

$$\beta_j^{(k+1)} = \frac{1}{1 - \dfrac{1}{n} \displaystyle\sum_{m=1}^{r} \dfrac{\theta^{(k+1)} z_m y_{\cdot m}}{1 + \theta^{(k+1)} z_m}} \times \frac{1}{1 + \theta^{(k+1)} z_j} \times \frac{y_{\cdot j}}{n}.$$

c) If $|l(\phi^{(k+1)}) - l(\phi^{(k)})| \geq \epsilon$, then replace $k$ by $k+1$ and return to Step 1.

Else, set $\hat\theta = \theta^{(k+1)}$, $\hat\beta = \beta^{(k+1)}$ and go to Step 2.

Step 2 (Computation of standard errors)

a) Compute $\hat\sigma^2(\hat\theta) = \dfrac{\hat\theta}{n \hat E(Z)} + \dfrac{1}{n} \left( 1 + \dfrac{\hat E(Z^2)}{(\hat E(Z))^2} \right) \hat\theta^2 + \dfrac{\hat E(Z)}{n}\, \hat\theta^3$.

b) For $j = 1, \ldots, r$, compute $\hat\sigma^2(\hat\beta_j) = \dfrac{1}{n}\, \hat\beta_j (1 - \hat\beta_j)$.
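A compact sketch of Algorithm 2 follows, assuming before/after counts `y1`, `y2` and control ratios `z`; the function name and interface are ours, not the paper's. Note that the loop involves only scalar and vector operations, with no matrix inversion.

```python
import numpy as np

def pla_crash(y1, y2, z, eps=1e-5, max_iter=500):
    """Minimal sketch of Algorithm 2 for the crash model."""
    y1, y2, z = (np.asarray(a, dtype=float) for a in (y1, y2, z))
    y_dot = y1 + y2                    # y_{.j} = y_{1j} + y_{2j}
    n = y_dot.sum()
    beta = y_dot / n                   # Step 0: theta^(0) = 0 gives beta^(0) = D_y

    def loglik(theta, beta):           # log-likelihood (11), up to an additive constant
        E = beta @ z
        return ((y_dot * np.log(beta)).sum() + y2.sum() * np.log(theta)
                - n * np.log(1.0 + theta * E) + y2.sum() * np.log(E))

    old = -np.inf                      # l(phi^(0)) is -inf when theta^(0) = 0
    for _ in range(max_iter):
        theta = y2.sum() / ((beta @ z) * y1.sum())        # Step 1 a), Equation (14)
        Delta = 1.0 - (theta * z * y_dot / (1.0 + theta * z)).sum() / n
        beta = y_dot / (n * Delta * (1.0 + theta * z))    # Step 1 b), Equation (15)
        new = loglik(theta, beta)
        if abs(new - old) < eps:                          # Step 1 c)
            break
        old = new
    E1, E2 = beta @ z, beta @ z**2                        # Step 2: asymptotic variances
    var_theta = theta / (n * E1) + (1.0 + E2 / E1**2) * theta**2 / n + E1 * theta**3 / n
    var_beta = beta * (1.0 - beta) / n
    return theta, beta, var_theta, var_beta
```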

The partial linear approximation algorithm for computing the constrained maximum likelihood estimator $\hat\phi$ of the model presented in subsection 4.1 stems from the cyclic algorithm of N'Guessan and Geraldo [20]. The PLA proceeds as follows: Step 1 estimates $\hat\phi$ by alternating between its two components $\hat\theta$ and $\hat\beta$. To start the procedure, we initialize $\theta^{(0)}$, then compute $\beta_j^{(0)}$ ($j = 1, \ldots, r$) and define $\phi^{(0)} = (\theta^{(0)}, \beta^{(0)})$. But we could also initialize $\beta^{(0)}$ using the problem's data and get $\theta^{(0)}$ from it. The process is repeated until the stopping criterion is satisfied. We note that our algorithm is automated and can be started as soon as the problem's data $D_y$ are entered.

We can also note that the second partial derivatives of the log-likelihood function are no longer used in our algorithm. The aim of this paper is not to carry out a theoretical study of the convergence of the proposed algorithm. We rather focus on the numerical aspect of this convergence using simulated datasets.

5. Numerical Results with Simulated Datasets

5.1. Data Simulation Principle

For a given value of $r$ (the number of crash types), we generate the components of the vector $Z = (z_1, \ldots, z_r)^T$ from a uniform distribution $\mathcal{U}\left(\frac{1}{2}, \frac{5}{2}\right)$. The true value of $\theta$, denoted $\theta^0$, is fixed and the true value of $\beta$, denoted $\beta^0 = (\beta_1^0, \ldots, \beta_r^0)^T$ and such that $\sum_{j=1}^{r} \beta_j^0 = 1$, comes from a uniform distribution $\mathcal{U}(\epsilon, 1 - \epsilon)$ where $\epsilon = 10^{-5}$. Using those values, we compute the true probabilities

$$\pi_{1j}(\phi^0) = \frac{\beta_j^0}{1 + \theta^0 \sum_{m=1}^{r} z_m \beta_m^0}, \qquad \pi_{2j}(\phi^0) = \frac{\theta^0 \beta_j^0 \sum_{m=1}^{r} z_m \beta_m^0}{1 + \theta^0 \sum_{m=1}^{r} z_m \beta_m^0}, \qquad j = 1, \ldots, r$$

linked to the multinomial distribution presented in subsection 4.1. Finally, one generates the total number $n$ of crashes from a Poisson distribution and then randomly shares it between the before and after periods using the probabilities $\pi_{1j}(\phi^0)$ and $\pi_{2j}(\phi^0)$. The observed values $y_{ij}$, such that $\sum_{i=1}^{2} \sum_{j=1}^{r} y_{ij} = n$, are thus obtained.
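A sketch of this simulation scheme follows; the renormalisation of $\beta^0$ and the Poisson mean `lam` are our assumptions where the text leaves details open:

```python
import numpy as np

def simulate_crash_data(r, theta0, lam, rng):
    """Simulate one crash dataset following subsection 5.1."""
    z = rng.uniform(0.5, 2.5, r)                     # components of Z ~ U(1/2, 5/2)
    beta0 = rng.uniform(1e-5, 1.0 - 1e-5, r)
    beta0 /= beta0.sum()                             # assumed: rescale so sum(beta0) = 1
    E = beta0 @ z
    pi1 = beta0 / (1.0 + theta0 * E)                 # before-period probabilities
    pi2 = theta0 * beta0 * E / (1.0 + theta0 * E)    # after-period probabilities
    n = rng.poisson(lam)                             # total number of crashes
    y = rng.multinomial(n, np.concatenate([pi1, pi2]))
    return z, beta0, n, y[:r], y[r:]

rng = np.random.default_rng(42)
z, beta0, n, y1, y2 = simulate_crash_data(3, 0.6, 5000, rng)
```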

5.2. Numerical Results

This subsection deals with the numerical convergence of the partial linear approximation algorithm. As is usual in the study of iterative algorithms, we analyse the influence of the initial solution $\phi^{(0)} = (\theta^{(0)}, \beta^{(0)})$, the number of iterations, the computation time (CPU time) and the mean squared error. The performances of the partial linear approximation algorithm are compared to those of some classical optimization methods available in R and MATLAB software. The computations presented in this section were executed on a PC with an AMD E-350 1.6 GHz processor.

The methods selected for comparison are Newton-Raphson's method, the Nelder-Mead (NM) simplex algorithm [21], the quasi-Newton BFGS algorithm (from the names of its authors Broyden, Fletcher, Goldfarb and Shanno [22] [23] [24] [25]), the Interior Point (IP) algorithm [26], the Levenberg-Marquardt (LM) algorithm [27] [28] and Trust Region (TR) algorithms [29]. In our work, the BFGS and NM algorithms are implemented using the alabama package developed by Varadhan [30].

The simulation process was performed on many simulated crash datasets. For each one, small and large values of $n$ were considered. The results presented here correspond to the case $r = 3$, $n \in \{50; 5000\}$, $\phi^0 = (\theta^0, \beta^0)$ with $\theta^0 = 0.6$ and $\beta^0 = (0.025, 0.232, 0.743)$.

Three different ways of setting $\beta^{(0)}$ were considered (a minimal sketch is given after the list):

1) Uniform initialisation: $\beta_1^{(0)} = \beta_2^{(0)} = \cdots = \beta_r^{(0)} = 1/r$.

2) Proportional initialisation: $\beta_j^{(0)} = \dfrac{y_{1j} + y_{2j}}{n}$, $j = 1, \ldots, r$.

3) Random initialisation: for $j = 1, \ldots, r$, $\beta_j^{(0)} = u_j / \left( \sum_{i=1}^{r} u_i \right)$ where each $u_i$ ($i = 1, \ldots, r$) is randomly generated from a uniform distribution $\mathcal{U}(0.05, 0.95)$.
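In code (reusing `y1`, `y2` and `rng` from the simulation sketch of subsection 5.1), the three initialisations read:

```python
import numpy as np

r = 3
beta0_uniform = np.full(r, 1.0 / r)          # 1) uniform
beta0_prop = (y1 + y2) / (y1 + y2).sum()     # 2) proportional
u = rng.uniform(0.05, 0.95, r)               # 3) random
beta0_random = u / u.sum()
```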

By combining the two values of $n$ and the three ways to initialize $\beta^{(0)}$, we get six scenarios and for each one, we performed 1000 replications. The stopping criterion is the condition $|l(\phi^{(k+1)}) - l(\phi^{(k)})| < 10^{-5}$.

Tables 1-6 present a few of the results obtained over the 6000 simulations. All computation times are given in seconds, and the duration ratio of an

Table 1. Results for uniform initialisation of $\beta^{(0)}$, $r = 3$, $n = 50$.

Table 2. Results for uniform initialisation of $\beta^{(0)}$, $r = 3$, $n = 5000$.

Table 3. Results for proportional initialisation of $\beta^{(0)}$, $r = 3$, $n = 50$.

Table 4. Results for proportional initialisation of $\beta^{(0)}$, $r = 3$, $n = 5000$.

algorithm is defined as the ratio between the mean computation time of that algorithm and the mean computation time of the PLA (therefore the duration ratio of the partial linear approximation algorithm always equals 1). The computation time depends on the computer used to perform the simulations, while the duration ratio is computer-free and therefore more useful.

To analyse the convergence, we used the mean squared error (MSE) defined as:

Table 5. Results for random initialisation of $\beta^{(0)}$, $r = 3$, $n = 50$.

Table 6. Results for random initialisation of $\beta^{(0)}$, $r = 3$, $n = 5000$.

$$\mathrm{MSE}(\hat\phi, \phi^0) = \frac{1}{1+r} \left( (\hat\theta - \theta^0)^2 + \sum_{j=1}^{r} (\hat\beta_j - \beta_j^0)^2 \right) \quad (20)$$
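In code, with hypothetical estimate and truth arrays:

```python
import numpy as np

def mse(theta_hat, beta_hat, theta0, beta0):
    """Mean squared error of Equation (20)."""
    beta_hat, beta0 = np.asarray(beta_hat), np.asarray(beta0)
    r = beta0.size
    return ((theta_hat - theta0) ** 2 + ((beta_hat - beta0) ** 2).sum()) / (1 + r)
```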

One can see that the MSE decreases when $n$ (the total number of road crashes) increases. The PLA is at least as efficient as the other algorithms. It always converges to reasonable and acceptable parameter estimates, and the estimate gets closer to the true value as $n$ increases.

For a small value of $n$ (see Table 1, Table 3 and Table 5), the estimate $\hat\phi$ produced by the PLA is relatively close to the true parameter vector and quite close to those of the other methods. More generally, all the compared methods have an MSE of order $10^{-2}$. However, the estimates produced by the BFGS and Nelder-Mead methods can be very far from the true values (see Table 5). In the case of random initialisation, the MSE of the Nelder-Mead and BFGS algorithms is 20 times greater than those of the other algorithms.

When $n$ increases ($n = 5000$), the MSE of the PLA decreases from $10^{-2}$ to $10^{-4}$, so the PLA is also efficient. Unsurprisingly, when $n$ is very large, the estimates produced by the other algorithms are closer to the true values than those of the PLA. This is expected because the other algorithms work with the exact gradient. However, the Nelder-Mead and BFGS algorithms produce estimates very far from the truth: their MSE is of order $10^{-2}$ (Table 4 and Table 6) while those of the PLA and the other methods are of order $10^{-4}$.

To analyse the influence of the initial guess, we considered the mean number of iterations and the amplitude of the iterations (i.e. the difference between the maximum and the minimum number of iterations). An increase in the amplitude of iterations suggests a greater influence of the initial solution $\phi^{(0)}$. The results given in Tables 1-6 suggest that the PLA is stable and robust to initial guesses of the parameter being estimated. Over the 6000 replications, the number of iterations needed by the PLA to converge lies between 2 and 4. In other words, setting the initial guess to 6000 different values chosen in the parameter space does not disturb the PLA. This performance is as good as those of Newton-Raphson and Levenberg-Marquardt and by far better than those of BFGS, Nelder-Mead and Interior Point, whose numbers of iterations vary respectively from 1 to 15, 1 to 21 and 3 to 31.

As far as the computation time is concerned, it can be noticed that all the CPU time ratios are greater than 1, which means that none of the compared algorithms is faster than the PLA. On average, the PLA is 4 to 5 times quicker than Newton-Raphson, 239 to 345 times quicker than the BFGS algorithm, 354 to 395 times quicker than Nelder-Mead, 27 to 31 times quicker than Levenberg-Marquardt, 33 to 39 times quicker than the Trust Region algorithms and 311 to 383 times quicker than the Interior Point algorithm. This can be an important factor when larger values of $r$ are considered.

6. Real-Data Analysis

We apply the partial linear approximation algorithm to data concerning changes applied to road markings on a rural road in Nord-Pas-de-Calais (France) [31]. This road lay-out consists in what is called “Italian marking”. The road markings were modified in order to make overtaking impossible in both directions at the same time. On a short distance, there are two lanes in the first direction (overtaking is then allowed in that direction) and only one in the other direction. Both directions are separated by a white line. A bit farther away, the system is inverted so that overtaking is possible in the opposite direction, and so on. The data recorded over eight years (four years before and four years after) on the marked area are given in Table 7.

Note that the number of fatal crashes (Fatal) decreased from four (in the before period) to one after the lay-out change: there were four accidents with at least one person killed in the period before the road works and one such crash in the period after. The number of slight crashes (no serious bodily injuries involved) recorded over the same period was more than halved. For the same lengths of time, on a portion of National Road 17 used as a control area, the accident record is as follows.

The values given in Table 8 are obtained by dividing, for each accident type, the number of crashes after the changes by the number of crashes before. On the whole, a decrease can be noted in comparison with the control area accident numbers between the 4-year period after the changes and the 4-year period before. All the algorithms can be applied to Table 7 and Table 8. Since the simulation results suggested that the partial linear approximation algorithm converges after a few iterations and remains steady, we only present the estimates obtained with the partial linear approximation algorithm (Table 9).

Parameters $\beta_1$, $\beta_2$ and $\beta_3$ enable us to assess the risks of each type of accident in the test area over the eight years during which the road markings' effects are studied. The estimated values $\hat\beta_1$, $\hat\beta_2$ and $\hat\beta_3$ are respectively 0.1525, 0.1605 and 0.6870. These values suggest that over the eight years of road marking analysis, 15.25% of the crashes recorded in the test area were fatal, 16.05% were serious and 68.70% were slight, as compared to the crashes recorded in the control area.

The estimated mean efficiency index $\hat\theta$ is 0.7054. It corresponds to an average decrease in proportion of $29.46\% = (1 - 0.7054) \times 100\%$ of the whole

Table 7. Crash data for experimental zone.

Table 8. Crash data for control zone.

Table 9. Pas-de-Calais crash data using the partial linear approximation algorithm.

set of accidents in the test area as compared to the average trend in the control area. The significance of the mean effect for this type of lay-out may also be tested by using the confidence interval at the 95% level associated with the parameter $\theta$. This confidence interval reveals a bracket of values whose lower (resp. upper) bound is strictly greater than 0 (resp. 1). Indeed, as $1 \in [0.1646, 1.2463]$, we cannot reject the hypothesis $H_0: \theta = 1$ versus $H_1: \theta \neq 1$ with a type-1 error of 5%. Even if, in the case studied here, we notice a decrease in proportion of $29.46\% = (1 - 0.7054) \times 100\%$ in the average accident number in the test area, the above test shows that the mean efficiency index is not significantly different from 1, so we cannot conclude that this type of road marking is efficient. In practice, an analysis according to periods, recorded data and control area should be carried out in order to get more appropriate conclusions.
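For the record, the reported interval is consistent with a normal approximation $\hat\theta \pm 1.96\,\hat\sigma(\hat\theta)$; the standard error used below is back-computed from the reported bounds, not taken from the paper:

```python
import numpy as np

theta_hat = 0.7054
se_theta = 0.2759      # assumed: inferred from the interval [0.1646, 1.2463]
lo, hi = theta_hat + np.array([-1.0, 1.0]) * 1.96 * se_theta
# lo ~ 0.165 and hi ~ 1.246: since 1 lies inside the interval,
# H0: theta = 1 is not rejected at the 5% level.
```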

7. Concluding Remarks

We propose in this paper, under assumptions, a principle of 2-block splitting of a constrained Maximum Likelihood (ML) system of equations linked to a parameter vector. We obtain analytical approximations of the components of the constrained ML estimator. The asymptotic standard errors are also obtained in closed form.

We then build an iterative algorithm by initializing the first component of the parameter vector, without inverting, at any iteration, the information matrix or the Hessian matrix. Our partial linear approximation algorithm cycles through the components of the parameter vector and updates one component at a time. It is very simple to program and the constraints are integrated easily. To implement our algorithm, we use a particular version of the multinomial model of N'Guessan et al. [18] used to estimate the average effect of a road safety measure and the different accident risks when road conditions are improved. We prove that the assumptions are all satisfied and we obtain simple expressions of the estimators and their asymptotic variances. Afterwards, we give a practical version of our algorithm.

The numerical performance of the proposed algorithm is examined through a simulation study and compared to those of classical methods available in R and MATLAB software. The choice of these other methods is dictated not only by the fact that they are relevant and integrated in most statistical software, but also by the fact that some need second-order derivatives and others do not. The comparisons suggest not only that the partial linear approximation algorithm is competitive in statistical accuracy and computational speed with the best currently available algorithms, but also that it is not disturbed by the initial guess.

The link between the numerical performance of our algorithm and the particular model used in this paper may seem to limit the competitiveness of our algorithm with regard to the other methods considered here. However, this particular choice of model not only shows the feasibility of our algorithm but also represents a good basis for approaching, under additional conditions, more general models.

The simulation results obtained on the particular model considered in this paper suggest that our algorithm may be extended to other families of multinomial models (such as the model of N'Guessan et al. [18]). Drawing our inspiration from [32], we may also prove the asymptotic strong consistency of the estimator obtained from our algorithm. This perspective will give a wider interest to our algorithm in the context of large-scale use.

Acknowledgements

We thank the Editor and the anonymous referee for their remarks and suggestions which enabled a substantial improvement of this paper. This article was partially written while the second author was visiting the Lille 1 University (France).

Cite this paper

N’Guessan, A., Geraldo, I.C. and Hafidi, B. (2017) An Approximation Method for a Maximum Likelihood Equation System and Application to the Analysis of Accidents Data. Open Journal of Statistics, 7, 132-152. http://dx.doi.org/10.4236/ojs.2017.71011

References

1. Hauer, E. (1997) Observational Before-After Studies in Road Safety: Estimating the Effect of Highway and Traffic Engineering Measures on Road Safety. Pergamon, Oxford.

2. N'Guessan, A., Essai, A. and Langrand, C. (2001) Estimation multidimensionnelle des contrôles et de l'effet moyen d'une mesure de sécurité routière. Revue de Statistique Appliquée, 49, 85-102.

3. Lord, D. and Mannering, F. (2010) The Statistical Analysis of Crash-Frequency Data: A Review and Assessment of Methodological Alternatives. Transportation Research Part A, 44, 291-305.

4. Aitchison, J. and Silvey, S.D. (1958) Maximum Likelihood Estimation of Parameters Subject to Restraints. The Annals of Mathematical Statistics, 29, 813-828. https://doi.org/10.1214/aoms/1177706538

5. Crowder, M. (1984) On the Constrained Maximum Likelihood Estimation with Non-I.I.D. Observations. Annals of the Institute of Statistical Mathematics, 36, 239-249. https://doi.org/10.1007/BF02481968

6. Haber, M. and Brown, M.B. (1986) Maximum Likelihood Methods for Log-Linear Models When Expected Frequencies Are Subject to Linear Constraints. Journal of the American Statistical Association, 81, 477-482. https://doi.org/10.1080/01621459.1986.10478293

7. Lange, K. (2010) Numerical Analysis for Statisticians. 2nd Edition, Springer, Berlin. https://doi.org/10.1007/978-1-4419-5945-4

8. Li, Z.F., Osborne, M.R. and Prvan, T. (2003) Numerical Algorithms for Constrained Maximum Likelihood Estimation. ANZIAM Journal, 45, 91-114. https://doi.org/10.1017/S1446181100013171

9. Silvapulle, M.J. and Sen, P.K. (2005) Constrained Statistical Inference. Wiley, Hoboken.

10. Wang, Y. (2007) Maximum Likelihood Computation Based on the Fisher Scoring and Gauss-Newton Quadratic Approximations. Computational Statistics & Data Analysis, 51, 3776-3787. https://doi.org/10.1016/j.csda.2006.12.037

11. Mkhadri, A., N'Guessan, A. and Hafidi, B. (2010) An MM Algorithm for Constrained Estimation in a Road Safety Measure Modeling. Communications in Statistics - Simulation and Computation, 39, 1057-1071. https://doi.org/10.1080/03610911003778119

12. El Barmi, H. and Dykstra, R.L. (1998) Maximum Likelihood Estimates via Duality for Log-Convex Models When Cell Probabilities Are Subject to Convex Constraints. The Annals of Statistics, 26, 1878-1893. https://doi.org/10.1214/aos/1024691361

13. Matthews, G.B. and Crowther, N.A.S. (1995) A Maximum Likelihood Estimation Procedure When Modelling in Terms of Constraints. South African Statistical Journal, 29, 29-51.

14. Liu, C. (2000) Estimation of Discrete Distributions with a Class of Simplex Constraints. Journal of the American Statistical Association, 95, 109-120. https://doi.org/10.1080/01621459.2000.10473907

15. Ouellette, D.V. (1981) Schur Complement and Statistics. Linear Algebra and Its Applications, 36, 187-295. https://doi.org/10.1016/0024-3795(81)90232-9

16. Zhang, F. (2005) The Schur Complement and Its Applications. Springer, Berlin. https://doi.org/10.1007/b105056

17. N'Guessan, A. and Langrand, C. (2005) A Covariance Components Estimation Procedure When Modelling a Road Safety Measure in Terms of Linear Constraints. Statistics, 39, 303-314. https://doi.org/10.1080/02331880500108544

18. N'Guessan, A., Essai, A. and N'zi, M. (2006) An Estimation Method of the Average Effect and the Different Accident Risks When Modelling a Road Safety Measure: A Simulation Study. Computational Statistics & Data Analysis, 51, 1260-1277. https://doi.org/10.1016/j.csda.2005.09.002

19. N'Guessan, A. (2010) Analytical Existence of Solutions to a System of Non-Linear Equations with Application. Journal of Computational and Applied Mathematics, 234, 297-304. https://doi.org/10.1016/j.cam.2009.12.026

20. N'Guessan, A. and Geraldo, I.C. (2015) A Cyclic Algorithm for Maximum Likelihood Estimation Using Schur Complement. Numerical Linear Algebra with Applications, 22, 1161-1179. https://doi.org/10.1002/nla.1999

21. Nelder, J.A. and Mead, R. (1965) A Simplex Method for Function Minimization. The Computer Journal, 7, 308-313. https://doi.org/10.1093/comjnl/7.4.308

22. Broyden, C.G. (1970) The Convergence of a Class of Double-Rank Minimization Algorithms. Journal of the Institute of Mathematics and Its Applications, 6, 76-90. https://doi.org/10.1093/imamat/6.1.76

23. Fletcher, R. (1970) A New Approach to Variable Metric Algorithms. The Computer Journal, 13, 317-322. https://doi.org/10.1093/comjnl/13.3.317

24. Goldfarb, D. (1970) A Family of Variable Metric Updates Derived by Variational Means. Mathematics of Computation, 24, 23-26. https://doi.org/10.1090/S0025-5718-1970-0258249-6

25. Shanno, D.F. (1970) Conditioning of Quasi-Newton Methods for Function Minimization. Mathematics of Computation, 24, 647-656. https://doi.org/10.1090/S0025-5718-1970-0274029-X

26. Waltz, R.A., Morales, J.L., Nocedal, J. and Orban, D. (2006) An Interior Algorithm for Nonlinear Optimization That Combines Line Search and Trust Region Steps. Mathematical Programming, 107, 391-408. https://doi.org/10.1007/s10107-004-0560-5

27. Levenberg, K. (1944) A Method for the Solution of Certain Problems in Least Squares. Quarterly of Applied Mathematics, 2, 164-168. https://doi.org/10.1090/qam/10666

28. Marquardt, D. (1963) An Algorithm for Least-Squares Estimation of Nonlinear Parameters. SIAM Journal on Applied Mathematics, 11, 431-441. https://doi.org/10.1137/0111030

29. Byrd, R.H., Schnabel, R. and Shultz, G.A. (1988) Approximate Solution of the Trust Region Problem by Minimization over Two-Dimensional Subspaces. Mathematical Programming, 40, 247-263. https://doi.org/10.1007/BF01580735

30. Varadhan, R. (2011) alabama: Constrained Nonlinear Optimization. R Package Version 9-1.

31. N'Guessan, A. and Truffier, M. (2008) Impact d'un aménagement de sécurité routière sur la gravité des accidents de la route. Journal de la Société Française de Statistique, 149, 23-41.

32. Geraldo, I.C., N'Guessan, A. and Gneyou, K.E. (2015) A Note on the Strong Consistency of a Constrained Maximum Likelihood Estimator Used in Crash Data Modelling. Comptes Rendus Mathématique, 353, 1147-1152. https://doi.org/10.1016/j.crma.2015.09.025

8. Appendix

8.1. Proof of Lemma 1

Proof. Problem (3) is equivalent to maximizing the function

$$L(\phi, \lambda) = l(\phi) - \lambda C(\phi) \quad (21)$$

where $\lambda$ is the Lagrange multiplier. Any solution $\hat\phi$ must satisfy

$$\left(\frac{\partial L}{\partial\theta}\right)_{\hat\phi} = 0 \quad \text{and} \quad \left(\frac{\partial L}{\partial\beta_j}\right)_{\hat\phi} = 0, \quad j = 1, \ldots, r. \quad (22)$$

From Assumption 3, system (22) is equivalent to

$$\left(\frac{\partial l}{\partial\theta}\right)_{\hat\phi} = 0 \quad \text{and} \quad \left(\frac{\partial l}{\partial\beta_j}\right)_{\hat\phi} - \hat\lambda \left(\frac{\partial C}{\partial\beta_j}\right)_{\hat\phi} = 0, \quad j = 1, \ldots, r \quad (23)$$

where $\hat\lambda = \lambda(\hat\phi)$. After multiplication by $\hat\beta_j$ and summation over the index $j$, we get:

$$\sum_{j=1}^{r} \hat\beta_j \left(\frac{\partial l}{\partial\beta_j}\right)_{\hat\phi} - \hat\lambda \sum_{j=1}^{r} \hat\beta_j \left(\frac{\partial C}{\partial\beta_j}\right)_{\hat\phi} = 0,$$

which is equivalent to

$$\left\langle \hat\beta, \left(\frac{\partial l}{\partial\beta}\right)_{\hat\phi} \right\rangle - \hat\lambda \left\langle \hat\beta, C_{\hat\beta} \right\rangle = 0.$$

We obtain (6) by substituting $\hat\lambda$ into (23). □

8.2. Proof of Theorem 2

Proof. Under Assumptions 3-6, $\Gamma_\phi$ is non-singular and, after some matrix manipulations, we get

$$\Gamma_\phi^{-1} = \begin{bmatrix} W_\phi & J_\phi^{-1} C_\phi R_\phi^{-1} \\ R_\phi^{-1} C_\phi^T J_\phi^{-1} & -R_\phi^{-1} \end{bmatrix}, \qquad J_\phi^{-1} = \begin{bmatrix} (J_\phi/B_\phi)^{-1} & -(J_\phi/B_\phi)^{-1} U_\phi^T B_\phi^{-1} \\ -B_\phi^{-1} U_\phi (J_\phi/B_\phi)^{-1} & (J_\phi/\tau_\phi)^{-1} \end{bmatrix}$$

where $R_\phi^{-1} = a_{n,\phi}\, \|C_\beta\|^{-2}_{\Sigma_\phi^{-1}}$ is a scalar,

$$B_\phi^{-1} = a_{n,\phi}^{-1} \left( \Sigma_\phi^{-1} - \left(1 + \|V_\phi\|^2_{\Sigma_\phi^{-1}}\right)^{-1} \Sigma_\phi^{-1} V_\phi V_\phi^T \Sigma_\phi^{-1} \right)$$

is the inverse of $B_\phi$, $(J_\phi/B_\phi)^{-1} = b_{n,\phi}^{-2}\, a_{n,\phi} \left(1 + \|V_\phi\|^2_{\Sigma_\phi^{-1}}\right)$ is a scalar, $(J_\phi/\tau_\phi)^{-1} = a_{n,\phi}^{-1} \Sigma_\phi^{-1}$ is an $r \times r$ matrix and $W_\phi = J_\phi^{-1} - J_\phi^{-1} C_\phi R_\phi^{-1} C_\phi^T J_\phi^{-1}$ is the $(1+r) \times (1+r)$ asymptotic variance-covariance matrix of $\hat\phi$. We show that

$$W_\phi = \begin{bmatrix} W_\phi(1,1) & W_\phi^T(2,1) \\ W_\phi(2,1) & W_\phi(2,2) \end{bmatrix}$$

where $W_\phi(1,1)$ is a scalar and $W_\phi(2,2)$ is an $r \times r$ matrix. The asymptotic variance of $\hat\theta$ is given by $W_\phi(1,1)$ and $\hat\sigma^2(\hat\beta_j)$ is obtained from the diagonal elements of $W_\phi(2,2)$. After straightforward calculation, we get

$$W_\phi(1,1) = (J_\phi/B_\phi)^{-1} - R_\phi^{-1}\, \xi_\phi^2\, (J_\phi/B_\phi)^{-2}\, b_{n,\phi}^2\, a_{n,\phi}^{-2} \left(1 + \|V_\phi\|^2_{\Sigma_\phi^{-1}}\right)^{-2}$$

and

$$W_\phi(2,2) = a_{n,\phi}^{-1} \left( \Sigma_\phi^{-1} - \|C_\beta\|^{-2}_{\Sigma_\phi^{-1}}\, \Sigma_\phi^{-1} C_\beta C_\beta^T \Sigma_\phi^{-1} \right)$$

where $\xi_\phi = V_\phi^T \Sigma_\phi^{-1} C_\beta$. The results of Theorem 2 follow from a direct calculation.

8.3. Proof of Corollary 2

Proof. Taking the second derivatives of the negative of $L$ with respect to the components of $\phi$ and $\lambda$, we show that:

$$-\frac{\partial^2 L}{\partial\lambda^2} = -\frac{\partial^2 L}{\partial\lambda\,\partial\theta} = 0, \quad -\frac{\partial^2 L}{\partial\lambda\,\partial\beta_j} = 1, \quad -\frac{\partial^2 L}{\partial\theta^2} = \frac{y_{2\cdot}}{\theta^2} - \frac{n (E(Z))^2}{(1 + \theta E(Z))^2}, \quad -\frac{\partial^2 L}{\partial\theta\,\partial\beta_j} = \frac{n z_j}{(1 + \theta E(Z))^2}$$

and

$$-\frac{\partial^2 L}{\partial\beta_j\,\partial\beta_k} = \begin{cases} \dfrac{y_{\cdot j}}{\beta_j^2} - \dfrac{n \theta^2 z_j^2}{(1 + \theta E(Z))^2} + \dfrac{y_{2\cdot}\, z_j^2}{(E(Z))^2} & \text{if } j = k \\[2ex] -\dfrac{n \theta^2 z_j z_k}{(1 + \theta E(Z))^2} + \dfrac{y_{2\cdot}\, z_j z_k}{(E(Z))^2} & \text{if } j \neq k \end{cases}$$

for $j, k = 1, \ldots, r$. Now, setting $E(y_{\cdot j}) = n\beta_j$, $E(y_{2\cdot}) = n\theta E(Z)/(1 + \theta E(Z))$ and $\gamma = n/(1 + \theta E(Z))$, we obtain the components of the matrix $J_\phi$ as follows:

$$\tau_\phi = \frac{\gamma^2 E(Z)}{n\theta}, \quad U_{\phi,j} = \frac{\gamma^2 z_j}{n}, \quad B_{\phi,jj} = \gamma \left( \frac{n}{\gamma \beta_j} + \frac{\theta\gamma z_j^2}{n E(Z)} \right) \quad \text{and} \quad B_{\phi,jk} = \frac{\theta \gamma^2 z_j z_k}{n E(Z)}, \quad (j \neq k)$$

where $U_{\phi,j}$ (resp. $B_{\phi,jj}$ and $B_{\phi,jk}$) are the components of the vector $U_\phi$ (resp. the elements of the matrix $B_\phi$). Using these expressions of the elements of $U_\phi$ and $B_\phi$ and after straightforward calculation, we show that

$$W_\phi(2,2) = \gamma^{-1} \Sigma_\phi^{-1} - \frac{1}{n} \beta \beta^T$$

and we then deduce the expression of $\hat\sigma^2(\hat\beta_j)$ given by Corollary 2. In the same way, we obtain

$$W_\phi(1,1) = \frac{n\theta}{\gamma^2 t_\phi E(Z)} - \frac{\theta^2}{n}$$

where $t_\phi = \dfrac{n^2 E(Z)}{n^2 E(Z) + \theta \gamma^2 E(Z^2)}$. Using this last expression of $t_\phi$, we get the expression of $\hat\sigma^2(\hat\theta)$. □