In this paper, we consider simulated minimum Hellinger distance (SMHD) inference for count data. We consider grouped and ungrouped data and emphasize SMHD methods. The approaches extend methods based on the deterministic version of the Hellinger distance for count data. The methods are general: they only require that random samples can be drawn from the discrete parametric family, and they can serve as alternatives to estimation based on the probability generating function (pgf) or on matching moments. Whereas this paper focuses on count data, goodness-of-fit tests based on the simulated Hellinger distance can also be applied to continuous distributions when continuous observations are grouped into intervals, as in the case of the traditional Pearson statistic. Asymptotic properties of the SMHD methods are studied, and the methods appear to preserve the good efficiency and robustness properties of the deterministic version.
Nonnegative discrete parametric families of distributions are useful for modeling count data. Many of these families have neither a closed-form probability mass function (pmf) nor a closed-form recursive formula for the pmf. Their pmfs can only be expressed using an infinite series representation, but their corresponding Laplace transforms have a closed form and, in many situations, are relatively simple. Probability generating functions are often used for discrete distributions, but Laplace transforms are equivalent and can also be used. In this paper, we use Laplace transforms, converting them to probability generating functions (pgfs) whenever needed to link with results which already appear in the literature. We begin with a few examples to illustrate the situation often encountered when new distributions are created.
Example 1 (Discrete stable distributions) The random variable X ≥ 0 follows a positive stable law if the probability generating function and Laplace transform are given respectively as
$P_\beta(s) = E(s^X) = e^{-\lambda(1-s)^\alpha},\quad 0 < \alpha \le 1,\ \lambda > 0,\ \beta = (\lambda, \alpha)',\ |s| \le 1$
and
$\varphi_\beta(s) = E(e^{-sX}) = e^{-\lambda(1-e^{-s})^\alpha},\quad 0 < \alpha \le 1,\ \lambda > 0,\ \beta = (\lambda, \alpha)',\ s \ge 0.$
The distribution was introduced by Christoph and Schreiber.
It is easy to see that $\varphi_\beta(s) = P_\beta(e^{-s})$.
The Poisson distribution is obtained by fixing $\alpha = 1$. The distribution is infinitely divisible and displays long-tail behavior. A recursive formula for its mass function has been obtained; see expression (8) of Christoph and Schreiber.
Now if we allow $\lambda$ to be a random variable with an inverse Gaussian distribution whose Laplace transform is given by $h(s) = e^{\mu(1-\sqrt{1+2s/\mu})}$, $s \ge -\mu/2$, a mixed nonnegative discrete stable distribution can be created with Laplace transform given by
$\varphi_\beta(s) = \int_0^\infty (g(s))^\lambda \, dH(\lambda),$
where $g(s) = e^{-(1-e^{-s})^\alpha}$ and $H(\lambda)$ is the distribution with Laplace transform $h(s)$. The resulting Laplace transform,
$\varphi_\beta(s) = \exp\left(\mu\left(1 - \sqrt{1 + \tfrac{2}{\mu}(1-e^{-s})^\alpha}\right)\right),$
is the Laplace transform of a nonnegative infinitely divisible (ID) distribution.
We can see that it is not always straightforward to find the recursive formula for the pmf of a nonnegative count distribution. Even when it is available, it might still be too complicated to use numerically for inference, while the Laplace transform or pgf can have a relatively simple representation.
We can observe that the new distribution is obtained by using the inverse Gaussian distribution as a mixing distribution. This is also an example of the use of a power mixture (PM) operator to obtain a new distribution. The PM operator will be further discussed in Section 1.2.
From a statistical point of view, when neither a closed form pmf nor a recursive formula for the pmf exists, maximum likelihood estimation can be difficult to implement.
The power mixture operator was introduced by Abate and Whitt.
Definition 1.1.3. A nonnegative random variable X is infinitely divisible if its Laplace transform can be written as
$\psi(s) = (k_n(s))^n, \quad n = 1, 2, \cdots,$
where $k_n(s)$ is also the Laplace transform of a random variable. In many situations, $k_n(s)$ and $\psi(s)$ belong to the same parametric family. See Panjer and Willmot.
Abate and Whitt define the power mixture operator as follows.
Suppose that X t is an infinitely divisible nonnegative discrete random variable such that the Laplace transform can be expressed as ( κ ( s ) ) t , t ≥ 0 , where κ ( s ) is the Laplace transform of X, which is nonnegative and infinitely divisible as well. The power mixture (PM) with mixing distribution function H ( y ) and Laplace transform κ H ( s ) of a nonnegative random variable Y is defined as the Laplace transform
$\eta(s) = \mathrm{PM}(\kappa, H) = \int_0^\infty (\kappa(s))^t \, dH(t) = \kappa_H(-\log(\kappa(s))).$
Furthermore, if $H(y)$ is infinitely divisible, then the distribution with Laplace transform $\eta(s)$ is also infinitely divisible. The random variable $Y \ge 0$ with distribution $H(y)$ can be discrete or continuous but needs to be ID. This is the PM method for creating new parametric families, i.e., using the PM operator. The PM method can be viewed as a form of continuous compounding. The ID requirement can be dropped, but as a result the new distribution created using the PM operator need not be ID. For the traditional compounding methods, see Klugman et al.
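As an illustration of how the PM operator translates into simulation (a sketch, not taken from the paper: the Poisson base with rate $\mu$ and the gamma mixing distribution are illustrative assumptions), one first draws a realized value $t$ from $H$ and then draws from the distribution with Laplace transform $(\kappa(s))^t$; for a Poisson base, $(\kappa(s))^t$ is again a Poisson Laplace transform, with rate $t\mu$:

```python
import numpy as np

rng = np.random.default_rng(123)

def sample_pm_poisson(mix_sampler, mu=1.0, size=10000):
    """Draw from the PM distribution with Poisson base, kappa(s) = exp(mu(e^{-s}-1)).

    Step 1: draw t from the mixing distribution H.
    Step 2: draw X from the distribution with LT kappa(s)^t, here Poisson(t*mu).
    """
    t = mix_sampler(size)        # realized values of Y ~ H
    return rng.poisson(t * mu)   # X | Y = t ~ Poisson(t*mu)

# Illustrative choice: gamma mixing, which yields a Poisson-gamma mixture;
# here E[X] = mu * E[Y] = 1.0 * 3.0 = 3.0.
x = sample_pm_poisson(lambda n: rng.gamma(shape=2.0, scale=1.5, size=n))
```

If drawing from the distribution with Laplace transform $(\kappa(s))^t$ is easy for each realized $t$, the same two-step scheme applies to any base family.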
Example 2 (Generalized negative binomial) The generalized negative binomial (GNB) distribution was introduced by Gerber. It can be obtained using the PM operator with a mixing distribution whose Laplace transform is
$\kappa_H(s) = \exp\{-\lambda((\theta+s)^\alpha - \theta^\alpha)\},\quad \lambda, \theta > 0,\ 0 < \alpha < 1.$
Let $\kappa(s) = e^{e^{-s}-1}$ be the Laplace transform of a Poisson distribution with rate $\mu = 1$. The Laplace transform of the GNB distribution can be represented as
$\eta(s) = \exp(-\lambda((\theta - e^{-s} + 1)^\alpha - \theta^\alpha)).$
The corresponding pgf can be expressed as
$P(s) = \exp(-\lambda((\theta - s + 1)^\alpha - \theta^\alpha)).$
The pgf is given by expression (21) in the paper by Gerber.
Note that, if $H(y)$ is discrete, $\eta(s)$ is the Laplace transform of a random variable expressible as a random sum. A random sum is also called a stopped sum in the literature; see Chapter 9 of Johnson et al.
Example 3 Let $X = \sum_{i=1}^{Y} U_i$, where, conditionally on $Y$, the $U_i$'s are independent and identically distributed following a Poisson distribution with rate $\phi$, and $Y$ follows a Poisson distribution with rate $\lambda$. Using the power mixture operator, we conclude that the Laplace transform of $X$ is
$\eta(s) = \exp(\lambda(e^{\phi(e^{-s}-1)} - 1)),$
and the pgf is
$P(s) = \exp(\lambda(e^{\phi(s-1)} - 1)).$
Properties and applications of the Neyman Type A distribution have been studied by Johnson et al.
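The random-sum construction of Example 3 is straightforward to simulate: conditionally on $Y = y$, the sum of $y$ iid Poisson($\phi$) variables is Poisson($y\phi$). A minimal sketch (the parameter values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_neyman_a(lam, phi, size=10000):
    """Neyman Type A: Y ~ Poisson(lam), then X | Y ~ Poisson(Y * phi),
    since a sum of Y iid Poisson(phi) variables is Poisson(Y * phi)."""
    y = rng.poisson(lam, size=size)
    return rng.poisson(y * phi)

# E[X] = lam*phi and Var(X) = lam*phi*(1+phi), reflecting overdispersion.
x = sample_neyman_a(lam=2.0, phi=1.5)
```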
By tilting the density function using the Esscher transform, the Esscher transform operator can be defined and, provided the tilting parameter τ introduced is identifiable, new distributions can be created from existing ones.
Let X be the original random variable with Laplace transform κ ( s ) . The Esscher transform operator which can be viewed as a tilting operator is defined as
$\eta(s) = \mathrm{Esscher}(\kappa, \tau) = \dfrac{\kappa(s+\tau)}{\kappa(\tau)}.$
Let $\kappa(s)$ be the Laplace transform of a positive continuous random variable X. The Laplace transform of $Y = X + \tau$, $Y \ge \tau \ge 0$, is given by $e^{-\tau s}\kappa(s)$. So, we can define the shift operator as
$\eta(s) = \mathrm{Shift}(\kappa, \tau) = e^{-\tau s}\kappa(s).$
In some cases, even when the pmf of Y has a closed form, the maximum likelihood (ML) estimators might be attained at the boundaries of the parameter space, and then the ML estimators might not have the usual optimality properties.
Note that, parallel to the closed-form pgf expressions for these new discrete distributions, it is often simple to simulate from the new distributions if we can simulate from the original distribution before the operators are applied. For example, consider the new distribution obtained by using the Esscher operator. It suffices to simulate from the distribution before applying the operator and use the acceptance-rejection method to obtain a sample from the Esscher-transformed distribution. The situation is similar for new distributions created by the PM operator. If we can simulate one observation from the mixing distribution of Y, which gives a realized value t, and if it is not difficult to draw one observation from the distribution with LT $\kappa(s)^t$, then combining these two steps we obtain one observation from the new distribution created by the PM operator. Consequently, simulated inference methods offer alternatives to methods based on matching selected points of the empirical pgf with its model counterpart, or other related methods; see Doray et al.
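The acceptance-rejection step mentioned above can be sketched as follows for a tilting parameter $\tau > 0$: since the Esscher-tilted pmf is proportional to $e^{-\tau x} p(x)$, a proposal $X$ drawn from $p$ is accepted with probability $e^{-\tau X}$. The Poisson base below is an illustrative assumption; the Esscher tilt of a Poisson($\mu$) law is Poisson($\mu e^{-\tau}$), which provides a check.

```python
import numpy as np

rng = np.random.default_rng(42)

def esscher_sample(base_sampler, tau, size):
    """Acceptance-rejection sampler for the Esscher-tilted pmf
    p_tau(x) = exp(-tau*x) p(x) / kappa(tau), with tau > 0.

    Proposals come from p itself; since p_tau(x)/p(x) <= 1/kappa(tau),
    accepting a proposal x with probability exp(-tau*x) is exact."""
    out = []
    while len(out) < size:
        x = base_sampler(size)                # batch of proposals from p
        u = rng.uniform(size=size)
        out.extend(x[u < np.exp(-tau * x)])   # accept w.p. exp(-tau*x)
    return np.array(out[:size])

# Tilting a Poisson(2) pmf by tau = 0.5 gives a Poisson(2*exp(-0.5)) law.
y = esscher_sample(lambda n: rng.poisson(2.0, size=n), tau=0.5, size=5000)
```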
This leads us to look for alternative methods such as the simulated minimum Hellinger distance (SMHD) methods for count data. We shall consider grouped and ungrouped count data. With grouped data, the approach leads to simulated chi-square type statistics which can be used for model testing with discrete or continuous models; these statistics are similar to the traditional Pearson statistics. For model testing with continuous distributions, continuous observations grouped into intervals are reduced to count data, and with SMHD methods we do not need to integrate the model density function over intervals: it suffices to simulate from the continuous model and construct sample distribution functions to estimate interval probabilities. These features widen the scope of application of simulated methods.
We briefly describe the classical minimum Hellinger distance methods introduced by Simpson.
It is worth mentioning that simulated methods of inference are relatively recent. In advanced econometrics textbooks such as the book by Davidson and MacKinnon, simulated methods of inference are discussed.
Assume that we have a random sample of n independent and identically distributed (iid) nonnegative observations $X_1, \cdots, X_n$ from a pmf $p_\theta(x)$ with $x = 0, 1, \cdots$, where $\theta = (\theta_1, \cdots, \theta_m)'$ is the vector of parameters of interest and $\theta_0$ is the vector of the true parameters. Suppose the data are grouped into $r = k + 1$ disjoint intervals
where
by
and show that the estimators satisfy the regularity conditions of their Theorems 3.1 and 3.3, which lead to the conclusion that the simulated estimators are consistent and have an asymptotic normal distribution. As we already know, a weighted version can be more efficient; if we attempt a version S of the Pearson chi-square distance,
and since the denominator of the summand involves
distance will run into numerical difficulties. The traditional and deterministic version of the Hellinger distance as given by
is more appropriate for a version S and it is already known that it generates minimum HD estimators which are as efficient as the minimum chi-square estimators or maximum likelihood (ML) estimators for grouped data, see
Cressie-Read divergence measure with
Note that
so that
Since the objective function remains bounded, and this property continues to hold for the ungrouped data case, SMHD methods could preserve some of the nice robustness properties of version D.
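To make the construction concrete, here is a minimal sketch of SMHD estimation for ungrouped data under illustrative assumptions (a Poisson model, a grid search over the parameter, and a fixed seed across candidate parameter values, as the theory assumes): the objective is the Hellinger distance between the empirical pmf of the data and the empirical pmf of a sample simulated from the model.

```python
import numpy as np

def smhd_objective(data, sim):
    """Simulated Hellinger distance between the empirical pmf of the data
    and the empirical pmf of a simulated sample:
    HD = 2 - 2 * sum_x sqrt(p_hat(x) * p_tilde(x)).
    Only finitely many terms are nonzero, and the objective stays bounded."""
    m = max(data.max(), sim.max()) + 1
    p_hat = np.bincount(data, minlength=m) / len(data)
    p_til = np.bincount(sim, minlength=m) / len(sim)
    return 2.0 - 2.0 * np.sum(np.sqrt(p_hat * p_til))

rng = np.random.default_rng(0)
data = rng.poisson(3.0, size=500)        # observed sample, true theta = 3

def simulate(theta, u=10000, seed=1):
    # Same seed across theta values, so the objective varies smoothly in theta.
    return np.random.default_rng(seed).poisson(theta, size=u)

grid = np.linspace(1.0, 6.0, 101)
theta_hat = grid[np.argmin([smhd_objective(data, simulate(t)) for t in grid])]
```

The grid search is only for transparency; in practice a derivative-free optimizer would be used over the simulated objective.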
For ungrouped data, it is equivalent to have grouped data but using intervals with unit length
Note that for a data set the sum given by the RHS of the above expression only has a finite number of terms as
The version D with
has been investigated by Simpson.
SMHD methods appear to be useful for actuarial studies when there is a need to fit discrete risk models; see Chapter 9 of Panjer and Willmot.
In this paper, we develop unified simulated methods of inference for grouped and ungrouped count data using Hellinger distances, and it is organized as follows. Asymptotic properties of SMHD methods are developed in Section 2, where consistency and asymptotic normality are shown in Section 2.2. Based on the asymptotic properties, consistency of the SMHD estimators holds in general, but high efficiency of the SMHD estimators can only be guaranteed if the Fisher information matrix of the parametric model exists, a situation similar to likelihood estimation. One can also view the estimators as fully efficient within the class of simulated estimators obtained with the model pmf replaced by a simulated version. Chi-square goodness-of-fit test statistics are constructed in Section 2.3. The ungrouped case, which can be seen as grouped data with intervals of unit length and an infinite number of intervals, is treated in Section 3, where the ungrouped SMHD estimators are shown to have good efficiencies. The breakdown point of the SMHD estimators remains at least
Simulation results are included in Section 4. First, we consider the Neyman Type A distribution and compare the efficiencies of the SMHD estimators versus method of moments (MM) estimators; the simulation results appear to confirm the theoretical results showing that the SMHD estimators are more efficient than the MM estimators based on matching the first two empirical moments with their model counterparts, for a selected range of parameters. The Poisson distribution is considered next, and the study shows that, despite being less efficient than the ML estimator, the SMHD estimators retain high efficiency and are far more robust than the ML estimator in the presence of outliers, just as in the deterministic case shown by Simpson.
Pakes and Pollard [
or
where
Theorem 3.3 given in Pakes and Pollard [
and for HD distance, version D, let
and for version S, let
which can be reexpressed as
In general, the intervals
with
for continuous distribution with support of the entire real line used in financial
study, we might let
Let
of the true parameters is denoted by
We define MHD estimators as given by the vector
Theorem 1 (Consistency)
Under the following conditions
a)
b)
c)
Theorem 3.1 states condition b) as
An expression is
Therefore, for both versions of
Asymptotic normality is more complicated in general. For the grouped case, Theorem 3.3 given by Pakes and Pollard can be used.
Since the proofs have been given by the authors, we only discuss here the ideas of their proofs to make it easier to follow the results of Theorem 2 and Corollary 1 in Section (2.2.2).
For both versions,
not differentiable for version S, the traditional Taylor expansion argument cannot be used to establish asymptotic normality of estimators obtained by minimizing
Let
Under these circumstances, it suffices to work with
This condition is used to formulate Theorem 2 below and is slightly more stringent than condition iii) of their Theorem 3.3, but it is less technical and sufficient for SMHD estimation. Clearly, for SMHD estimation
In this section, we shall state a theorem, namely Theorem 2, which is essentially Theorem 3.3 of Pakes and Pollard.
Theorem 2
Let
Under the following conditions:
1) The parameter space Ω is compact,
2)
3)
4)
5)
6)
Then, we have the following representation which will give the asymptotic distribution of
or equivalently, using equality in distribution,
The proofs of these results follow from the results used to prove Theorem 3.3 given by Pakes and Pollard.
Corollary 1.
Let
The matrices
We observe that condition 4) of Theorem 2, when applied to the Hellinger distance or in general, involves technicalities. Condition 4) holds for version D, so we only need to verify it for version S. Note that verifying condition 4) is equivalent to verifying
and for version S of Hellinger distance estimation, let
and for the grouped case, it is given by
We need to verify that we have the sequence of functions
Note that
We shall outline the approach by first defining the notion of continuity in probability and let
Now as
details of these arguments are given in Technical Appendices TA1.1 and TA1.2 in the Appendices section at the end of the paper.
The notion of continuity in probability has been used in a similar context in the stochastic processes literature; see Gusak et al.
used previously by other approaches to establish
Definition 1 (Continuity in probability)
A sequence of random functions
result of continuity in real analysis. It is also well known that the supremum of a continuous function on a compact domain is attained at a point of the domain; see Davidson and Donsig.
is given by
It might be more precise to use the term sequence of random functions rather than just random function here for the notion of continuity in probability, as the random function depends on n.
Below are the assumptions we need to make to establish asymptotic normality for SMHD estimators and they appear to be reasonable.
Assumption 1
1) The pmf of the parametric model has the continuity property with
2) The simulated counterpart has the continuity in probability property with
3)
In general, condition 2) will be satisfied if condition 1) holds; implicitly, we assume the same seed is used to obtain the simulated samples across different values of
Definition 2 (Differentiability in probability)
The sequence of random functions
differentiability in real analysis for nonrandom function.
A similar notion of differentiability in probability has been used in the stochastic processes literature; see Gusak et al.
following assumption for
Assumption 2
This assumption appears to be reasonable; it can be checked by using limit operations as in real analysis with
Since regularity conditions for Theorem 2 and its corollary can be met and they are justified in TA1.1 and TA1.2 in the Appendices, we proceed here to find the asymptotic covariance matrix
Since
from the sample given by the data, so we can focus on version D and make the adjustment for version S. We need the asymptotic covariance matrix
vector
Recall that, from properties of the multinomial distribution, the covariances of
and
The covariance matrix of
and the asymptotic covariance matrix of
We then have the vector of HD estimators version D and S given respectively by
so
with
Therefore for version S,
the simulated sample size is
Note that for version D, the HD estimators are as efficient as the minimum chi-square estimators or ML estimators based on grouped data. The overall asymptotic relative efficiency (ARE) between version D and S for HD estimation is
simply ARE =
An estimate for the covariance matrix
The asymptotic covariance matrix of
For ungrouped data and for version D, it is equivalent to choose
and
Practitioners may skip Section 3 if their main interest is only in applications of the results.
In this section, the Hellinger distance
H0: data comes from a specified distribution with distribution
The version S is of interest since it allows testing goodness of fit for discrete or continuous distributions without closed-form pmfs or density functions; all we need is the ability to simulate from the specified distribution. We shall justify the asymptotic chi-square distributions given by expression (23) and expression (24) below.
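As a sketch of how version S permits goodness-of-fit testing by simulation alone (the cell boundaries, the scaling constant, and the Monte Carlo calibration used in place of the asymptotic chi-square distribution are all illustrative choices, not taken from the paper), one can compare observed cell proportions with cell proportions estimated from a sample simulated under the hypothesized model:

```python
import numpy as np

rng = np.random.default_rng(11)

def hellinger_stat(obs, sim, k=8):
    """Hellinger-type statistic over the cells {0}, ..., {k-1}, {>= k},
    comparing observed and simulated cell proportions."""
    def cells(x):
        return np.bincount(np.minimum(x, k), minlength=k + 1) / len(x)
    p, q = cells(obs), cells(sim)
    return 4.0 * len(obs) * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

def mc_pvalue(data, sampler, n_rep=200, u=5000):
    """Monte Carlo p-value: the observed statistic is compared with its
    distribution when the data truly come from the hypothesized model."""
    t_obs = hellinger_stat(data, sampler(u))
    t_null = np.array([hellinger_stat(sampler(len(data)), sampler(u))
                       for _ in range(n_rep)])
    return float(np.mean(t_null >= t_obs))

data = rng.poisson(2.0, size=300)
p_value = mc_pvalue(data, lambda n: rng.poisson(2.0, size=n))  # H0 holds here
```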
Note that
and
for version D. For version S,
Using standard results for distribution of quadratic forms and the property of the operator trace of a matrix with
Just as the chi-square distance, the Hellinger distance
H0: data comes from a parametric model
for version D and for version S,
where
which can be reexpressed as
or
With
by noting
using
For version S,
and
with
For the classical version D with ungrouped data, Simpson [
be the vector of partial derivatives with respect to
with
score functions with covariance matrix
For version D, we then have
or equivalently
Therefore, we can conclude that
the result of Theorem 2 given by Simpson [
For version S with ungrouped data, it is more natural to use Theorem 7.1 of Newey and McFadden.
with its equivalent expression given by expression (3).
Also, if
The vector
If the remainder of the approximation is small, we also have
Before defining the remainder term
as
For the approximation to be valid, we define
and requires
proofs of Theorem 7.1 given by Newey and McFadden. The following Theorem 3 is essentially Theorem 7.1 of Newey and McFadden, but restated for estimators obtained by minimizing an objective function instead of maximizing one, and requires
more stringent than the original condition v) of their Theorem 7.1. We also require compactness of the parameter space
Theorem 3
Suppose that
1)
2)
3)
4)
5)
The regularity conditions 1)-3) of Theorem 3 can easily be checked. Condition 4) follows from expression (27) established by Simpson.
The objective function
the matrix of second derivative of
and it can be seen that
by performing limit operations to find derivatives as in real analysis and using Assumption 1 and Assumption 2. Therefore, we have the following equality in distribution using condition 4) of Theorem 3 and expression (27)
which is similar to the grouped case.
Now with
One might want to define the extended Cramér-Rao lower bound for simulated method estimators to be
other simulated methods, it can be interpreted as the adjustment factor when estimators are obtained via minimizing a simulated version of the objective function, with the model distribution replaced by a sample distribution from a simulated sample, instead of the original objective function; see Pakes and Pollard.
We close this section by showing the asymptotic breakdown point
see Simpson.
Let
Now with the same seed used across
but
as
With
using the inequality
The last inequality follows from the assumption that
Using
probability under the true model, which is similar to version D. The only difference is that here we have an inequality in probability. From this result, we might conclude that the SMHD estimators preserve the robustness properties of version D, and the loss of asymptotic efficiency compared to version D can be minimized if
Once the parameters are estimated, probabilities can be estimated. For situations where recursive formulas exist, Panjer's method can be used; see Chapter 9 of the book by Klugman et al.
In this section, we discuss some methods for approximating probabilities
See Butler.
The saddlepoint
If the cumulant function does not exist, an alternative method based on the characteristic function, as described by Abate and Whitt, can be used.
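A sketch of this inversion idea in the spirit of Abate and Whitt (the helper below is illustrative, not from the paper): since the pgf evaluated on the unit circle is a Fourier series in the pmf, the probabilities $p_k = \frac{1}{2\pi}\int_0^{2\pi} P(e^{it}) e^{-ikt}\,dt$ can be recovered with the FFT by evaluating the pgf at roots of unity, up to negligible aliasing when the tail mass beyond the transform length is small.

```python
import numpy as np

def pmf_from_pgf(pgf, m=1024):
    """Recover p_0, ..., p_{m-1} from the pgf P(s) by evaluating P at the
    m-th roots of unity and applying the discrete Fourier transform.
    The result for p_k is exact up to aliasing of the tail mass beyond m."""
    s = np.exp(2j * np.pi * np.arange(m) / m)  # points on the unit circle
    p = np.real(np.fft.fft(pgf(s))) / m        # (1/m) sum_j P(w^j) w^{-jk}
    return np.clip(p, 0.0, None)               # remove tiny negative round-off

# Check against a case with a known pmf: Poisson(3), P(s) = exp(3(s-1)).
p = pmf_from_pgf(lambda s: np.exp(3.0 * (s - 1.0)))
```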
As an example for illustration, we choose the Neyman Type A distribution with the method of moments (MM) estimators for
by Johnson et al. [
with the sample mean and variance given respectively by
For the range of parameter values, we let
efficiency used is the ratio
the mean square error of the estimator inside the parenthesis. The mean square error of an estimator
The ratio ARE can be estimated using simulated data, and the estimates are displayed in
For the Poisson model with parameter
using the ratio
model, the information matrix exists and we can check the efficiency and robustness of the SHD estimator and compare it with the ML estimator, which is the sample mean. Since there is only one parameter to estimate, we are able to fix
| | 30 | 40 | 50 | 60 | 80 | 100 |
| --- | --- | --- | --- | --- | --- | --- |
| 0.25 | 0.0032 | 0.0082 | 0.0238 | 0.0173 | 0.0074 | 0.0063 |
| 0.5 | 0.0523 | 0.0024 | 0.0148 | 0.2053 | 0.0115 | 0.0429 |
| 1 | 0.0337 | 0.0256 | 0.1253 | 0.1502 | 0.0892 | 0.0481 |
| 2 | 0.0073 | 0.0197 | 0.0393 | 0.0536 | 0.2986 | 0.0147 |
| 3 | 0.0038 | 0.0046 | 0.0020 | 0.3167 | 0.0229 | 0.0057 |
| 4 | 0.0098 | 0.0103 | 0.0117 | 0.0156 | 0.0102 | 0.0020 |
| 5 | 0.0481 | 0.1431 | 0.0062 | 0.0073 | 0.0100 | 0.0009 |
| 6 | 1.0330 | 0.0632 | 0.0145 | 0.0236 | 0.0126 | 0.0062 |
Asymptotic relative efficiency between MLE
U = 10000 for the simulated sample size from the Poisson model without slowing down the computations. It appears that, overall, the SHD estimators perform very well for the range of parameters often encountered in actuarial studies; here we observe that the asymptotic efficiencies range from 0.7 to 1.1. We also study a contaminated Poisson model (
the SMHD estimator vs the ML estimator in the presence of contamination. The sample mean loses its efficiency and becomes very biased. The results are given at the bottom of
More simulation experiments are needed to further study the performance of the SMHD estimators versus commonly used estimators across various parametric models, but we do not have the computing facilities to carry out such large-scale studies; most of the computations were carried out using only a laptop computer. So far, the simulation results confirm the theoretical asymptotic results, which show that SMHD estimators have the potential for high efficiency for parametric models with finite Fisher information matrices and that they are robust when data are contaminated; the last feature might not be shared by ML estimators.
The help received from the editorial staff of OJS, which led to an improvement in the presentation of the paper, is gratefully acknowledged.
Luong, A., Bilodeau, C. and Blier-Wong, C. (2018) Simulated Minimum Hellinger Distance Inference Methods for Count Data. Open Journal of Statistics, 8, 187-219. https://doi.org/10.4236/ojs.2018.81012
TA 1.1
In this technical appendix, we shall show that a sequence of random functions
TA 1.2
In this technical appendix, we shall show that the sequence of functions
The first two terms of the RHS of the above equation are bounded in probability as they have limiting distributions, and this implies that the third term is also bounded in probability by the Cauchy-Schwarz inequality. Now, using the conditions of Assumption 1 of Section 2.2.2 and, implicitly, the assumption that the same seed is used across different values of
and
From the above property, it is clear that
belongs to
The justifications for the ungrouped case are similar, using the same type of arguments but with the use of Theorem 7.1 given by Newey and McFadden.
In this technical appendix we shall verify the condition
hence the use of the dominated convergence theorem (DCT) is justified. Therefore,
Now if we define