Open Journal of Statistics
Vol. 1, No. 3 (2011), Article ID: 8070, 13 pages. DOI: 10.4236/ojs.2011.13021
A Kullback-Leibler Divergence for Bayesian Model Diagnostics
Chen-Pin Wang1, Malay Ghosh2
1Department of Epidemiology and Biostatistics, University of Texas Health Science Center, San Antonio, USA
2Department of Statistics, University of Florida, Gainesville, USA
E-mail: wangc3@uthscsa.edu, ghoshm@stat.ufl.edu
Received August 18, 2011; revised September 19, 2011; accepted September 29, 2011
Keywords: Kullback-Leibler Distance, Model Diagnostic, Weighted Posterior Predictive p-Value
Abstract
This paper considers a Kullback-Leibler distance (KLD) which is asymptotically equivalent to the KLD by Goutis and Robert [1] when the reference model (in comparison to a competing fitted model) is correctly specified and certain regularity conditions hold (ref. Akaike [2]). We derive the asymptotic properties of this Goutis-Robert-Akaike KLD under certain regularity conditions. We also examine the impact on these asymptotic properties when the regularity conditions are only partially satisfied. Furthermore, the connection between the Goutis-Robert-Akaike KLD and a weighted posterior predictive p-value (WPPP) is established. Finally, both the Goutis-Robert-Akaike KLD and the WPPP are applied to compare models using various simulated examples as well as two cohort studies of diabetes.
1. Introduction
Information theory provides a general framework for developing statistical techniques for model comparison (Akaike [2]; Shannon [3]; Kullback and Leibler [4]; Lindley [5]; Bernardo [6]; Schwarz [7]). The Kullback-Leibler distance (KLD) is perhaps the most commonly used information criterion for assessing model discrepancy (Akaike [2]; Kullback and Leibler [4]; Lindley [5]; Schwarz [7]). In essence, a KLD is the expected logarithm of the ratio of the probability density functions (p.d.f.s) of two models, one being a fitted model and the other being the reference model, where the expectation is taken with respect to the reference model. Thus a KLD can be viewed as a measure of the information loss in the fitted model relative to that in the reference model. KLDs that are suitable for model comparison in the Bayesian framework typically involve the integrated likelihoods of the competing models, where the integrated likelihood under each model is obtained by integrating the likelihood with respect to the prior distribution of the model parameters (e.g., Lindley [5] and Schwarz [7]). KLDs based on the ratio of integrated likelihoods, however, are challenged by the difficulty of identifying priors that are compatible under the competing models and that yield proper integrated likelihoods. As a way to overcome the challenges associated with prior elicitation in calculating a KLD under the Bayesian framework, one may consider the Bayesian estimate of the Kullback-Leibler projection by Goutis and Robert [1], henceforth the G-R KLD. More specifically, for a given reference model indexed by parameter(s), the G-R KLD is the infimum of the KLD between the likelihood under the reference model and all possible likelihoods arising from the competing fitted model. Thus, if the reference model is correctly specified, the G-R KLD is asymptotically equivalent to the KLD between the reference model and the competing fitted model evaluated at its maximum likelihood estimate (MLE) (ref. Akaike [2]). The Bayesian estimate of the G-R KLD is obtained by integrating the G-R KLD with respect to the posterior distribution of the model parameters under the reference model. First, the Bayesian estimate of the G-R KLD is clearly not subject to the drawback of an improper prior as long as the posterior under the reference model is proper. Second, the G-R KLD is suitable for comparing the predictivity of the competing models since it is calculated with respect to the posterior density of the model parameters under the reference model. However, the G-R KLD was originally developed for comparing nested generalized linear models while assuming a known true model, and its extension to general model comparison remains limited. For example, if the reference model is not correctly specified, then the G-R KLD does not necessarily reduce to the KLD between the reference model and the competing fitted model evaluated at its MLE (ref. Akaike [2]); we refer to the latter, a more tractable model discrepancy measure, as the Goutis-Robert-Akaike KLD, or G-R-A KLD.
This paper proposes to use the G-R-A KLD for assessing model discrepancy in terms of the fit of a statistic $T$ that is central to the inference or model diagnostic purpose at hand. That is, we evaluate the G-R-A KLD between the probability density function (p.d.f.) of $T$ under the reference model $M_1$ and that under the assumed model $M_2$ evaluated at its MLE (see Section 2). We investigate the (asymptotic) properties of the G-R-A KLD under certain regularity conditions as well as under the violation of some of these conditions, including the case of non-nested models. Note that unlike the G-R KLD (Goutis and Robert [1]), the G-R-A KLD considered herein requires neither that the reference model $M_1$ be the true model nor that the true model be specified. Also, while the G-R KLD has been limited to comparing nested generalized linear models, the G-R-A KLD appears more flexible for comparing nested or non-nested models broader than generalized linear models (see Sections 3 and 4). Theorem 1 shows that under certain regularity conditions, the asymptotic expression of the posterior estimator of the G-R-A KLD comprises a leading term for the model discrepancy in the mean of $T$, a term for the model discrepancy in the variance of $T$, and a constant that penalizes model complexity, plus a smaller order term. Since the two leading terms in the G-R-A KLD estimator resemble a measure that differentiates the predictability between models $M_1$ and $M_2$, it is natural to study its connection with Bayesian model discrepancy measures based on predictive statistics (Guttman [8]; Rubin [9]; Gelman et al. [10]). In particular, we consider the posterior predictive check technique using a one-sided weighted posterior predictive p-value (WPPP). The WPPP of $T$ evaluates the predictive distribution of $T$ under $M_2$ at $T^*$, where $T^*$ denotes the prediction of $T$ under $M_1$, the posteriors are derived under $M_1$, and the weight accounts for the variation of $T$ under $M_1$. Theorem 2 shows explicitly, for any $T$ satisfying certain asymptotic normality and regularity conditions, how the model discrepancy reflected by the G-R-A KLD is connected to that reflected by the WPPP. To verify the results in Theorems 1 and 2, as well as to evaluate the G-R-A KLD under partial violation of the regularity conditions, we examine the (asymptotic) properties of the G-R-A KLD and the WPPP via both simulations and real data applications motivated by two cohort studies of diabetes. These examples include comparisons between nested models as well as non-nested models.
The paper is organized as follows. Section 2 studies the G-R-A KLD for the (predictive) p.d.f.s of $T$ between two competing models. It also derives the relationship between the G-R-A KLD and the WPPP. Sections 3 and 4 evaluate model fit using both the G-R-A KLD and the WPPP for examples that meet all the regularity conditions required in Theorems 1 and 2, as well as for examples that meet only part of these conditions.
2. A Proposed Kullback-Leibler Divergence
2.1. Notations
Throughout this paper, we assume that the observations $y_1, \ldots, y_n$ are i.i.d. with some common p.d.f. indexed by parameter(s) $\theta \in \Theta$, where $\Theta$ is a closed set. Denote by $M_1$ the reference model, with p.d.f. $f(\cdot \mid \theta_1)$, and by $M_2$ the fitted model, with p.d.f. $g(\cdot \mid \theta_2)$, where $\Theta_1$ and $\Theta_2$ are the corresponding parameter spaces. Also, when a capital letter is used to denote a random variable (or an estimator), the corresponding lower-case letter denotes its realization (or an estimate). Let the likelihood under $M_1$ be the basis for deriving the posterior density $\pi(\theta_1 \mid \boldsymbol{y})$ of $\theta_1$ under model $M_1$. For a diagnostic statistic $T = T(y_1, \ldots, y_n)$, we shall write $f_T(\cdot \mid \theta_1)$ and $g_T(\cdot \mid \theta_2)$ for the p.d.f.s of $T$ given the parameters under models $M_1$ and $M_2$, respectively. Also, let $\theta_1 = (\theta_{11}, \ldots, \theta_{1l})^{\mathrm{T}}$, where $\theta_1$ is $l$-dimensional.
2.2. Definition of the Kullback-Leibler Divergence
Consider evaluating model adequacy based on the fit of a statistic $T$ that is pertinent to the inference or model diagnostics. Stemming from Goutis and Robert [1], we assess the relative fit between models using the KLD between the distribution of $T$ under $M_1$ and that under $M_2$ evaluated at $\hat{\theta}_2$, the MLE of $\theta_2$:

$d\{f_T(\cdot \mid \theta_1), g_T(\cdot \mid \hat{\theta}_2)\} = \int f_T(t \mid \theta_1) \log \frac{f_T(t \mid \theta_1)}{g_T(t \mid \hat{\theta}_2)}\, dt.$  (1)

We shall refer to (1) as the Goutis-Robert-Akaike KLD, or G-R-A KLD, since (1) is asymptotically equivalent to the KLD proposed by Goutis and Robert [1] when the reference model is the true model (ref. Akaike [2]).
In general, for each $\theta_1$, the G-R-A KLD given in (1) can be regarded as a measure of the minimal information gain in model $M_1$ over model $M_2$, since the minimum information loss under $f_T(\cdot \mid \theta_1)$ is achieved at $\hat{\theta}_2$. Note that unlike the G-R KLD (Goutis and Robert [1]), (1) by definition does not require the reference model $M_1$ to be the true model. To understand the utility of (1) in the Bayesian framework, below we derive its Bayesian estimate and the associated (asymptotic) properties under certain regularity conditions (see Sections 2.3 and 2.4). We also study the behavior of (1) when only part of the regularity conditions holds, using simulated data (Section 3) as well as two applications to diabetes studies (Section 4).
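To fix ideas, the following is a minimal numerical sketch of (1), assuming hypothetically that the sampling distributions of a scalar $T$ under $M_1$ and $M_2$ are well approximated by $N(\mu_1, \sigma_1^2/n)$ and $N(\mu_2, \sigma_2^2/n)$, in which case the Gaussian KLD has a familiar closed form; all numeric values below are illustrative only.

```python
import numpy as np

def kld_normal(mu1, var1, mu2, var2):
    # KLD between N(mu1, var1) and N(mu2, var2), expectation under the first.
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

# Hypothetical asymptotic means/sds of T under the two models (illustrative).
n = 200
mu1, sig1 = 0.0, 1.0    # under the reference model M1
mu2, sig2 = 0.1, 1.5    # under the fitted model M2, at its MLE
print(f"KLD (1) for n={n}: {kld_normal(mu1, sig1**2 / n, mu2, sig2**2 / n):.3f}")
```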
2.3. Bayesian Estimation of the Proposed KLD
Estimating (1) involves approximating the integral with respect to $t$ and estimating the unknown model parameters. To account for the uncertainty of the model parameters, we consider the following estimator:

$\hat{d}(T) = \int d\{f_T(\cdot \mid \theta_1), g_T(\cdot \mid \hat{\theta}_2)\}\, \pi(\theta_1 \mid \boldsymbol{y})\, d\theta_1,$  (2)

where the likelihood under $M_1$, as a function of $\theta_1$, is used for deriving the posterior of $\theta_1$, and $\pi(\theta_1 \mid \boldsymbol{y})$ denotes the posterior density of $\theta_1$ under model $M_1$. That is, (2) is the average discrepancy between $f_T(\cdot \mid \theta_1)$ and $g_T(\cdot \mid \hat{\theta}_2)$, each being weighted by $\pi(\theta_1 \mid \boldsymbol{y})$. We shall denote (2) by $\hat{d}(T)$. Since (2) is nonnegative for any given $T$, the closer it is to zero, the less is the information loss incurred by fitting $M_2$, instead of $M_1$, to the statistic $T$.
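As a concrete illustration of (2), the following Monte Carlo sketch assumes a hypothetical Gaussian working setup (normal data, known variances under both models, a flat prior on the mean under $M_1$) so that the posterior and the pointwise KLD in (1) have closed forms; these choices are illustrative assumptions, not prescribed by the method.

```python
import numpy as np

rng = np.random.default_rng(0)

def kld_normal(mu1, var1, mu2, var2):
    # KLD between N(mu1, var1) and N(mu2, var2), expectation under the first.
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

# Hypothetical setup: T = sample mean; M1: N(theta, 1) with theta unknown,
# M2: N(theta, 2.25) with its mean estimated by the MLE (the sample mean).
n = 200
y = rng.normal(0.0, 1.0, size=n)
theta_hat2 = y.mean()                                  # MLE of the mean under M2

# Posterior of theta under M1 with a flat prior: N(ybar, 1/n).
post_draws = rng.normal(y.mean(), np.sqrt(1.0 / n), size=5000)

# Estimator (2): average the pointwise KLD in (1) over posterior draws of theta.
d_hat = kld_normal(post_draws, 1.0 / n, theta_hat2, 2.25 / n).mean()
print(f"Posterior-averaged G-R-A KLD estimate: {d_hat:.3f}")
```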
To gain further insight into the utility of (2), we derive its asymptotic properties below. The approximation of (2) is tied to assumptions on $T$ under $M_1$ and $M_2$, which are described in Theorems 1 and 2. In what follows, let all order statements be interpreted as "almost sure" statements. Also, define $K(\theta_2) = d\{f_T(\cdot \mid \theta_1), g_T(\cdot \mid \theta_2)\}$ for $\theta_2 \in \Theta_2$, which has a unique minimum attained at $\theta_2^*$. Denote $\mu_1(\theta_1)$, $\mu_2(\theta_2)$, $\sigma_1^2(\theta_1)$, and $\sigma_2^2(\theta_2)$ for the model parameters in (A5) below, and $\hat{\mu}_1$, $\hat{\mu}_2$, $\hat{\sigma}_1^2$, and $\hat{\sigma}_2^2$ for the posterior means (or the MLEs) of $\mu_1(\theta_1)$, $\mu_2(\theta_2)$, $\sigma_1^2(\theta_1)$, and $\sigma_2^2(\theta_2)$, respectively.
Assume the following regularity conditions for Theorems 1 and 2.
(A1) For each $n$, both $f_T(t \mid \theta_1)$ and $g_T(t \mid \theta_2)$ are three times continuously differentiable in $\theta_1$ and $\theta_2$, respectively. Further, there exist neighborhoods $N_1$ and $N_2$ of $\hat{\theta}_1$ and $\theta_2^*$ and integrable functions $h_1$ and $h_2$ such that $\sup_{\theta_1 \in N_1} |\partial^k \log f_T(t \mid \theta_1)/\partial \theta_1^k| \le h_1(t)$ and $\sup_{\theta_2 \in N_2} |\partial^k \log g_T(t \mid \theta_2)/\partial \theta_2^k| \le h_2(t)$ for $k = 1, 2, 3$.
(A2) For all sufficiently large $n$, the MLEs $\hat{\theta}_1$ and $\hat{\theta}_2$ exist and lie in the interiors of $\Theta_1$ and $\Theta_2$, respectively.
(A3) $\hat{\theta}_1$ converges as $n \to \infty$, and $\hat{\theta}_2 \to \theta_2^*$ as $n \to \infty$.
(A4) The prior density of $\theta_1$ is continuously differentiable and positive in a neighborhood of the limit of $\hat{\theta}_1$.
(A5) Let $T$ be asymptotically normally distributed under both models such that

$\sqrt{n}\,\{T - \mu_1(\theta_1)\} \xrightarrow{d} N\{0, \sigma_1^2(\theta_1)\}$ under $M_1$,  (3)

and

$\sqrt{n}\,\{T - \mu_2(\theta_2)\} \xrightarrow{d} N\{0, \sigma_2^2(\theta_2)\}$ under $M_2$.  (4)
Theorem 1. Under (A1)-(A5),

$\hat{d}(T) = \frac{n\,(\hat{\mu}_1 - \hat{\mu}_2)^2}{2\hat{\sigma}_2^2}\,\{1 + o(1)\}$  (5)

when $\hat{\mu}_1 \ne \hat{\mu}_2$, and

$\hat{d}(T) = \frac{1}{2}\left\{\frac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2} - \log\frac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2} - 1\right\} + c_l + o(1)$  (6)

when $\hat{\mu}_1 = \hat{\mu}_2$ but $\hat{\sigma}_1^2 \ne \hat{\sigma}_2^2$, where $c_l$ is a constant that penalizes model complexity through the dimension $l$ of $\theta_1$.
The proof of Theorem 1 is given in Appendix 1. Since model comparison in real applications can rely on the relative fit of a multi-dimensional statistic, it is useful to study whether the results in Theorem 1 apply to the multivariate case with a fixed dimension. Suppose that $T$ is a $p$-dimensional statistic ($p$ fixed and independent of $n$) with $\sqrt{n}\,(T - \mu_1) \xrightarrow{d} N_p(0, \Sigma_1)$ under $M_1$ and $\sqrt{n}\,(T - \mu_2) \xrightarrow{d} N_p(0, \Sigma_2)$ under $M_2$. Then, following the derivation given in Appendix 1, it can be shown that $\hat{d}(T)$ grows at the rate $n$ when $\hat{\mu}_1 \ne \hat{\mu}_2$, and converges to a positive constant determined by $\hat{\Sigma}_1$ and $\hat{\Sigma}_2$ when $\hat{\mu}_1 = \hat{\mu}_2$ and $\hat{\Sigma}_1 \ne \hat{\Sigma}_2$.
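The $p$-dimensional extension rests on the standard KLD between two multivariate normal laws; under the Gaussian limits above, the pointwise discrepancy in (1) behaves like

$d\{N_p(\mu_1, \Sigma_1/n),\, N_p(\mu_2, \Sigma_2/n)\} = \frac{1}{2}\left\{\operatorname{tr}(\Sigma_2^{-1}\Sigma_1) - p - \log\frac{\det \Sigma_1}{\det \Sigma_2}\right\} + \frac{n}{2}\,(\mu_1 - \mu_2)^{\mathrm{T}} \Sigma_2^{-1} (\mu_1 - \mu_2),$

so the mean discrepancy enters at order $n$ while the variance discrepancy enters at order one, mirroring (5) and (6).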
The result above implies the importance of choosing $T$ so as to yield an effective model diagnostic based on $\hat{d}(T)$. For example, $T$ is typically chosen to be the sufficient statistic for $\theta_1$ since it contains all the information about $\theta_1$. Yet such a $T$ clearly is not the best diagnostic statistic if the estimates of the mean of $T$ are unbiased under both models $M_1$ and $M_2$, so that $\hat{\mu}_1 - \hat{\mu}_2 \to 0$ as $n \to \infty$. Also, note that both the leading term in (5) and the limit in (6) can be viewed as a discrepancy between $M_1$ and $M_2$ in terms of their posterior predictivity of $T$. We show next how $\hat{d}(T)$ is related to a weighted posterior predictive p-value, a typical approach for assessing model discrepancy regarding the predictivity of $T$ in the Bayesian framework (see Rubin [9]; Gelman et al. [10]).
2.4. KLD vs Posterior Predictive p-Value
Consider

$\mathrm{WPPP} = \int\!\!\int \left\{\int_{-\infty}^{t^*} g_T(t \mid \hat{\theta}_2)\, dt\right\} f_T(t^* \mid \theta_1)\, \pi(\theta_1 \mid \boldsymbol{y})\, dt^*\, d\theta_1,$  (7)

where $f_T$ and $g_T$ are the density functions of $T$ under $M_1$ and $M_2$, respectively. We shall call (7) a one-sided weighted posterior predictive p-value (or WPPP) with respect to model $M_2$, since it is the mean predictive p-value of $T$ under $M_2$ over all possible posterior predicted values of $T$ arising from $M_1$. Thus the WPPP can be viewed as a model discrepancy measure since it evaluates the predictivity of $T$ under $M_2$. Note that the WPPP proposed here differs from the posterior predictive p-value originally proposed in Rubin [9]: the WPPP calculates the predictive p-value of $T$ under an assumed model $M_2$ should $T$ originate from the reference model $M_1$, while Rubin's original proposal [9] assesses the predictive p-value of $T$ under the assumed model alone. The WPPP is considered here since it is more coherent with the nature of the G-R-A KLD as a discrepancy measure between models $M_1$ and $M_2$.
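As an illustration, (7) can be approximated by simple Monte Carlo: draw $\theta_1$ from its posterior under $M_1$, draw the prediction $T^*$ from $f_T(\cdot \mid \theta_1)$, and average the one-sided predictive probability under $M_2$. The sketch below reuses the hypothetical Gaussian setup introduced earlier; all modeling choices are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Hypothetical Gaussian setup as before: M1: N(theta, 1), M2: N(theta, 2.25),
# T = sample mean, flat prior on theta under M1.
n = 200
y = rng.normal(0.0, 1.0, size=n)
theta_hat2 = y.mean()                                    # MLE of the mean under M2

B = 20000
theta1 = rng.normal(y.mean(), np.sqrt(1.0 / n), size=B)  # posterior draws under M1
t_star = rng.normal(theta1, np.sqrt(1.0 / n))            # predictions of T under M1

# (7): average the one-sided predictive probability Pr(T <= t* | theta_hat2, M2)
# over the posterior-weighted predictions t*.
wppp = norm.cdf(t_star, loc=theta_hat2, scale=np.sqrt(2.25 / n)).mean()
print(f"WPPP: {wppp:.3f}  (close to 0.5 here, since the two means agree)")
```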
The connection between $\hat{d}(T)$ and the WPPP is derived in Theorem 2 below. Here we assume the regularity conditions (A1)-(A5) of Theorem 1.
Theorem 2.

$\Phi^{-1}(\mathrm{WPPP}) = \sqrt{n}\,\frac{\hat{\mu}_1 - \hat{\mu}_2}{\sqrt{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}}\,\{1 + o(1)\}$  (8)

when $\hat{\mu}_1 \ne \hat{\mu}_2$; and

$\mathrm{WPPP} = \frac{1}{2} + o(1)$  (9)

and

$\hat{d}(T) = \frac{1}{2}\left\{\frac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2} - \log\frac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2} - 1\right\} + c_l + o(1)$  (10)

when $\hat{\mu}_1 = \hat{\mu}_2$ but $\hat{\sigma}_1^2 \ne \hat{\sigma}_2^2$. Here $\Phi$ denotes the standard normal c.d.f.
The proof of the theorem is given in Appendix 2.
Remark. Theorem 2 explicitly shows the asymptotic relationship between $\hat{d}(T)$ and the WPPP. Suppose that $\hat{\mu}_1 \ne \hat{\mu}_2$ (i.e., the estimate of the mean of $T$ differs between $M_1$ and $M_2$). Then both $\hat{d}(T)$ and $\{\Phi^{-1}(\mathrm{WPPP})\}^2$ are of order $O(n)$. Suppose $\hat{\mu}_1 = \hat{\mu}_2$ but $\hat{\sigma}_1^2 \ne \hat{\sigma}_2^2$ (i.e., the mean of $T$ is the same under both $M_1$ and $M_2$, but the variance of $T$ differs between $M_1$ and $M_2$). Then $\{\Phi^{-1}(\mathrm{WPPP})\}^2$ converges to 0, while $\hat{d}(T)$ converges to a positive quantity of order $O(1)$.
3. Illustrative Examples
This section demonstrates the utility of $\hat{d}(T)$ through two sets of examples. Examples 3.1-3.6 are simulated examples that confirm the results proved in Theorems 1 and 2. All six of these examples meet the required regularity conditions, while Examples 3.3, 3.5, and 3.6 involve comparing non-nested models. Example 3.7 studies the departure from Theorems 1 and 2 due to violation of the regularity condition (A5). We let the diagnostic statistic be $T = \bar{y}$, the sample mean, in all calculations below.
Example 3.1 (Nested). Assume that the $y_i$ are i.i.d. normal, and let $T = \bar{y}$. Suppose that $M_1$ and $M_2$ assume the same mean structure but different known variances, so that $\hat{\mu}_1 = \hat{\mu}_2$ while $\hat{\sigma}_1^2 \ne \hat{\sigma}_2^2$. Then $\hat{d}(T)$ converges in probability to a positive quantity, which indicates the discrepancy between $M_1$ and $M_2$; however, the magnitude of the model discrepancy assessed by $\hat{d}(T)$ depends on the variance ratio $\hat{\sigma}_1^2/\hat{\sigma}_2^2$. In contrast, the WPPP converges to 0.5, indicating no difference between $M_1$ and $M_2$ in the mean of $T$. Thus the results are consistent with Theorems 1 and 2, as illustrated in the sketch below.
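The following sketch illustrates this contrast under one hypothetical parametrization of Example 3.1 (equal means, variance inflated under $M_2$): the posterior-averaged KLD settles near a positive constant driven by the variance ratio, while the WPPP stays near 0.5 as $n$ grows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def kld_normal(mu1, var1, mu2, var2):
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

# Hypothetical parametrization of Example 3.1: data N(0,1); M1 assumes the
# correct variance 1, M2 an inflated known variance 2; T = sample mean.
for n in (50, 500, 5000):
    y = rng.normal(0.0, 1.0, size=n)
    mu_hat = y.mean()
    theta1 = rng.normal(mu_hat, np.sqrt(1.0 / n), size=10000)  # posterior under M1
    d_hat = kld_normal(theta1, 1.0 / n, mu_hat, 2.0 / n).mean()
    t_star = rng.normal(theta1, np.sqrt(1.0 / n))
    wppp = norm.cdf(t_star, loc=mu_hat, scale=np.sqrt(2.0 / n)).mean()
    print(f"n={n:5d}  KLD={d_hat:.3f}  WPPP={wppp:.3f}")
# The KLD stays near 0.5*log(2), about 0.347, while the WPPP stays near 0.5.
```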
Example 3.2 (Nested). Assume that the $y_i$ are i.i.d. bivariate normal with mean vector $\mu$ and $2 \times 2$ covariance matrix $\Sigma$, and let $T$ be the bivariate sample mean. Suppose that $M_1$ and $M_2$ are bivariate normal models whose mean vectors differ and whose $2 \times 2$ covariance matrices $\Sigma_1$ and $\Sigma_2$ are known. Then $\hat{\mu}_1 \ne \hat{\mu}_2$, and it can be shown, following the arguments given in Appendix 2, that both $\hat{d}(T)$ and the square of the probit-transformed WPPP, $\{\Phi^{-1}(\mathrm{WPPP})\}^2$, indicate the discrepancy between $M_1$ and $M_2$ by an $O(n)$ term.
Example 3.3 (Non-nested). Assume that the $y_i$ are i.i.d. with a common mean, and let $T = \bar{y}$. Consider fitting the data by two non-nested models $M_1$ and $M_2$ under which $T$ has the same asymptotic mean but different asymptotic variances. In this case, $T$ has the same statistical meaning under both $M_1$ and $M_2$ since $\hat{\mu}_1 = \hat{\mu}_2$; yet the substantive meaning of $T$ under $M_1$ differs from that under $M_2$. It can be shown that $\hat{\sigma}_1^2 \ne \hat{\sigma}_2^2$, and hence, consistent with Theorem 2, $\hat{d}(T)$ converges in probability to a positive quantity, and the WPPP converges in probability to 0.5 (since $\hat{\mu}_1 = \hat{\mu}_2$).
Examples 3.4-3.6 below consider situations where $T$ is an unbiased estimate of the parameter of interest under both $M_1$ and $M_2$; however, $T$ is the MLE of that parameter under $M_1$ but not under $M_2$. In these examples, since the MLEs of the mean of $T$ under both $M_1$ and $M_2$ converge to the same limit in probability, $\hat{d}(T)$ reduces to the variance-discrepancy form (6).
Example 3.4 (Nested). As an extension of Example 3.1, consider normal data with group-specific variances, where $M_1$ allows the variances to differ across groups while $M_2$ imposes a common variance. Let $T = \bar{y}$. Here $T$ is the MLE of the mean under $M_1$, while the MLE of the mean under $M_2$ is a weighted mean whose weights involve the MLEs of the variance parameters under $M_2$; the two estimators nevertheless converge to the same limit. We conducted numerical calculations for several combinations of the group variances, mixing proportions, and sample sizes. Since the 95% percentiles of the estimated variance ratios do not exceed 1 under all situations (hence $\hat{\sigma}_1^2 \ne \hat{\sigma}_2^2$), in theory the KLD should be able to distinguish $M_1$ from $M_2$. However, the numerical values of the KLD deviate from 0 only slightly across these combinations. Thus, in practice, the KLD can hardly be differentiated from 0, while the WPPP converges to 0.5 in probability.
Example 3.5 (Non-nested). Assume that the $y_i$ are i.i.d., and let $T = \bar{y}$. Suppose that the data are fitted by two non-nested models $M_1$ and $M_2$ that share the same mean but differ in their dispersion parameters. Then $\hat{\mu}_1 = \hat{\mu}_2$, while the difference between $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$ converges to 0 with probability 1 if and only if the two dispersion parameters coincide. Since $\hat{d}(T)$ remains positive whenever the dispersion parameters differ, $\hat{d}(T)$ can detect model discrepancy in the dispersion parameter. In contrast, the WPPP converges to 0.5 in probability and hence is not sensitive to the discrepancy in the dispersion parameter.
Example 3.6 (Non-nested). Assume that the $y_i$ are i.i.d. counts whose empirical variance is greater than 1. Consider fitting the data by one of two non-nested models, $M_1$ and $M_2$, that accommodate such overdispersion in different ways, and let $T = \bar{y}$. Since the estimators of the mean under both models are unbiased and their difference converges to 0 in probability, $\hat{d}(T)$ reduces to the variance-discrepancy form (6), but its limit cannot be obtained in closed form. Thus $\hat{d}(T)$ is evaluated numerically using the eight combinations of parameter values given in Table 1. The results suggest that the degree to which $\hat{d}(T)$ can distinguish $M_1$ from $M_2$ varies by situation: the closer the dispersion-related parameters are to 0, the closer $\hat{d}(T)$ is to 0. In contrast, the WPPP converges to 0.5 in probability (see Table 1).
Example 3.7 below shows that the asymptotic relationship between $\hat{d}(T)$ and the WPPP does not hold in the sense of Theorem 2 when the asymptotic normality assumption specified in (3) and (4) is violated.
Table 1. Simulation results for Example 3.6.
Example 3.7 (Nested). Consider a setting in which the sampling distribution of $T$ is not asymptotically normal under the competing models, so that assumption (A5) fails. Suppose that the data are fitted by nested models $M_1$ and $M_2$ whose fitted distributions of $T$ differ. Then it can be shown that both $\hat{d}(T)$ and the WPPP can differentiate $M_1$ from $M_2$ by a term that diverges with $n$, despite the fact that the asymptotic relationship between $\hat{d}(T)$ and the WPPP does not hold in the sense of Theorem 2, owing to the violation of the asymptotic normality assumption.
4. Application to Diabetes Studies
In this section, we apply both $\hat{d}(T)$ and the WPPP to compare non-nested models in two studies of diabetes. In these applications, we assess the model fit to the entire dataset because of its clinical relevance. Here the selected diagnostic statistic is multivariate, with dimension growing with the number of observations, and it does not meet all the regularity conditions specified earlier.
4.1. Study I: Analysis of Change in Glucose in Veterans with Type 2 Diabetes
Study I originated from a clinical cohort of 507 veterans with type 2 diabetes who had poor glucose control (indicated by glycosylated hemoglobin A1c, or HbA1c, greater than 6.5) at baseline (fiscal year 1999), and who were all treated with metformin as the sole oral glucose-lowering agent. As the literature suggests that the glucose-lowering response to metformin may vary by an individual's obesity status, the goal of our study was to compare models that assessed whether obesity was associated with the net change in glucose level between baseline and the end of the 5-year follow-up. In this study, the empirical mean of the net change in HbA1c over the 5-year period was similar between the obese and non-obese groups (–0.498 vs. –0.379), yet the empirical variance was greater in the obese group (1.207 vs. 0.865). Also, the distribution of the HbA1c change was reasonably symmetric. Thus we considered two candidate models for fitting the HbA1c change (note that the overall empirical variance of the HbA1c change is 1.03): $M_1$, a mixture of two normal distributions with mixing proportion given by the observed percentage in the obesity group, vs. $M_2$, a t-distribution. The calculated $\hat{d}(T)$ was 5.96, suggesting that model $M_1$ provided a modestly better fit to the data than model $M_2$. This result was also consistent with Figures 1 and 2, which contrast the empirical quantiles with the predicted quantiles under $M_1$ and $M_2$. Note that the estimators of the mean HbA1c change are unbiased under both models; thus the model discrepancy assessed by $\hat{d}(T)$ is primarily attributable to the difference in the variance assumptions between $M_1$ and $M_2$ (as evident in Figures 1 and 2). In contrast, WPPP = 0.522 suggested that the overall fit was similar between the two models, since the estimated net change in HbA1c was similar between them.
Figure 1. Quantile plot of the HbA1c change for T2DM participants with obesity in the VA cohort (open triangles: quantile estimates based on the model assuming a mixture of normal distributions; solid circles: quantile estimates based on the model assuming a t-distribution; solid line: empirical quantiles).
Figure 2. Quantile plot of the HbA1c change for T2DM participants without obesity in the VA cohort (open triangles: quantile estimates based on the model assuming a mixture of normal distributions; solid circles: quantile estimates based on the model assuming a t-distribution; solid line: empirical quantiles).
4.2. Study II: Analysis of the Functioning Score in Older Adults with Diabetes
Study II arose from the subset of 119 participants with diabetes in the San Antonio Longitudinal Study of Aging (SALSA), a community-based study of the disablement process in Mexican American and European American older adults. Details of the SALSA study design, sampling approach, recruitment, and field procedures have been described previously (Hazuda et al. [11]). The goal of our analyses was to compare models that assessed whether glucose control trajectory class (poorer vs. better) was associated with the lower-extremity physical functional limitation score during the first follow-up period. The lower-extremity physical functional limitation score was measured by the Short Physical Performance Battery (SPPB), a well-established, validated measure of physical functioning. The SPPB score is the sum of three items: 8-foot walk time, repeated chair stands, and balance, each scored on a 5-point Likert scale of 0 - 4. Hence the SPPB score is discrete in nature with a range of 0 - 12, with higher scores indicating better performance and less functional limitation. Exploratory data analyses suggested that the empirical variance of the SPPB (15.60 vs. 14.33) was greater than the mean (7.23 vs. 8.02) in both glucose control classes. Also, due to the left-skewness of the distribution of the SPPB, we considered models fit to the reversed SPPB score. Given the nature of the SPPB distribution, we compared two candidate models: $M_1$, a negative binomial model, vs. $M_2$, a Poisson model.
The calculated $\hat{d}(T)$ is 32.63, suggesting that $M_1$ has a much better fit than $M_2$, which is coherent with the quantile plots shown in Figures 3 and 4. Since both $M_1$ and $M_2$ yielded similar estimates of the mean of the reversed SPPB, the model discrepancy assessed by $\hat{d}(T)$ can primarily be attributed to the difference in variance estimation between $M_1$ and $M_2$ (as evident in Figures 3 and 4). In contrast, WPPP = 0.539 suggested a similar fit between $M_1$ and $M_2$, as expected given the similar mean estimates under both models.
Figure 3. Quantile plot of the SPPB for T2DM participants with good glycemic control in SALSA (open triangles: quantile estimates based on the negative binomial model; solid circles: quantile estimates based on the Poisson model; solid line: empirical quantiles).
Figure 4. Quantile plot of the SPPB for T2DM participants with poor glycemic control in SALSA (open triangles: quantile estimates based on the negative binomial model; solid circles: quantile estimates based on the Poisson model; solid line: empirical quantiles).
5. Discussion
This paper considers the G-R-A KLD as given in (2). This KLD is appropriate for quantifying the information discrepancy regarding $T$ contained in the competing models $M_1$ and $M_2$. We derive the asymptotic property of the G-R-A KLD in Theorem 1, and its relationship to a weighted posterior predictive p-value (WPPP) in Theorem 2. However, our results would need further refinement when the normality assumptions given in (3) and (4) are not suitable (see Example 3.7). As shown in Section 4, model comparison in medical research may rely on the fit of a multidimensional statistic. Although the results in Theorem 1 hold for a multivariate $T$ with a fixed dimension, further investigation is needed to assess the property of our proposed KLD for situations where the dimension of $T$ increases with $n$. The KLD proposed herein usually provides the relative fit between competing models. Thus, for the purpose of assessing model adequacy (rather than relative model fit), a KLD should be used in conjunction with absolute model departure indices such as posterior predictive p-values or residuals. Nevertheless, a KLD can be a measure of the absolute fit of model $M_2$ when the superior model $M_1$ is the true model, and therefore can be used for checking model adequacy.
6. Acknowledgements
Wang's research is partially supported by NIDDK K25-DK075092; Ghosh's research is partially supported by NSF. We thank Dr. Hazuda for providing data from the SALSA study supported by NIA R01-AG10444 and NIA R01-AG16518.
7. References
[1] C. Goutis and C. P. Robert, “Model Choice in Generalised Linear Models: A Bayesian Approach via KullbackLeibler Projections,” Biometrika, Vol. 85, No. 1, 1998, pp. 29-37. doi:10.1093/biomet/85.1.29
[2] H. Akaike, "A New Look at the Statistical Model Identification," IEEE Transactions on Automatic Control, Vol. 19, No. 6, 1974, pp. 716-723. doi:10.1109/TAC.1974.1100705
[3] C. E. Shannon, “A Mathematical Theory of Communication,” Bell System Technical Journal, Vol. 27, 1948, pp. 379-423 and pp. 623-656.
[4] S. Kullback and R. A. Leibler, “On Information and Sufficiency,” The Annals of Mathematical Statistics, Vol. 22, No. 1, 1951, pp. 79-86. doi:10.1214/aoms/1177729694
[5] D. V. Lindley, “On a Measure of the Information Provided by an Experiment,” The Annals of Mathematical Statistics, Vol. 27, No. 4, 1956, pp. 986-1005. doi:10.1214/aoms/1177728069
[6] J. M. Bernardo, “Expected Information as Expected Utility,” The Annals of Statistics, Vol. 7, No. 3, 1979, pp. 686-690. doi:10.1214/aos/1176344689
[7] G. Schwarz, “Estimating the Dimension of a Model,” The Annals of Statistics, Vol. 6, No. 2, 1978, pp. 461-464. doi:10.1214/aos/1176344136
[8] I. Guttman, “The Use of the Concept of a Future Observation in Goodness-of-Fit Problems,” Journal of the Royal Statistical Society B, Vol. 29, No. 1, 1967, pp. 83-100.
[9] D. B. Rubin, "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, Vol. 12, No. 4, 1984, pp. 1151-1172. doi:10.1214/aos/1176346785
[10] A. Gelman, J. Carlin, H. S. Stern and D. Rubin, “Bayesian Data Analysis,” Chapman and Hall, London, 1996.
[11] H. P. Hazuda, S. M. Haffner, M. P. Stern and C. W. Eifler, “Effects of Acculturation and Socioeconomic Status on Obesity and Diabetes in Mexican Americans: The San Antonio Heart Study,” American Journal of Epidemiology, Vol. 128, No. 6, 1988, pp. 1289-1301.
[12] S. Ghosal and T. Samanta, “Expansion of Bayes Risk for Entropy Loss and Reference Prior in Nonregular Cases,” Statistics and Decisions, Vol. 15, 1997, pp. 129-140.
[13] I. Ibragimov and R. Hasminskii, "Statistical Estimation: Asymptotic Theory," Springer-Verlag, New York, 1980.
Appendix 1. Proof of Theorem 1
Under (3) and (4), it can be shown that the p.d.f.s of $T$ under $M_1$ and $M_2$ are approximated by those of $N\{\mu_1(\theta_1), \sigma_1^2(\theta_1)/n\}$ and $N\{\mu_2(\hat{\theta}_2), \sigma_2^2(\hat{\theta}_2)/n\}$, respectively. Applying an argument similar to that in Ghosal and Samanta [12] (see also Ibragimov and Hasminskii [13]), conditions (A1)-(A5) give

$\hat{d}(T) = \frac{n\,(\hat{\mu}_1 - \hat{\mu}_2)^2}{2\hat{\sigma}_2^2} + \frac{1}{2}\left\{\frac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2} - \log\frac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2} - 1\right\} + c_l + o(1).$  (11)

When $\hat{\mu}_1 \ne \hat{\mu}_2$, $n(\hat{\mu}_1 - \hat{\mu}_2)^2/(2\hat{\sigma}_2^2)$ is the dominating term in (11). Thus, by (A1)-(A5) and arguments similar to those in Ghosal and Samanta [12], one gets

$\hat{d}(T) = \frac{n\,(\hat{\mu}_1 - \hat{\mu}_2)^2}{2\hat{\sigma}_2^2}\,\{1 + o(1)\}.$  (12)

If $\hat{\mu}_1 = \hat{\mu}_2$ but $\hat{\sigma}_1^2 \ne \hat{\sigma}_2^2$, then (11) becomes

$\hat{d}(T) = \frac{1}{2}\left\{\frac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2} - \log\frac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2} - 1\right\} + c_l + o(1).$  (13)

Therefore, (5) follows when $\hat{\mu}_1 \ne \hat{\mu}_2$, and (6) follows when $\hat{\mu}_1 = \hat{\mu}_2$ and $\hat{\sigma}_1^2 \ne \hat{\sigma}_2^2$.
Appendix 2. Proof of Theorem 2
Write $t^*$ for the realization of $T^*$, the posterior prediction of $T$ under $M_1$, which by (3) is asymptotically a $N\{\mu_1(\theta_1), \sigma_1^2(\theta_1)/n\}$ random variable. Then

$\mathrm{WPPP} = \int\!\!\int \Pr(T \le t^* \mid \hat{\theta}_2, M_2)\, f_T(t^* \mid \theta_1)\, \pi(\theta_1 \mid \boldsymbol{y})\, dt^*\, d\theta_1 = \Pr(Z \le T^*)\,\{1 + o(1)\},$

where the second equality follows from (A1)-(A5) and an argument similar to that used in Ghosal and Samanta [12], and $Z$ is a $N(\hat{\mu}_2, \hat{\sigma}_2^2/n)$ random variable distributed independently of $T^*$. Therefore, applying arguments similar to those in Ghosal and Samanta [12] under (A1)-(A5) yields

$\mathrm{WPPP} = \Phi\!\left(\sqrt{n}\,\frac{\hat{\mu}_1 - \hat{\mu}_2}{\sqrt{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}}\right) + o(1)$  (14)

when $\hat{\mu}_1 \ne \hat{\mu}_2$; and

$\mathrm{WPPP} = \frac{1}{2} + o(1)$  (15)

when $\hat{\mu}_1 = \hat{\mu}_2$. This completes the proof.