The conventional form of statistical simulation proceeds by selecting a few models and generating hundreds or thousands of data sets from each model. This article investigates a different approach, called BayesSim, that generates hundreds or thousands of models from a prior distribution, but only one (or a few) data sets from each model. Suppose that the performance of estimators in a parametric model is of interest. Smoothing methods can be applied to BayesSim output to investigate how estimation error varies as a function of the parameters. In this way inferences about the relative merits of the estimators can be made over essentially the entire parameter space, as opposed to a few parameter configurations as in the conventional approach. Two examples illustrate the methodology: One involving the skew-normal distribution and the other nonparametric goodness-of-fit tests.
When investigating statistical methodology using simulation, the following strategy is almost always used. Choose several completely specified statistical models, generate hundreds or thousands of data sets from each model, apply the methodology to every data set, and then summarize the results. Usually some effort is made to choose models that represent a variety of model types, although this is often hard to do when only four or five models are considered. Having selected models, one hopes to show that one method is better than another for most if not all the models considered. However, even when Method A works better than Method B in all cases considered, there is a nagging doubt that one has failed to look at all the relevant models, and there might still be cases where B is better than A.
In this article a different simulation strategy is advocated. The idea is to generate one data set (or at most a few data sets) from each of hundreds or thousands of models that are randomly selected from a prior. We call such an approach BayesSim. Given an appropriate loss function, one may generate BayesSim output to estimate the Bayes risk of each method of interest and then use these estimates as at least part of a basis for choosing amongst methods. Such an approach was used in a simulation study of [
The basic idea considered in this article is not new. It was proposed by [
・ A main purpose of the current article is to bring to the attention of statisticians methodology that seems to be unfamiliar to them, but which nonetheless could make their simulations more efficient and informative.
・ The articles of [
・ The work of [
・ The one idea that may be “new” in the current article is the idea of using nonparametric smoothing to analyze BayesSim output. If one opposes the notion of using Bayes principle (averaging all results with respect to a prior) to compare various methods, one may use smoothing in conjunction with BayesSim to compare methodology in the same “local” way that one does in a conventional simulation, but with the advantage that broader conclusions may be drawn.
There are two main considerations in the motivation for BayesSim: simulation efficiency and diversity of models considered. Generating thousands of data sets for each of two or three different models often seems like overkill for those few cases, and is unsatisfying from the standpoint of model diversity. In other words, one learns a lot about very little. In the author’s opinion a much better under- standing of the general performance of a method can be obtained with BayesSim, and this can be done without generating any more data sets than in the trad- itional approach. This notion can be easily quantified. Suppose one uses k parameter configurations in a conventional approach and generates n obser- vations from each configuration. In BayesSim one may generate the same number of data sets by generating n k parameter values from a prior, and then one data set from each of the n k parameter values. Now, let R ( θ ) be the mean squared error of an estimator when the model parameter is θ . One may compare the conventional and BayesSim approaches by asking how efficiently each one can estimate R ( ⋅ ) using its information.
Another reason that BayesSim seems like a good idea is that it has the potential of actually simulating the average performance of a method that is used at a variety of times and places throughout the world. Arguably this should be the goal of a simulation study that is trying to make a case for a new method. If the prior used in BayesSim well-approximates the empirical distribution of models or parameter values encountered in practice, then BayesSim will achieve this goal. This suggests a new avenue of research: studying the empirical dis- tributions of various types of parameters, perhaps by the use of meta analyses.
It is worth remarking that the proposed methodology is not Bayesian per se. Nonetheless, the name BayesSim seems appropriate on two counts. First of all, a good choice of prior distribution in BayesSim is based on the same con- siderations used in choosing an objective prior in a Bayesian analysis. Secondly, we advocate the use of Bayes principle, i.e., averaging results with respect to the prior, as a means of comparing different methods.
A subtheme of the article is the contention that statisticians seldom use the sophisticated methods they develop to analyze simulation data. In the author’s opinion the simulations that are an ever-present part of methodological papers would be more interesting and informative if they employed more of the vast array of new statistical methods. For example, in comparing estimation methods for multiparameter models, one could estimate performance measures as a function of the parameter vector by using additive nonparametric regression methods.
Before giving a more detailed description of our method we present a toy example that illustrates the potential advantages of BayesSim. The example involves binomial experiments, whose statistical properties are, of course, well understood. Imagine for the moment, though, that nothing is known about those properties, and we wish to use simulation to learn about, say, the variability of sample proportions. Let θ ^ be the sample proportion in a binomial experiment with n = 100 and success probability θ .
Results of a “traditional” simulation are shown in
θ have been chosen: ( θ 1 , ⋯ , θ 5 ) = ( 0.06 , 0.28 , 0.50 , 0.72 , 0.94 ) . Five hundred binomial experiments are performed at each θ j , and
R ( θ ) = E [ ( θ ^ − θ ) 2 ] = θ ( 1 − θ ) / 100 .
One could form an estimate of the mean squared error curve, R ( θ ) , using the information from the traditional simulation. A possible approach would be to use a Fourier series estimate of the form
R ^ T ( θ ) = ϕ ^ 0 + 2 ∑ j = 1 4 ϕ ^ j cos ( π j θ ) , 0 ≤ θ ≤ 1 , (1)
where
ϕ ^ j = 0.22 ∑ k = 1 5 Y ¯ k cos ( π j θ k ) , k = 0 , ⋯ , 4 , (2)
and Y ¯ j is the average of the 500 mean squared errors at θ j , j = 1 , ⋯ , 5 .
It is reasonable to define the efficiency of an estimator R ^ by its mean integrated square error:
Eff ( R ^ ) = E [ ∫ 0 1 ( R ^ ( θ ) − R ( θ ) ) 2 d θ ] . (3)
Doing so is analogous to the well-known practice of using information to measure the efficiency of an experimental design.
We close this section with the following remarks.
・ The results in
・ The Fourier series estimate in our toy example actually works quite well in spite of the fact that only five settings of θ were considered. This is due to the very smooth nature of R ( θ ) in this case. However, one should re-
Method | |||
---|---|---|---|
Traditional | 15.03 | 8.40 | 23.43 |
BayesSim | 14.58 | 5.95 | 20.53 |
member that this is a toy example. If the functional form of R ( θ ) were more complicated and/or there were multiple parameters, the advantage of BayesSim in conjunction with smoothing would be more pronounced.
It is important to clarify what will be meant by our use of the term “model.” Occasionally we will use model in the sense of “parametric model,” meaning a collection of probability density or mass functions indexed by a parameter. More often, however, model will mean a completely specified probability distribution from which data might be generated. In this sense, a standard normal dis- tribution could be a model, if that were the density that generated independent observations. This use of model makes it easier to describe our most general methodology, wherein each “model” may be definable only in terms of infinitely many parameters, or where one might be considering several different para- metric families all at once. In the latter case, one might employ BayesSim by first randomly choosing a parametric family, and then randomly generating a para- meter value for the chosen family.
Consider how a specific statistical methodology is often used in the real world. Over a certain period of time, N researchers working in different parts of the world and in different subject matter areas, apply this methodology. Let us assume that the true model from which researcher i obtains his or her data is M i , i = 1 , ⋯ , N . Now, let L i be the loss experienced by researcher i upon applying the method in question. Then the efficacy of the method could be measured by the distribution of all Lis, or the average loss N − 1 ∑ i = 1 n L i . This way of evaluating methodology is precisely what BayesSim tries to simulate. I submit that it would rarely happen that any of M 1 , ⋯ , M N would be the same, nor is it common that any researcher would obtain multiple data sets from the same model. For this reason, comparing the efficacy of methods by generating thousands of data sets from the same few models does not seem quite right. The point is that variation due to differences in models is a real and prevalent factor in the performance of statistical methodology and is largely neglected in traditional simulations.
We describe BayesSim using decision theoretic terminology. Let Y be a data vector and suppose that Y has distribution f ( ⋅ | M ) when the true model is M . Let δ ( Y ) be some statistical rule, perhaps an estimator or a test of hypothesis. The “action” prescribed by δ when the observed data are y is δ ( y ) . For example, if δ is an estimator, then δ ( y ) is the point estimate for data set y . Now suppose that L ( M , a ) is the loss when M is the true model and action a is taken. The frequentist risk, or simply risk, of δ when M is the true model is
R ( M , δ ) = E M L ( M , δ ( Y ) ) , (4)
where E M is expectation with respect to the distribution f ( ⋅ | M ) .
The usual simulation strategy is to try to obtain a very good estimate of R ( M , δ ) for a few models M and each rule δ of interest. In estimation problems a typical choice for L is squared error, while in testing L is usually 0 - 1 loss, leading to power and Type I error probability as the criteria for comparing tests.
Now suppose that M is a collection of models, and let π be a probability measure over M . Then the Bayes risk of δ with respect to π is
r ( δ , π ) = ∫ M R ( M , δ ) d π ( M ) . (5)
Obviously, r ( δ , π ) is a weighted average of the risks R ( M , δ ) that are usually considered individually in deciding on a good statistical rule. Bayes principle consists of choosing a rule to minimize Bayes risk, and is one of the main principles used in decision theory. Possibilities for M include parametric families, nonparametric, or infinite dimensional, families, and a collection of the form M = M 1 ∪ ⋯ ∪ M m , where each of M i , i = 1, ⋯ , m , is a parametric family.
In a simulation setting where M and π are chosen by the investigator, a consistent estimator of the Bayes risk of a rule δ may be constructed as follows. Let M 1 , ⋯ , M N be N independent draws from π . Generate a data vector y i from M i for each i = 1 , ⋯ , N , and estimate the Bayes risk of rule δ by
r ^ ( δ , π ) = 1 N ∑ i = 1 N L ( M i , δ ( y i ) ) . (6)
If N → ∞ , then under mild conditions r ^ ( δ , π ) converges in probability to r ( δ , π ) . Consistency follows from the law of large numbers and the fact that
E [ L ( M i , δ ( Y i ) ) ] = E { E [ L ( M i , δ ( Y i ) ) | M i ] } = E { R ( M i , δ ) } = r ( δ , π ) . (7)
Another factor that varies in the practical problem outlined above is sample size. A prior π n for the sample size could also be incorporated into the pro- cedure. On each replication of the simulation, one would first randomly select a model, then select a sample size n from π n , and finally generate a data set of size n . However, mostly for the sake of simplicity, we will assume a fixed sample size throughout the remainder of the article.
A result of [
r ( y ) = ∫ Θ L ( θ , δ ( y ) ) π ( θ | y ) d θ , (8)
where
π ( θ | y ) = f ( y | θ ) π ( θ ) m ( y ) and m ( y ) = ∫ Θ f ( y | θ * ) π ( θ * ) d θ * . (9)
It is easy to see that the expected value of r ( Y ) with respect to the marginal m is equal to r ( δ , π ) .
A second strategy for estimating Bayes risk is therefore:
・ Generate θ from π and Y from f ( ⋅ | θ ) .
・ Repeat the previous step N times, producing r ( y 1 ) , ⋯ , r ( y N ) .
・ Compute the estimate r ^ 1 = N − 1 ∑ i = 1 N r ( y i ) .
Both r ^ 1 and r ^ 2 = r ^ ( δ , π ) are unbiased estimators of r ( δ , π ) . However, r ^ 1 has smaller variance, assuming the density π is not degenerate. This follows from the well-known iterated expectation rule for variance, which yields
Var ( r ^ 2 ) = Var ( r ^ 1 ) + 1 N E [ { L ( θ , δ ( Y ) ) | Y } ] . (10)
In the setting of [
This efficiency result may be of limited usefulness in applications envisioned by the author. The main problem is that it will often be difficult to compute the posterior risk r ( y ) . The necessity of MCMC methods in most Bayesian data analyses is at the heart of this difficulty. However, if one knows how to generate independent samples from each posterior, then there is a fairly simple way to reduce the variability of r ^ 2 without having to compute posterior risk. For each y that is generated in BayesSim , generate a few, say M , independent values from π ( ⋅ | y ) . Call these values θ 1 ( y ) , ⋯ , θ M ( y ) . Now define the estimator
r ^ 3 = 1 N M ∑ i = 1 N ∑ j = 1 M L ( θ j ( Y i ) , Y i ) . (11)
This estimator is unbiased for r ( δ , π ) and has variance
Var ( r ^ 3 ) = Var ( r ^ 1 ) + 1 N M E [ Var { L ( θ , δ ( Y ) ) | Y } ] , (12)
which is smaller than that of r ^ 2 for any M larger than 1.
When one is more interested in using BayesSim to estimate frequentist risk R ( θ , δ ) as a function of θ , then the efficiency issue raised in this section becomes moot. Estimation of R ( θ , δ ) is the subject of Section 3.3.
One could argue that choosing Method A over Method B simply because A has smaller Bayes risk is not a good practice. After all, it is possible that A could have smaller Bayes risk than B, but the risk of B could be smaller than that of A for a majority of models. We have two responses to this point. Firstly, average performance is the default way of comparing methods on a single model. Researchers seem to be sold on the idea that method A is preferable to B for a given model if the risk of A is smaller than that of B, even though B may have smaller loss in a majority of data sets from that model. A second response is that when the prior in BayesSim is actually the empirical distribution of models, BayesSim output can be used to estimate the proportion of cases in practice where one method will have smaller loss than another. This is not something that can be done using the traditional simulation approach.
Even when there are doubts about the practical validity of the prior, valuable local information about methods can be obtained using the output from BayesSim. Suppose it is believed that the risk of a method or methods depends importantly on a certain d -dimensional functional of the model, say g ( M ) . When the model is parametric, g ( M ) might simply be the vector of para- meters. One may regress L ( M i , δ ( y i ) ) on g ( M i ) using BayesSim output in order to estimate the function m ( x ) = [ L ( M i , δ ( Y i ) ) | g ( M i ) = x ] . This will in part mitigate concerns that the Bayes risk involves too much averaging. If m varies with x , then at least part of the variation in risk due to model differences is being explained by the regression. In the (perhaps rare) event that the mapping g is one to one, then m ( x ) is tantamout to the risk R ( M , δ ) .
Suppose the model is parametric and g ( M ) is the vector of parameters θ . By definition in BayesSim, the distribution of Y given θ does not depend on the prior distribution of θ . For this reason, one may obtain a consistent estimator of the risk m ( θ ) even though the prior is “wrong.” (By a “wrong” prior, we mean one that does not actually correspond to the empirical distribution of parameter values over all settings in which the parametric model is to be used.) This is an important point since it shows that if one is more interested in traditional risk than Bayes risk, then the particular prior used is not of utmost importance.
All the existing methods of estimating regression functions can be applied to inference of m from BayesSim output. For example, we may fit a parametric model to the data ( g ( M i ) , L ( M i , δ ( y i ) ) ) , i = 1 , ⋯ , N , which would be es- pecially appealing when d is large.
If no parametric model for the BayesSim output is obvious, and d is large, then the curse of dimensionality comes into play. However, this should not be a real obstacle to the statistician, who simply needs to apply some of the vast array of new regression methodology that deals with this problem. For example, support vector machines ( [
When M is a class of functions, a possible way of summarizing elements of M is to use principal components for curves. We may regard a randomly chosen element f of M as a stochastic process. The process may be re- presented by a Karhunen-Loève expansion of the form
f ( x ) = ∑ i = 1 ∞ θ i b k ( x ) , (13)
where b 1 , b 2 , ⋯ are eigenfunctions each having norm 1. Letting K be the covariance kernel of the process, b 1 has the property that it maximizes
Var [ ∫ b 1 ( x ) f ( x ) d x ] = ∫ ∫ b 1 ( s ) b 1 ( t ) K ( s , t ) d s d t , (14)
subject to the constraint that b 1 have norm 1. Each subsequent function b i maximizes ∫ ∫ b i ( s ) b i ( t ) K ( s , t ) d s d t subject to the constraints that b i has norm 1 and is orthogonal to each of b 1 , ⋯ , b i − 1 .
Now, suppose f 1 , ⋯ , f N are randomly chosen curves in a BayesSim analysis. Then we may compute the first few principal components of each curve, i.e.,
θ i j = ∫ b i ( x ) f j ( x ) d x , i = 1 , ⋯ , I , j = 1 , ⋯ , N , (15)
and use these components as regressors, per the discussion at the beginning of this section.
Ideally the prior distribution π used in BayesSim is a good approximation to the frequency distribution of models encountered in practice. If this is the case, then the Bayes risk truly represents the average loss associated with a method over all cases encountered in practice. Considerations in the choice of π are more or less the same as those used in making an objective choice of prior in a Bayesian data analysis.
Suppose that the collection of models M is parametric. Then if one is investigating methodology to be used in a particular subject matter area, it would be sensible to choose π to approximate whatever is known about the distribution of parameter values in that area. If nothing is known about said distribution, then a noninformative prior could be used.
Likewise, a noninformative prior would seem to be a sensible choice when one is investigating methodology that will be used in a wide variety of subject matter areas. However, the question arises “Do commonly used noninformative priors well-approximate the frequency distribution of parameters over a variety of settings?” There are at least a couple of famous examples of when they ap- parently do. The Jeffreys noninformative prior for a proportion in a binomial experiment is the U-shaped beta distribution proportional to θ − 1 / 2 ( 1 − θ ) − 1 / 2 for θ ∈ ( 0,1 ) . The author of [
Whether the previous two examples are isolated, or there are numerous situations where, say, a Jeffreys prior coincides with the empirical distribution of parameters, is a question beyond the scope of this article. However, it seems to be a worthwhile question for further research, especially if BayesSim were to become a commonly used method of simulation. A means of addressing this question would be to perform meta analyses, or to study the results of existing meta analyses.
When the models of interest are nonparametric, construction of priors becomes more challenging. Here, recent proposals in the Bayesian literature are of interest. For example, a commonly used noninformative prior for probability densities is the Dirichlet process, [
We consider two examples. One is parametric and the other nonparametric, involving the skew-normal distribution and goodness of fit, respectively.
Suppose we have a random sample X 1 , ⋯ , X n from the following skew-normal density:
f ( x | ξ , ω , δ ) = 2 ω ϕ ( x − ξ ω ) Φ ( δ 1 − δ 2 ( x − ξ ω ) ) , − ∞ < x < ∞ , (16)
where ϕ and Φ are the pdf and cdf, respectively, of the standard normal distribution, and the parameter space is { ( ξ , ω , δ ) : − ∞ < ξ < ∞ , ω > 0 , − 1 < δ < 1 } . The quantity δ is the so-called skew parameter and controls the degree of skewness of the distribution. Obviously, when δ = 0 the density is N ( ξ , ω 2 ) . As δ tends to 1(−1), the density approaches a half-normal distribution with left (right) endpoint ξ .
We wish to compare the efficiency of maximum likelihood (ML) and method- of-moments (MOM) estimators of ( ξ , ω , δ ) using BayesSim. This is a good example for use of BayesSim for a couple of reasons. Firstly, finding exact expressions for the mean squared error of estimators in finite samples appears to be hopeless. Secondly, the information matrix of the skew-normal model has a well-known singularity at δ = 0 ; see [
We will consider three priors, which differ only with respect to the dis- tribution of δ . Each prior is such that ( ξ , ω ) is independent of δ , δ has a beta distribution, ω has a gamma distribution with shape and rate parameters each 1 / 2 , and, given ω , ξ is normal with mean 0 and variance ω 2 . For a > 0 and b > 0 , the beta density is proportional to u a − 1 ( 1 − u ) b − 1 for 0 < u < 1 . The three beta densities employed have ( a , b ) equal to ( 1 / 2 ,1 ) , ( 1,1 ) and ( 1, 1 / 2 ) . The first and third of these emphasize values of δ near 0 and 1, respectively, while the second is uniform. Since this example is for the sake of illustration, we consider only the sample size n = 100 . We generate 10,000 independent and identically distributed three-vectors from each of the three priors. For each given three-vector ( ξ , ω , δ ) , we generate one sample of size 100 from a skew-normal distribution having those parameters.
MOM estimates are computed from sample moments using the following equations:
μ = E ( X i ) = ξ + ω δ 2 π (17)
σ 2 = Var ( X i ) = ω 2 ( 1 − 2 δ 2 π ) (18)
γ = E [ ( X i − μ ) 3 ] σ 3 = ( 4 − π 2 ) δ 3 ( π / 2 − δ 2 ) 3 / 2 (19)
Maximum likelihood estimates were approximated using the nonlinear optimization routine optim in R.
Basic results are summarized in
is ∑ i = 1 10 , 000 ( δ ^ i , M L E − δ i ) 2 / 10 , 000 , where δ i and δ ^ i , M L E are the value of δ and
the ML estimate of δ , respectively, generated on the i th replication of the simulation that used the Beta ( 1 / 2 ,1 ) prior. One interesting conclusion is that the performance of both ML and MOM estimators degrades as the proportion of δ s near 1 becomes less. Also, not surprisingly, the ML estimators are sub- stantially more efficient than the MOM estimators.
More detailed information can be obtained by studying the relationship between estimation error and the parameters ( ξ , ω , δ ) . Consider
Parameter | ||||
---|---|---|---|---|
Prior | ||||
Beta | MLE | 0.404 | 0.092 | 0.099 |
MOM | 1.426 | 0.155 | 0.522 | |
CI | (0.924, 1.121) | (0.052, 0.073) | (0.413, 0.432) | |
Uniform | MLE | 0.349 | 0.077 | 0.109 |
MOM | 1.137 | 0.102 | 0.499 | |
CI | (0.700, 0.877) | (0.016, 0.032) | (0.379, 0.401) | |
Beta | MLE | 0.290 | 0.073 | 0.083 |
MOM | 0.903 | 0.092 | 0.361 | |
CI | (0.525, 0.699) | (0.010, 0.027) | (0.268, 0.289) |
Parameter | |||
---|---|---|---|
Prior | |||
Beta | (0.762, 0.779) | (0.627, 0.646) | (0.777, 0.793) |
Uniform | (0.695, 0.713) | (0.548, 0.569) | (0.717, 0.734) |
Beta | (0.672, 0.690) | (0.548, 0.568) | (0.713, 0.731) |
scattered widely throughout the interval ( 0,1 ) , but a small proportion of points lie very near the 45 degree line, as one would hope most of them would. The solid line is a local linear estimate, indicating considerable bias in the ML estimates for most values of δ . Further investigation shows that the parameter ω explains the peculiar scatterplot. In
Although it is possible that conclusions as in the last two paragraphs would arise from a carefully done conventional simulation, it seems that the con- clusions are much clearer and more easily reached when BayesSim is used.
Here we compare the performance of two nonparametric goodness-of-fit tests of the null hypothesis that a data set is drawn from a normal distribution. Let X 1 , ⋯ , X n be a random sample from an unknown density f . We wish to test the null hypothesis that f is normally distributed with mean μ and variance
σ 2 , where neither μ nor σ 2 is specified. The tests considered are the popular Shapiro-Wilk test ( [
Λ ( h ) = ∏ i = 1 n f ^ i ( X i | h ) [ 2 π 2 ] − n / 2 e x p ( − n / 2 ) , (20)
where σ ^ 2 is the maximum likelihood estimator of σ 2 on the assumption of normality, and f ^ i ( ⋅ | h ) is a “leave-one-out” kernel density estimator:
f ^ i ( x | h ) = 1 ( n − 1 ) h ∑ j = i K ( x − X j h ) . (21)
The numerator of Λ ( h ) is the likelihood cross-validation criterion, as studied by [
K ( z ) = { ( 8 πe ) 1 / 2 Φ ( 1 ) } − 1 exp [ − 1 2 { ln ( 1 + | z | ) } 2 ] . (22)
Results in [
Our interest is in comparing the power of the two tests, and so in using BayesSim we generate densities that are not normal. Each density generated is a mixture of two or more normals, having the form
f ( x ) = ∑ i = 1 M w i 1 σ i ϕ ( x − μ i σ i ) . (23)
Densities are generated as follows:
・ A value of M between 2 and 20 is selected from a distribution such that the probability of m is proportional to m − 1 , m = 2 , ⋯ , 20 .
・ Given m , values w 1 , ⋯ , w m are selected from the Dirichlet distribution with all m parameters equal to 1 / 2 .
・ Given m and independent of w 1 , ⋯ , w m , 1 / σ 1 2 , ⋯ , 1 / σ m 2 are independent and identically distributed as gamma with shape and rate each 1 / 2 , and, conditional on σ 1 , ⋯ , σ m , μ 1 , ⋯ , μ m are independent with μ j distributed N ( 0, σ j 2 ) , j = 1 , ⋯ , m .
Generating densities in this way provides a variety of different distributional types, including ones that are skewed, lepto- or platykurtic, and/or multimodal. Furthermore, functionals of interest, such as moments and norms are readily calculated for normal mixtures, as pointed out by [
Ten thousand densities were generated independently as described above, and then one data set of size n = 200 was generated from each of these densities. A P -value for the Shapiro-Wilk test was determined using the R function shapiro. test. The critical value of a size 0.05 test based on Λ ^ was approximated by generating 10,000 values from a standard normal density.
Since we are interested in power, a 0 - 1 loss function is used, in which case the Bayes risk of each test is estimated by the proportion of rejections among all ten thousand replications. These two proportions were 0.8294 and 0.8003 for the Shapiro-Wilk and kernel-based tests, respectively. A 95% confidence interval for the difference in Bayes risks is ( 0.0245 , 0.0334 ) .
To gain more insight into the difference between the tests, we considered the relationship of rejection probabilities to various characteristics of the selected densities. Letting μ and σ denote the mean and standard deviation of a selected density f , we computed the skewness coefficient E ( X − μ ) 3 / σ 3 , the excess kurtosis E ( X − μ ) 4 / σ 4 − 3 and the Kullback-Leibler divergence
K L ( f ) = − ln σ − ∫ − ∞ ∞ ϕ ( z ) l n f ( σ z + μ ) d z (24)
where ϕ is the standard normal density. The quantity K L ( f ) measures the discrepancy between ϕ and a standardized version of f . It represents the consistency parameter for the kernel-based test, and hence a strong dependence is expected between K L ( f ) and the power of that test. This is illustrated in
The same plot as in
suggest that the only substantial difference in the power of the two tests occurs at small values of l n ( K L ( f ) ) .
Finally, we present heat plots showing how the power of the two tests are related to skewness and kurtosis. Define the transformation t by t ( x ) = sgn ( x ) | x | 1 / 5 , for any real x . Letting s and k denote values of the skewness and kurtosis coefficents defined previously, define independent variables t ( s ) and t ( k ) . Also define y 1 i to be 1 if the Shapiro-Wilk test rejected H 0 for replication i of the simulation and 0 otherwise. The variables y 2 i , i = 1 , ⋯ , 10 , 000 are defined similarly for the kernel-based test. We then computed Nadaraya-Watson kernel estimates from the data sets ( t ( s i ) , t ( k i ) , y 1 i ) , i = 1 , ⋯ , 10 , 000 , and ( t ( s i ) , t ( k i ) , y 2 i ) , i = 1 , ⋯ , 10 , 000 . These estimates are shown in
An overlooked method of simulation called BayesSim has been propounded. The method provides the possibility of learning more about the statistical methodology being considered while at the same time generating no more (or even less) data than in the conventional method. Does the author think that BayesSim should supplant the conventional approach? Of course not. There are undoubtedly many situations where the latter method will be preferable. However, in the author’s opinion there are equally many cases where BayesSim can provide valuable information that is not attainable by generating data from just a few models. Furthermore, conclusions that might be reached with the conventional approach are often much more evident and compelling when data from thousands of different models are considered.
Some will argue that the increasingly complicated methodology of modern statistics would make it difficult to choose satisfactory priors in BayesSim. That may be true but the process of choosing a prior in BayesSim cannot be more difficult than choosing a prior for a complex model in a Bayesian data analysis. And the difficulty of choosing priors has certainly not prevented an explosion in the use of Bayesian statistics.
Another reason the author finds BayesSim appealing is that it provides a rich source of data to which statistical researchers may apply the sophisticated methods that they are constantly developing. Kernel methods, nonparametric additive modeling, projection pursuit and support vector machines are but a few of the methods that could be used to investigate how, for example, estimator performance is related to parameters. In short, I believe that statisticians will find that BayesSim can make their simulation studies more informative, and more interesting.
This research was supported by National Science Foundation Grant DMS- 0604801.
Hart, J.D. (2017) Use of BayesSim and Smoothing to Enhance Simulation Studies. Open Journal of Statistics, 7, 153-172. https://doi.org/10.4236/ojs.2017.71012