Open Journal of Statistics, 2012, 2, 389-400 http://dx.doi.org/10.4236/ojs.2012.24047 Published Online October 2012 (http://www.SciRP.org/journal/ojs)

Linear Maximum Likelihood Regression Analysis for Untransformed Log-Normally Distributed Data

Sara M. Gustavsson, Sandra Johannesson, Gerd Sallsten, Eva M. Andersson
Department of Occupational and Environmental Medicine, Sahlgrenska University Hospital, Academy at University of Gothenburg, Gothenburg, Sweden
Email: sara.gustavsson@amm.gu.se

Received June 25, 2012; revised July 27, 2012; accepted August 10, 2012

ABSTRACT

Medical research data are often skewed and heteroscedastic. It has therefore become practice to log-transform data in regression analysis, in order to stabilize the variance. Regression analysis on log-transformed data estimates the relative effect, whereas it is often the absolute effect of a predictor that is of interest. We propose a maximum likelihood (ML) based approach to estimate a linear regression model on log-normal, heteroscedastic data. The new method was evaluated with a large simulation study. Log-normal observations were generated according to the simulation models, and parameters were estimated using the new ML method, ordinary least-squares regression (LS) and weighted least-squares regression (WLS). All three methods produced unbiased estimates of the parameters and of the expected response, and ML and WLS yielded smaller standard errors than LS. The approximate normality of the Wald statistic, used for tests of the ML estimates, in most situations produced a correct type I error risk. Only ML and WLS produced correct confidence intervals for the estimated expected value. ML had the highest power for tests regarding β1.

Keywords: Heteroscedasticity; Maximum Likelihood Estimation; Linear Regression Model; Log-Normal Distribution; Weighted Least-Squares Regression

1. Introduction

Measurements in occupational and environmental research, e.g.
exposure and biomarkers, often have a skewed distribution, with a median smaller than the mean and only positive values. It is also common with heteroscedasticity, where the variance increases with the expected value. Such data can often be described by a log-normal or quasi-log-normal distribution [1].

Associations (for example between exposure and health effects/biomarkers, or between personal exposure and background variables) are often analyzed using regression models. Ordinary least-squares (LS) regression analysis is a commonly used regression method, and it is based on the assumption of a constant variance and a normal distribution for the stochastic term. One way of handling non-normal data is with non-parametric median regression (or Least-Absolute-Value regression), see e.g. [2], where no assumptions are made about the distribution of the response variable. The parameter estimates are then found by minimizing the sum of the absolute values of the residuals (whereas LS minimizes the sum of squared residuals). However, as a non-parametric method it requires larger samples, and it may have multiple solutions [3].

In situations where the response variable has a skewed distribution and an increasing variance, it has become practice to log-transform the response variable. Regression analysis on log-transformed data has, for instance, been used to establish reference models [4], to find suitable biomarkers [5], to determine suitable surrogates for exposure [6,7], and to estimate the cost function in health economics [8]. A model in which the response variable is log-transformed, ln(Y), will estimate the relative effect of each predictor, whereas in many cases it is the absolute effect that is desired [9]. It must also be considered that, e.g.,
a t-test for comparing the expected values of two groups based on the mean of ln(Y) is not equal to a test based on the mean of the original log-normal data Y, since the expected value of Y is a function of both μ and σ, whereas the expected value of ln(Y) is a function of only μ. If the two σ-parameters are not equal, a t-test based on ln(Y) may not give the correct type I error regarding the difference between E[Y1] and E[Y2] [10,11].

In many cases a linear relationship between the response and the predictor (e.g. between personal exposure and background variables) is a reasonable assumption; for example, it is realistic that the personal exposure increases linearly with time spent in a certain environment (e.g. time spent in traffic). This linearity will be lost in a

Copyright © 2012 SciRes. OJS
log-transformation. On the other hand, if the log-normal distribution is ignored in order to preserve the linearity, tests based on the assumption of a constant variance may give misleading results [12].

There is a need for methods that handle log-normally distributed data in linear regression models, based on moderate sample sizes. In order to estimate the linear association (and the absolute effect), but still take into account the log-normal distribution with a non-constant variance, we propose a maximum likelihood (ML) based method for regression analysis. In this paper we have evaluated this new method using large-scale simulations, which allowed us to analyze the bias, variance and distribution of the regression coefficients resulting from the new method, as well as to compare it to LS and weighted least-squares (WLS) regression analysis. A data set on personal exposure to 1,3-butadiene in five Swedish cities was used to illustrate the three methods.

2. Data

We considered the situation where the response variable Y is assumed to follow a log-normal distribution (i.e. ln(Y) is normally distributed) and where the expected value is assumed to be a linear function of the predictors,

E[Y | X] = β0 + β1·X1 + … + βp·Xp.

In regression analysis, some X-variables can be included because of a known (or suspected) association with Y, in order to decrease the variance or lower the risk of confounder effects [5]. Since Y follows a log-normal distribution, ln(Y) can be expressed using the following model:

ln(Y_i) = ln(β0 + β1·x_{i1} + … + βp·x_{ip}) − σ_Z²/2 + e_i,  where e_i ~ N(0, σ_Z²).

In this model, βk measures the absolute effect of Xk on Y; linear regression analysis on the log-transformed data would instead yield an estimate of the relative effect of Xk. This is illustrated in Figure 1. A log-transformation is therefore not suitable in situations when the aim is to estimate the absolute effect of X on Y (rather than the relative one).
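The data-generating mechanism above can be sketched in a few lines of code. This is our own illustration, not code from the paper; the function name is invented, and the parameter values anticipate Model A of Section 2.1:

```python
import numpy as np

def simulate_lognormal_linear(x, beta0, beta1, sigma_z, rng=None):
    """Draw Y ~ log-normal with E[Y | x] = beta0 + beta1 * x.

    Setting the log-scale mean to ln(beta0 + beta1*x) - sigma_z**2/2 makes
    E[Y | x] exactly linear in x, while SD(Y | x) grows in proportion to
    E[Y | x], i.e. the heteroscedasticity described in the text.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    mu = np.log(beta0 + beta1 * x) - sigma_z**2 / 2
    return np.exp(rng.normal(mu, sigma_z))

# Balanced design on x in {2, 8, 14}, with Model A-like parameter values:
x = np.repeat([2, 8, 14], 36)
y = simulate_lognormal_linear(x, beta0=4.803, beta1=0.574, sigma_z=0.354, rng=1)
```

Averaging many such draws at a fixed x recovers the linear mean beta0 + beta1*x, which is the property the ML method below exploits.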
Figure 1. A linear regression where Y|X follows a log-normal distribution: the absolute effect is 0.9 (left); the log-transformation stabilizes the variance but distorts the linear relation (middle); estimation based on log(Y) results in an exponential function with a relative effect of 29% (right).

In a licentiate thesis [13] it was suggested that the linear regression should be estimated from untransformed log-normal data using maximum likelihood methods. In this article, the properties of these maximum likelihood estimates will be evaluated.

2.1. Simulation Study

The properties of a new method for statistical analysis can be derived theoretically and/or by simulations and examples. In a simulation study, a model for the variable of interest (here personal exposure) is used to generate samples of observations, y1, …, yn, and then the parameters under investigation (here regression parameters) are estimated from each sample. This is repeated to obtain the distribution of the estimates. Our simulation study allowed us to assess the bias and standard errors of the parameter estimates resulting from the new maximum likelihood-based regression analysis, and to compare these results to those of LS and WLS.

We used simulation models in which the response, personal exposure to particulate matter smaller than 2.5 μm (PM2.5), was assumed to be a linear function of background variables (residential outdoor level of PM2.5, smoking and time spent at home). Two different simulation models were used to generate data. Model A had only one predictor: the personal exposure to PM2.5 particles (μg/m³), Y, was assumed to be a linear function of the residential outdoor concentration of PM2.5 (μg/m³), ConcOut. Model B had three predictors and no intercept;
Y was a linear function of the number of cigarettes smoked per day, Smoke, the number of hours spent in the subject's own home, Home, and the residential outdoor concentration of PM2.5 (μg/m³), ConcOut. Since ConcOut > 0, its regression coefficient can be interpreted as a stochastic intercept. Datasets were simulated by generating normally distributed observations from Model A:

ln(Y_i) = ln(4.803 + 0.574·ConcOut_i) − 0.354²/2 + e_i,  where e_i ~ N(0, 0.354²).

Samples from Model B were simulated according to

ln(Y_i) = ln(2.092·Smoke_i + 0.761·ConcOut_i + 0.218·Home_i) − 0.450²/2 + e_i,  where e_i ~ N(0, 0.450²).

The parameters in the simulation models, (β0, β1) and (β1, β2, β3), were estimated from real measurement data (Johannesson et al. [14]). The number of repetitions needed in the simulation study was estimated: in order to obtain a 95% confidence interval for E[β̂] that is smaller than 2·0.0005, 4 million samples were needed. For the predictors, discrete values were used: ConcOut = {2, 8, 14}, Smoke = {0, 7, 14} and Home = {8, 16, 24}. Sample sizes n = 108 and n = 216 were used in the simulations, and the data sets were balanced with regard to the predictors. For Model A a second set was also created, with a data structure similar to that observed in the original dataset [14], which was slightly unbalanced.

2.2. Application to Exposure Data

The properties of the three regression methods (LS, WLS and ML) were illustrated using a set of data on personal exposure to 1,3-butadiene from five Swedish cities. 1,3-Butadiene is an alkene and has been listed as a known carcinogen by the International Agency for Research on Cancer (IARC). Traffic and exposure to tobacco smoke are considered to be two sources of personal exposure to 1,3-butadiene [15,16]. Wood burning has also been shown to increase personal exposure [17].
The dataset was collected in a study of exposure to carcinogens in urban air in five Swedish cities (Gothenburg, Umea, Malmo, Stockholm and Lindesberg, see [18]) and consisted of 268 measurements of personal 1,3-butadiene exposure. Background data were collected by a questionnaire.

3. Methods

In our investigation, the outcome variable Y was assumed to be log-normal with an expected value that is a linear function of the predictors:

E[Y_i | x_i] = β0 + β1·x_{i,1} + … + βp·x_{i,p}.

From the simulated data, the parameters of the regression model were estimated using ordinary least-squares and weighted least-squares estimation as well as maximum likelihood estimation, as described below.

3.1. Least-Squares Estimates

As mentioned before, the inference in the ordinary least-squares regression method, LS, is made under the assumption that Y_i ~ iid N(μ_{Y,i}, σ_Y²). The standard deviation σ_Y is assumed constant for all Y_i, and σ̂_Y² = MSE. The covariance matrix for the vector β̂_LS is estimated as

côv(β̂_LS) = MSE·(X'X)^{-1},

where X is an n × (p + 1) design matrix. The standard errors, se(β̂_LS), are estimated from the diagonal elements of côv(β̂_LS).

3.2. Weighted Least-Squares Estimation

For data that cannot be considered homoscedastic there are several estimation procedures, for instance the White heteroscedasticity-consistent estimator (see e.g. [19], p. 199). Since Y is assumed to follow a log-normal distribution, the nature of the heteroscedasticity is known:

σ²_{Y,i} = μ²_{Y,i}·(e^{σ_Z²} − 1),

i.e. the variance is a function of the expected value. Thus weighted least-squares regression analysis, WLS, is appropriate, in which each observation is weighted by W_i = 1/σ²_{Y,i}. The weights can be estimated from the LS regression: Ŵ_i = 1/ŷ²_{i,LS}. The covariance matrix for the vector β̂_WLS is estimated as

côv(β̂_WLS) = MSE_W·(X'WX)^{-1}.

3.3.
Maximum Likelihood Estimation in Regression

The maximum likelihood estimator (MLE) of a parameter θ is the value at which the likelihood function L(θ) attains its maximum as a function of θ, with the sample y1, …, yn held fixed. The log-likelihood is often more convenient to work with, and since the logarithm transformation is monotone, the function log L(θ|y) can be maximized instead. For a continuous variable, the likelihood function is given by the joint probability density function (pdf) of the sample, viewed as a function of θ. The MLE of θ can be found by differentiating log L(θ|y) with respect to θ. When θ is a scalar, it is enough that the likelihood function is differentiable to get a direct estimate of θ. In a regression analysis, the aim is to estimate the parameter vector θ = (β0, β1, …, βp, σ_Z). Let y1, y2, …, yn be a log-normal sample where
ln(y_i) ~ N(μ_{Z,i}, σ_Z²), with

μ_{Z,i} = ln(β0 + β1·x_{i1} + … + βp·x_{ip}) − σ_Z²/2,

and the log-likelihood function is

ln L = −(n/2)·ln(2π) − n·ln(σ_Z) − (1/(2σ_Z²))·Σ_i [ln(y_i) − ln(Σ_j βj·x_{ij}) + σ_Z²/2]².

The derivatives were previously calculated by Yurgens [13] and are given below with some corrections. For compactness, write m_i = Σ_j βj·x_{ij} (with x_{i0} = 1) and r_i = ln(y_i) − ln(m_i) + σ_Z²/2:

∂ln L/∂βk = (1/σ_Z²)·Σ_i r_i·x_{ik}/m_i,

∂ln L/∂σ_Z = −n/σ_Z + (1/σ_Z³)·Σ_i r_i² − (1/σ_Z)·Σ_i r_i,

∂²ln L/∂βk∂βl = −(1/σ_Z²)·Σ_i (x_{ik}·x_{il}/m_i²)·(1 + r_i),

∂²ln L/∂σ_Z∂βk = (1/σ_Z)·Σ_i x_{ik}/m_i − (2/σ_Z³)·Σ_i r_i·x_{ik}/m_i,

∂²ln L/∂σ_Z² = n/σ_Z² − n − (3/σ_Z⁴)·Σ_i r_i² + (3/σ_Z²)·Σ_i r_i.

The maximum likelihood estimates of θ = (β0, β1, …, βp, σ_Z) can be found by iteration, for example with the Newton-Raphson algorithm, see [20]. The covariance matrix of θ̂ is estimated by the inverse of the observed Fisher information matrix (see e.g. [21]), whose elements are the negative second derivatives of the log-likelihood. One of the known properties of the MLE is, under some regularity conditions, its asymptotic normality: as the sample size increases, the MLE of θ tends to a normal distribution with expected value θ and a covariance matrix equal to the inverse of the Fisher information matrix (see e.g. [21]).

3.4. Descriptive Statistics and Inference

In the simulation study, samples of log-normal observations were generated and three different methods were used to estimate the regression coefficients. The results from the three methods were compared using the expected value (mean), standard error, bias, skewness, the 95% central range and the correlation of the regression estimates. The standard error of β̂_i is denoted SE(β̂_i), and the sample-specific standard error, se(β̂_i), is the estimate of SE(β̂_i). The bias of an estimator, E[β̂_i] − β_i, was used as a measure of the systematic error. The skewness of an estimator was estimated as γ = E[(β̂ − E[β̂])³]/SE(β̂)³. For log-normal data the skewness is γ = (e^{σ²} + 2)·√(e^{σ²} − 1), while a symmetric distribution has γ = 0.
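Before turning to the descriptive measures, the three estimation procedures of Sections 3.1-3.3 can be sketched in code. This is our own illustrative sketch, not the authors' implementation: a general-purpose simplex optimizer stands in for the Newton-Raphson iterations with the analytic derivatives above, and the function names are invented:

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(theta, X, y):
    """Negative log-likelihood for ln(y_i) ~ N(ln(x_i'beta) - sz**2/2, sz**2).

    Terms that do not depend on theta, such as (n/2)*ln(2*pi), are dropped.
    """
    *beta, sz = theta
    m = X @ np.asarray(beta)
    if sz <= 0 or np.any(m <= 0):
        return np.inf                      # outside the parameter space
    r = np.log(y) - np.log(m) + sz**2 / 2
    return y.size * np.log(sz) + np.sum(r**2) / (2 * sz**2)

def fit_all(X, y):
    """LS, WLS (weights 1/yhat^2 from the LS fit) and ML estimates."""
    # ordinary least squares
    b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
    # weighted least squares: W_i proportional to 1 / E[Y_i]^2,
    # estimated by 1 / yhat_i^2; solves (X'WX) b = X'Wy
    w = 1.0 / (X @ b_ls) ** 2
    Xw = X * w[:, None]
    b_wls = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    # maximum likelihood, started from the LS fit
    sz0 = np.std(np.log(y) - np.log(X @ b_ls))
    res = minimize(neg_loglik, np.append(b_ls, sz0), args=(X, y),
                   method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-10, "maxiter": 10000})
    return b_ls, b_wls, res.x[:-1], res.x[-1]
```

On data generated from Model A, all three fits recover the β-parameters; the standard-error and inference comparisons of Section 4 are what separate the methods.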
It has been suggested that a sample with skewness less than 0.5 should be considered as approximately symmetric, while skewness above 1 may be considered highly skewed. The 95% central range, CR95, was defined as the difference between the 2.5th and 97.5th percentiles. The sample-specific standard error of μ̂_{Y|X} is

se(μ̂_{Y|X}) = [ Σ_{i=0}^{p} x_i²·se(β̂_i)² + 2·Σ_{i<j} x_i·x_j·côv(β̂_i, β̂_j) ]^{1/2},

where x0 = 1 and côv(β̂_i, β̂_j) is the sample-specific estimate of cov(β̂_i, β̂_j).

The inference properties of the three methods were evaluated by comparing the results from tests of the null hypothesis H0: β = βT, where βT is the value specified in the simulation model. The LS and WLS estimates were tested with the t-test, t = (β̂ − βT)/se(β̂), which follows a Student's t-distribution. The ML estimates were tested using a Wald-type test statistic, which utilizes the large-sample normality of the ML estimator and the observed Fisher information. Under H0, the Wald statistic W = (β̂ − βT)/se(β̂) asymptotically follows a N(0, 1) distribution, and W² asymptotically follows a chi-square distribution, see e.g. [21], Section 9.4. Other possible large-sample test procedures for ML estimates (not used in this study) are the score statistic and the full likelihood-ratio statistic, see e.g. [21]. We chose the Wald statistic because of its computational advantages.

The properties of the test statistics above, when used on log-normal data, were evaluated by the risk of type I error and the power. The true probability of a type I error for a specified nominal α-value (denoted α*) was estimated as the proportion of test-statistic values beyond the respective critical value. The power results were based on simulations in which the tested parameter (e.g. β0) was varied according to H1, whereas the other parameter (e.g. β1) was held constant.

4. Results

The results are presented in four sections: the distribution
of β̂ and σ̂_Z; predictions μ̂_{Y|X}; inference; and an application to exposure data for 1,3-butadiene.

4.1. The Distribution of the Estimates

For data generated according to Model A, all three methods produced unbiased estimates of the β-coefficients (the absolute effects of the X-variables), see Table 1. Of the three methods, ML gave the smallest SE(β̂_i), 11% smaller than that of LS. For LS, se(β̂0) overestimated SE(β̂0) by over 35% for the balanced data and by about 18% for the unbalanced. For all three methods the standard error increased when unbalanced data were used. LS had the largest CR95, while ML had the smallest (the differences were around 12% between LS and ML and

Table 1. The expected value (E[·]), standard errors (SE[·], E[se(·)]), skewness (γ[·]) and 95% central range (CR95) for the estimates of β, and the expected value and standard error for the estimate of σZ^a.

                         n = 108                  n = 216
                   LS      WLS     ML       LS      WLS     ML
Balanced data
β0 = 4.803
  E[β̂0]          4.803   4.803   4.806    4.803   4.803   4.804
  SE[β̂0]         0.486   0.441   0.430    0.343   0.312   0.304
  E[se(β̂0)]      0.656   0.438   0.424    0.465   0.311   0.302
  γ[β̂0]          0.063   0.151   0.143    0.045   0.107   0.100
  CR95[β̂0]       1.904   1.730   1.685    1.346   1.223   1.190
β1 = 0.574
  E[β̂1]          0.574   0.574   0.574    0.574   0.574   0.574
  SE[β̂1]         0.072   0.066   0.064    0.051   0.047   0.045
  E[se(β̂1)]      0.070   0.066   0.064    0.050   0.047   0.045
  γ[β̂1]          0.125   0.054   0.053    0.089   0.039   0.039
  CR95[β̂1]       0.282   0.260   0.252    0.199   0.183   0.178
σZ = 0.354
  E[σ̂Z]            —     0.351   0.350      —     0.353   0.352
  SE[σ̂Z]           —     0.028   0.024      —     0.020   0.017
Unbalanced data
β0 = 4.803
  E[β̂0]          4.803   4.803   4.807    4.803   4.803   4.805
  SE[β̂0]         0.564   0.488   0.475    0.399   0.345   0.335
  E[se(β̂0)]      0.665   0.484   0.469    0.472   0.344   0.334
  γ[β̂0]         −0.030   0.156   0.149   −0.020   0.110   0.105
  CR95[β̂0]       2.215   1.911   1.860    1.565   1.352   1.315
β1 = 0.574
  E[β̂1]          0.574   0.574   0.573    0.574   0.574   0.574
  SE[β̂1]         0.089   0.078   0.075    0.063   0.055   0.053
  E[se(β̂1)]      0.082   0.077   0.075    0.058   0.055   0.053
  γ[β̂1]          0.184   0.024   0.026    0.130   0.017   0.018
  CR95[β̂1]       0.347   0.305   0.296    0.246   0.215   0.209
σZ = 0.354
  E[σ̂Z]            —     0.351   0.350      —     0.353   0.352
  SE[σ̂Z]           —     0.028   0.024      —     0.020   0.017

a Data were generated from Model A, 4
million iterations. For reference, the skewness of the normal distribution is γ = 0, whereas the log-normal data y1 ⋯ yn generated from Model A have skewness γ = 1.15.
3% between WLS and ML). There was a strong association between the LS- and ML-estimates; the correlation between the estimates was 0.88 and 0.90 for β̂0 and β̂1, respectively. The association was weaker for high values. The WLS- and ML-estimates were even more similar, with a correlation of 0.97 for both the β0- and β1-estimates, and with no weakening of the association for higher values. For σZ, both WLS and ML showed only small biases, which decreased with increasing sample size, Table 1. ML gave the smallest standard error and seemed robust to unbalanced data. Since there is no generally accepted method for estimating σZ with LS, no such results were presented.

Data were also generated from versions of Model A in which one of the β-parameters was set to 0. All three methods produced unbiased estimates of β, and the standard errors were much smaller than the results shown in Table 1 (both for SE(β̂) and E[se(β̂)]); for the zero parameter the standard error was between 45% and 76% smaller than the corresponding value in Table 1, and for the other parameter the standard error was between 26% and 52% smaller. For a situation where the intercept is zero (β0 = 0), ML gave the smallest SE(β̂) for both parameters, 50% smaller than that of LS. For both ML and WLS, se(β̂0) was a good estimator for SE(β̂0), but for LS se(β̂0) overestimated SE(β̂0) by about 80%. For a situation where the predictor X has no effect on the response Y (β1 = 0), all three methods produced approximately the same SE(β̂1), and se(β̂1) was a good estimator for SE(β̂1). The estimates of σZ (and their standard errors) were independent of the values of β0 and β1 and had the same values as in Table 1.

For data with a large variation (large σZ) we detected the occasional estimation problem for ML. The ML method is based on iterations, whereas LS and WLS have analytical expressions for the parameter estimates, and ML was more sensitive to large outliers.
Situations in which ML produces unreasonable results can sometimes be avoided by excluding the extreme observations, but all three methods will then tend to underestimate both the intercept and the standard errors.

Data were also generated according to Model B. ML provided the smallest variation (SE(β̂_i) was between 18% and 44% smaller than that of LS). There was a very small underestimation of σZ, but the bias decreases with increasing sample size, as expected from the properties of the MLE. As before, ML had the smallest SE(β̂), Table 2. For both ML and WLS, se(β̂) was a good estimator for SE(β̂). For LS, se(β̂1) underestimated SE(β̂1) by 7%-8%, while se(β̂3) overestimated SE(β̂3) by 12%-13%, Table 2. SE(β̂_i) depends on the value of β_i, the range and values of the X-variables, and the value of σZ. A separate simulation was conducted in which β0 = 0, β1 = β2 = β3 = 1, σZ = 0.450, and where all predictors (ConcOut, Smoke and Home) had values between 2 and 14; the results showed that for this situation all the standard errors were the same within each method: SE(β̂1,LS) = SE(β̂2,LS) = SE(β̂3,LS) = 0.196, SE(β̂1,WLS) = SE(β̂2,WLS) = SE(β̂3,WLS) = 0.169 and SE(β̂1,ML) = SE(β̂2,ML) = SE(β̂3,ML) = 0.161.

4.2. Predictions

The results in Tables 1 and 2 illustrated that the β̂-values were approximately symmetrical, and hence the same applies to μ̂_{Y|X} for fixed x. All three methods provided unbiased estimates of the β-parameters, and thus the point estimates of μ_{Y|X} will be unbiased. For ML and WLS, se(μ̂_{Y|X}) did adequately estimate SE(μ̂_{Y|X}), while LS produced too large standard errors for small x and too small for large x. For most values of x this results in erroneous confidence intervals for LS. ML did produce slightly narrower confidence intervals than WLS, see Figure 2. For a simple regression model (with an intercept and one X-variable) it can be shown that SE(μ̂_{Y|X}) has a minimum at x = −ρ_{0,1}·SE(β̂0)/SE(β̂1), where ρ_{0,1} = Corr(β̂0, β̂1); in Figure 2 this is at x ≈ 5. As mentioned above, this minimum was well estimated with both ML and WLS, while LS indicated a minimum at x = 8.
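The location of that minimum follows directly from the covariance matrix of (β̂0, β̂1); a small helper of our own illustrates it. The standard errors below are of the order seen in Table 1 (n = 108, ML), but the correlation value −0.74 is an assumed number for illustration, not one reported in the paper:

```python
import numpy as np

def x_min_se(cov):
    """x minimizing Var(b0_hat + b1_hat*x) = V00 + 2*x*V01 + x**2 * V11.

    Setting the derivative to zero gives x = -V01/V11, which equals
    -Corr(b0_hat, b1_hat) * SE(b0_hat) / SE(b1_hat).
    """
    cov = np.asarray(cov, dtype=float)
    return -cov[0, 1] / cov[1, 1]

# Illustration (rho = -0.74 is an assumption, not from the paper):
se0, se1, rho = 0.430, 0.064, -0.74
cov = np.array([[se0**2, rho * se0 * se1],
                [rho * se0 * se1, se1**2]])
# x_min_se(cov) then lands near 5, the location of the minimum in Figure 2
```

A negative correlation between intercept and slope estimates is what pushes the minimum to a positive x inside the design range.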
The incorrect se(μ̂_{Y|X}) for LS is a result of the underestimated correlation (ρ̂_{0,1} < ρ_{0,1}) and the overestimated standard error (E[se(β̂0)] > SE(β̂0)). For a regression model with no intercept and several X-variables, SE(μ̂_{Y|X}) is always minimized for x1 = x2 = … = xp = 0. For a multiple regression model with an intercept, the minimum of SE(μ̂_{Y|X}) can be found by solving the equation system

∂Var(β̂0 + β̂1·x1 + … + β̂p·xp)/∂x_i = 0,  i = 1, …, p.

4.3. Inference

Hypothesis testing regarding the separate regression parameters using the t-test (LS and WLS) or the Wald test (ML) was evaluated. Data were generated according to versions of Model A where one parameter was set to zero (β_i = 0, i ∈ {0, 1}). The null hypothesis H0: β_i = 0 was tested against both one- and two-sided alternatives in a situation where H0 was true, Table 3.
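The α* values reported below amount to a simple Monte Carlo proportion of rejected Wald tests; a sketch of that computation (our own helper, two-sided case):

```python
import numpy as np

def empirical_type_i_error(est, se, beta_true, crit=1.96):
    """Estimate alpha* as the proportion of simulated samples in which the
    two-sided Wald test W = (est - beta_true)/se exceeds the critical value.

    crit = 1.96 corresponds to nominal alpha = 0.05 under the N(0, 1)
    large-sample approximation of W.
    """
    w = (np.asarray(est) - beta_true) / np.asarray(se)
    return float(np.mean(np.abs(w) > crit))
```

Fed with the β̂ and se(β̂) collected over the simulation repetitions, this proportion is the α* entry of the corresponding cell of Table 3; when the estimates really are normal around the true value it returns roughly the nominal 0.05.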
Table 2. The expected value (E[·]), standard errors (SE[·], E[se(·)]), skewness (γ[·]) and CR95 for the estimates of β^a.

                         n = 108                  n = 216
                   LS      WLS     ML       LS      WLS     ML
β1 = 2.092
  E[β̂1]          2.092   2.092   2.087    2.092   2.092   2.090
  SE[β̂1]         0.243   0.208   0.200    0.172   0.147   0.141
  E[se(β̂1)]      0.224   0.205   0.196    0.159   0.146   0.140
  γ[β̂1]          0.213   0.125   0.121    0.150   0.090   0.085
  CR95[β̂1]       0.953   0.816   0.783    0.674   0.577   0.554
β2 = 0.761
  E[β̂2]          0.761   0.761   0.760    0.761   0.761   0.760
  SE[β̂2]         0.206   0.146   0.139    0.146   0.103   0.099
  E[se(β̂2)]      0.205   0.144   0.137    0.146   0.103   0.098
  γ[β̂2]          0.064   0.133   0.124    0.043   0.094   0.087
  CR95[β̂2]       0.812   0.574   0.547    0.573   0.406   0.386
β3 = 0.218
  E[β̂3]          0.218   0.218   0.220    0.218   0.218   0.219
  SE[β̂3]         0.116   0.068   0.065    0.082   0.048   0.046
  E[se(β̂3)]      0.131   0.066   0.063    0.093   0.047   0.045
  γ[β̂3]         −0.102   0.282   0.252   −0.074   0.198   0.177
  CR95[β̂3]       0.458   0.265   0.253    0.323   0.187   0.179
σZ = 0.450
  E[σ̂Z]            —     0.444   0.443      —     0.447   0.446
  SE[σ̂Z]           —     0.038   0.031      —     0.028   0.022

a Data were generated from Model B, 4 million iterations. For reference, the skewness of the normal distribution is γ = 0, whereas the log-normal data y1 ⋯ yn from Model B have skewness γ = 1.53.

Figure 2. SE(μ̂_{Y|X=x0}) and E[se(μ̂_{Y|X=x0})] for the three methods, applied to data generated from Model A, 4 million iterations with sample size n = 108.

For a model with no intercept (β0 = 0), tests of the LS-estimates produced an α* which was much smaller than α (α* < 0.2α, regardless of H1). This was a result of the overestimation of SE(β̂0) by se(β̂0). For tests of the WLS- and ML-estimates, α* was approximately equal to α (within 10% of the true value). A slight skewness could be observed: α* ≤ α for H1: β0 > 0 and α* ≥ α for H1: β0 < 0, while α* for H1: β0 ≠ 0 was slightly higher than or the same as α. The source of this skewness was a positive correlation between β̂0 and se(β̂0); small values of β̂0 gave more extreme (negative) test scores.

For a model
where X has no effect on Y (β1 = 0), all three tests produced α*-values close to α (within 20% of the true value). For ML, α* was slightly higher than for LS and WLS. The risk of type I error was approximately the same even when the sample size was doubled (n = 216).

Data were generated from Model A and the hypothesis H0: β_i = βT was tested (βT was the parameter value specified in Model A). The results for WLS and ML (data not shown) were that α* was within 30% and 16% of α, respectively, for nominal sizes 0.05 and 0.10. For nominal size 0.01 the relative deviation could be up to 70%. As before, tests of the LS-estimates of β0 produced an α* much too small (α* ≤ 0.1α). Both WLS and ML produced a slight skewness regarding tests of both β0 and β1, similar to that in Table 3.

Data were also generated according to Model B (data not shown) and the hypothesis H0: β_i = βT was tested. For all three methods, α* was closest to α for nominal size 0.10. The relative deviation was largest for nominal size 0.01; 0.7α ≤ α* ≤ 2.2α. Again, a skewness (as above) could be seen in the tests of all three parameters, caused by the negative skewness of the test statistics. For tests of the LS-estimate of β1, α* was consistently too large (1.04α ≤ α* ≤ 2.20α), whereas α* was consistently too small in tests of β3 (0.30α ≤ α* ≤ 0.76α). These erroneous risks were, again, the result of under- and overestimation, respectively, of SE(β̂), see Table 2. Tests regarding the LS-estimates of β2 followed the same pattern as the tests of all the WLS- and ML-estimates. The risk of type I error was approximately the same even when the sample size was doubled (n = 216).

The power of the tests, under the alternative hypothesis H1: β1 > 0, was estimated for both α = 0.05 and α* = 0.05, see Table 4. When α = 0.05, ML had the highest power while LS had the lowest. For LS the type I error

Table 3. Risk of type I error (α*) for two-sided and one-sided tests regarding β0 and β1^a.

                      α = 0.010                α = 0.050                α = 0.100
β_i  n    Test     βi≠0   βi<0   βi>0      βi≠0   βi<0   βi>0      βi≠0   βi<0   βi>0
β0   108  LS t     0.000  0.000  0.000     0.001  0.001  0.003     0.004  0.007  0.016
     108  WLS t    0.010  0.012  0.008     0.051  0.055  0.047     0.102  0.106  0.096
     108  ML Wald  0.012  0.013  0.010     0.054  0.056  0.050     0.106  0.106  0.100
     216  LS t     0.000  0.000  0.000     0.001  0.001  0.003     0.003  0.007  0.014
     216  WLS t    0.010  0.011  0.009     0.050  0.053  0.047     0.100  0.103  0.097
     216  ML Wald  0.011  0.012  0.009     0.052  0.053  0.049     0.102  0.104  0.099
β1   108  LS t     0.010  0.010  0.010     0.050  0.050  0.050     0.100  0.101  0.100
     108  WLS t    0.011  0.011  0.010     0.052  0.051  0.051     0.102  0.102  0.101
     108  ML Wald  0.012  0.012  0.011     0.055  0.053  0.053     0.106  0.104  0.103
     216  LS t     0.010  0.010  0.010     0.050  0.050  0.050     0.100  0.100  0.100
     216  WLS t    0.010  0.010  0.010     0.051  0.050  0.051     0.101  0.100  0.101
     216  ML Wald  0.011  0.011  0.011     0.052  0.051  0.052     0.103  0.101  0.102

a Data were generated from Model A where β_i = 0; estimations are based on 4 million iterations with balanced data.

Table 4. The power for tests with alternative hypothesis H1: β1 > 0, where the risk of type I error is set to 0.05^a.

                 Nominal                          True (α* = 0.05)
β1        LS (t)  WLS (t)  ML (Wald)       LS (t)  WLS (t)  ML (Wald)
Crit.     1.659   1.659    1.645           1.659   1.670    1.673
0.00      0.050   0.052    0.055           0.050   0.050    0.050
0.02      0.150   0.151    0.160           0.150   0.149    0.154
0.04      0.321   0.324    0.344           0.321   0.320    0.334
0.06      0.530   0.533    0.562           0.530   0.528    0.551
0.08      0.721   0.723    0.753           0.721   0.720    0.744
0.10      0.857   0.859    0.882           0.857   0.856    0.876
0.12      0.936   0.938    0.952           0.936   0.936    0.949
0.14      0.975   0.976    0.983           0.975   0.975    0.982
0.16      0.991   0.992    0.995           0.991   0.991    0.994
0.18      0.997   0.997    0.998           0.997   0.997    0.998
0.20      0.999   0.999    1.000           0.999   0.999    1.000
0.22      1.000   1.000    1.000           1.000   1.000    1.000

a Data were generated from Model A with different β1 values but constant β0 = 4.804. Estimations are based on 1 million iterations with balanced data and sample size n = 108.
risk was correct, but for WLS and ML the critical values were adjusted to give α* = 0.05. After adjustment ML still had the highest power, while WLS now had a lower power compared to LS. The relative difference between the nominal and the true power was largest for β1 = 0 and decreases with β1.

4.4. Application: 1,3-Butadiene Exposure in Five Swedish Cities

The data on 1,3-butadiene were found to be highly skewed (γ = 2.764), whereas the log-transformed values were approximately symmetrical (γ = 0.279). The median exposure was similar to the geometric mean exposure (0.345 against 0.386). Thus the data can be considered log-normal. Five predictors were included in the model: "Have you lit a woodfire or been in a residence heated with wood burning?" (Wood burning, yes/no), "Are you a smoker?" (Smoker, yes/no), "Have you been in an indoor environment where people were smoking?" (ETS, yes/no), "Proportion of time spent in traffic?" (Traffic), and "City of residence": Umeå (Ume), Stockholm (Sthlm), Malmö (Malmo), Gothenburg (Gbg), and reference category Lindesberg. These predictors had been shown to be significant in previous studies on other datasets [15-17].

Neither Wood burning nor Traffic was significant in any of the regression models (for ML, Traffic was borderline significant, p = 0.059), Figure 3. All three methods showed a significant difference between the cities: in the LS- and WLS-regression models only Gothenburg differed from Lindesberg, but the ML-regression model also showed a significant difference between Malmo and Lindesberg (p = 0.041). Only the ML-regression model showed a significantly lower exposure for non-smokers with no ETS (p = 0.013). There were clear differences in the range of the confidence intervals; with one exception, ML had the narrowest intervals and LS the widest.
A second analysis with only the non-smokers (n = 225) was performed, and then Traffic became significant in the ML-regression model (p = 0.039). Otherwise the same predictors were significant in all three analyses as for the full dataset.

5. Discussion

In this study a new maximum likelihood-based method for estimating a linear regression model for log-normal, heteroscedastic data was evaluated and compared with two least-squares methods. For log-normal data, the log-transformation is often used to stabilize the variance, but this distorts the linear relationship and does not give an estimate of the absolute effect of a predictor. Our simulation study demonstrated that the new maximum likelihood method (ML) provides unbiased estimates of the regression parameters β_i, and so do the least-squares method (LS) and the weighted least-squares method (WLS).

Figure 3. The LS-, WLS- and ML-estimates of the predictor effects, β_i, and their 95% confidence intervals (CI) for regression analysis with 1,3-butadiene as the response.
One reason for proposing a new regression method is the need for a method to estimate a linear regression in the presence of heteroscedasticity. The effect of ignoring the increasing variance is demonstrated in the use of LS, where the standard error, SE(β̂i), was up to 78% larger than that of ML for the examples that were investigated. Since the heteroscedasticity of the data was ignored when using LS, the confidence interval was too wide. The results of our example with 1,3-butadiene data were consistent with those in the simulation study; LS overall had the widest confidence intervals and ML the narrowest for the predictor effects.

For all three methods, the estimates of the expected response, μ̂Y|X, were unbiased. The confidence interval for μ̂Y|X is based on se(μ̂Y|X), which depends on the value of x0. For a model with one predictor, both ML and WLS produced an almost correct confidence interval, with the narrowest interval where se(β̂0 + β̂1·x0) is smallest, whereas LS had its narrowest confidence interval at x0 = X̄ and therefore produces erroneous confidence intervals. For ML and WLS the sample-specific standard error, se(β̂i), was a good estimator of the true standard error, whereas for LS, se(β̂0) greatly overestimated SE(β̂0).

When investigating the true risk of type I error, α*, for tests regarding the regression parameters, the t-tests of the LS estimates of the intercepts produced an α* much smaller than the nominal α. This was a consequence of the too large se(β̂0), causing the distribution of the t-statistics to have fewer observations in the tails than expected. For both the Wald test (ML) and the t-test used for the WLS estimates, α* was approximately equal to α (for nominal sizes {0.01, 0.05, 0.10} the largest relative deviations were α* = {0.021, 0.072, 0.125}). For all three methods, the relative deviation between α* and α was largest at nominal size 0.01, indicating a skewness of the test statistics in the tails. Regarding the power (in tests of β1), ML was superior to LS, regarding both the true and the nominal power.

The precision of an estimated expected value (μ̂Y|X) depends both on the regression method and on the variation of the predictors X1, …, Xk. If a predictor has a very small variation, the estimated effect of that predictor will have a large standard error. In some situations a better estimation of the exposure might be achieved with a model that excludes predictors with very small variation, as has been investigated in [22] for a situation where land-use regression is used to estimate the exposure.

Apart from effect estimation, regression analysis can be used to estimate the expected value of the response variable given certain values of the background factors (e.g. to estimate the personal exposure as a function of easily measured background factors). Regression models can also be used to estimate the exposure for a whole population, provided that the distribution of the background factors is known.

When investigating an exposure-disease association, the measured exposure sometimes includes a stochastic error (exposure_measured = exposure_true + error1), which is unrelated to the stochastic error of the (linear) exposure-response model (response = δ0 + δ1·exposure + error2). The issue with regression analysis where the explanatory variable has measurement errors is raised in [23]; given the true exposure, the measured exposure has a stochastic variation, and different approaches to estimating the true exposure are compared. In situations with measurement errors, the individual- and group-based exposure assessment approaches are often compared, see e.g. [24,25]. In an individual-based approach, each person's measured exposure is used, which often leads to a bias of δ1 towards null. In a group-based approach, each person is assigned an exposure based on group affiliation, which is the same as estimating the exposure from a regression model with only one explanatory variable, Group. If several relevant explanatory variables are used to estimate the exposure (rather than only Group), the standard error will decrease and thus be more similar to an individual-based approach.

A group-based design often leads to errors in the exposure which are of Berkson type (exposure_true = exposure_group + error3), or approximately Berkson type. As discussed in [24], in a true Berkson error model, error3 is independent of exposure_group, whereas the approximate Berkson error may depend on the group size. The approximate Berkson error is, however, independent of error2. A group-based approach often leads to less bias in the δ1 estimate but a larger standard error, see e.g. [24,26]. A group-based approach with a log-linear exposure-response model can have a substantial bias in the estimated exposure-response effect [27].

The evaluation of the new ML method, and the comparison to other regression methods, was conducted by a simulation study and exemplified on a dataset. The simulation study allowed us to assess, with great precision, the bias, the standard error and the distribution of the parameter estimates, as well as the risk of type I error and the power of the methods. Because of the large number of iterations, our results on bias, standard error and inference can be considered valid for log-normal data where the relationship is linear. The simulation study allowed us to compare the parameters of the simulation model ("true β values") to the estimates from each method and also to see how these estimates can vary. This would not have been possible had we only used empirical data sets, in which the "true" values are seldom known. Also, to get a reliable estimate of the distribution of the estimated parameters, a large number
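The contrast between classical error (error1) and Berkson-type error (error3) can be illustrated with a small simulation. This is a hedged, self-contained sketch; the group sizes, the variances and the use of the true group mean as the assigned exposure are our own illustrative choices, not the designs evaluated in [24-26]. It shows the two behaviours described above: classical error attenuates the estimated slope δ1 towards null, while the group-based (Berkson-type) assignment leaves it approximately unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, per_group = 40, 50
delta0, delta1 = 1.0, 2.0                 # true exposure-response parameters

# True exposure varies both between and within groups.
group_mean = rng.normal(10.0, 1.0, n_groups)
exposure_true = (np.repeat(group_mean, per_group)
                 + rng.normal(0.0, 1.0, n_groups * per_group))
response = (delta0 + delta1 * exposure_true
            + rng.normal(0.0, 1.0, exposure_true.size))   # error2

# Individual-based: measured exposure carries classical error (error1),
# which attenuates the fitted slope towards null.
exposure_measured = exposure_true + rng.normal(0.0, 1.0, exposure_true.size)
slope_individual = np.polyfit(exposure_measured, response, 1)[0]

# Group-based: everyone is assigned the group mean, so the deviation
# exposure_true - exposure_group acts as a Berkson-type error (error3);
# the slope stays approximately unbiased but uses less information.
exposure_group = np.repeat(group_mean, per_group)
slope_group = np.polyfit(exposure_group, response, 1)[0]
```

With these variances the classical-error slope is attenuated by roughly the reliability ratio Var(exposure_true)/(Var(exposure_true) + Var(error1)), while the group-based slope centres on δ1, at the price of a larger standard error.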
of data sets would be needed.

In the simulation study, data were generated from a log-normal heteroscedastic distribution with certain parameter values. However, empirical data might be only approximately log-normal and thus exhibit a larger variance than the one in a perfect log-normal distribution. Thus the results from the simulation study might not hold completely for real data sets of e.g. exposure data. Also, in the simulation study the predictors were assumed to be measured without measurement errors, which might be unrealistic for some predictors. Therefore, evaluation of the new ML method should also be made on several empirical data sets.

Two specific simulation models were used in this study, where the parameters were estimated from an empirical dataset; the simulation models can therefore be considered fairly realistic. The results on bias and standard error are likely to be valid also in models with other parameter values. The parameter estimates are unbiased both for a simple model (with intercept and one explanatory variable) and for a model with three explanatory variables and no intercept, and in both situations ML gives the smallest standard errors. This suggests that the results can be generalized to other datasets.

The ML estimates are derived through iterations, and in some situations (sometimes caused by unsuitable start values or a large variance in the data) the iterations do not converge. This may lead to unrealistic estimates. In those cases, censoring data larger than the 95% quantile will stabilize the results but will cause underestimation of the intercept and the standard errors. In those situations WLS will generally give the best result.

In this study the methods have been evaluated for relatively large samples (n = {108, 216}), and the conclusion was that the asymptotic properties of maximum likelihood estimates are valid for these sample sizes.
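The weighted least squares alternative referred to above can be sketched as an iteratively reweighted fit. This is a hedged illustration, not necessarily the authors' implementation: the weighting scheme w_i = 1/mean_i² (motivated by a standard deviation proportional to the mean, as for log-normal heteroscedastic data), the LS start values and the fixed iteration count are our own assumptions.

```python
import numpy as np

def wls_constant_cv(X, y, n_iter=20):
    """Iteratively reweighted least squares for E[Y|x] = x'beta when
    SD(Y|x) is proportional to the mean (constant coefficient of
    variation). The weights w_i = 1/mean_i^2 are refreshed from the
    current fitted means on each pass."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # LS start values
    for _ in range(n_iter):
        mean = np.clip(X @ beta, 1e-8, None)         # guard against <= 0
        w = 1.0 / mean**2
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)   # weighted normal equations
    return beta
```

Because each pass only solves a weighted linear system, this scheme needs no numerical optimizer and has no start-value sensitivity beyond the initial LS fit, which is one reason WLS can be more robust when the ML iterations fail to converge.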
However, for smaller samples further evaluation is needed, especially regarding the distribution of the estimates. In other studies regarding smaller samples of untransformed log-normal data, likelihood-based approaches have been suggested for constructing confidence intervals for the mean and the mean response [28,29].

This study illustrated that the new maximum likelihood-based regression method has good properties and can be used for linear regression on log-normal heteroscedastic data. It provides an estimate of the absolute effect of a predictor, while taking the increasing variance into account in an optimal way. The study also demonstrated that the results from the weighted least squares regression (where the increasing variance is accounted for) were very similar to those from the ML method. Although ML gave slightly narrower confidence intervals and had higher power regarding β1, both WLS and ML can be recommended for linear regression models for log-normal heteroscedastic data.

6. Acknowledgements

We would like to thank Urban Hjort for sharing his MATLAB codes for the regression analysis.

REFERENCES

[1] P. O. Osvoll and T. Woldbæk, "Distribution and Skewness of Occupational Exposure Sets of Measurements in the Norwegian Industry," Annals of Occupational Hygiene, Vol. 43, No. 6, 1999, pp. 421-428.

[2] K. M. McGreevy, S. R. Lipsitz, J. A. Linder, E. Rimm and D. G. Hoel, "Using Median Regression to Obtain Adjusted Estimates of Central Tendency for Skewed Laboratory and Epidemiologic Data," Clinical Chemistry, Vol. 55, No. 1, 2009, pp. 165-169. doi:10.1373/clinchem.2008.106260

[3] R. Branham Jr., "Alternatives to Least Squares," The Astronomical Journal, Vol. 87, 1982, pp. 928-937. doi:10.1086/113176

[4] A. C. Olin, B. Bake and K. Toren, "Fraction of Exhaled Nitric Oxide at 50 mL/s—Reference Values for Adult Lifelong Never-Smokers," Chest, Vol. 131, No. 6, 2007, pp. 1852-1856. doi:10.1378/chest.06-2928

[5] J. O. Ahn and J. H.
Ku, "Relationship between Serum Prostate-Specific Antigen Levels and Body Mass Index in Healthy Younger Men," Urology, Vol. 68, No. 3, 2006, pp. 570-574. doi:10.1016/j.urology.2006.03.021

[6] L. Preller, H. Kromhout, D. Heederik and M. J. Tielen, "Modeling Long-Term Average Exposure in Occupational Exposure-Response Analysis," Scandinavian Journal of Work, Environment & Health, Vol. 21, No. 6, 1995, p. 8. doi:10.5271/sjweh.67

[7] M. Watt, D. Godden, J. Cherrie and A. Seaton, "Individual Exposure to Particulate Air Pollution and Its Relevance to Thresholds for Health Effects: A Study of Traffic Wardens," Occupational and Environmental Medicine, Vol. 52, No. 12, 1995, pp. 790-792. doi:10.1136/oem.52.12.790

[8] B. Dickey, W. Fisher, C. Siegel, F. Altaffer and H. Azeni, "The Cost and Outcomes of Community-Based Care for the Seriously Mentally Ill," Health Services Research, Vol. 32, No. 5, 1997, p. 599.

[9] R. Kilian, H. Matschinger, W. Loffler, C. Roick and M. C. Angermeyer, "A Comparison of Methods to Handle Skew Distributed Cost Variables in the Analysis of the Resource Consumption in Schizophrenia Treatment," Journal of Mental Health Policy and Economics, Vol. 5, No. 1, 2002, pp. 21-32.

[10] J. P. T. Higgins, I. R. White and J. Anzures-Cabrera, "Meta-Analysis of Skewed Data: Combining Results Reported on Log-Transformed or Raw Scales," Statistics in Medicine, Vol. 27, No. 29, 2008, pp. 6072-6092. doi:10.1002/sim.3427

[11] X.-H. Zhou, S. Gao and S. L. Hui, "Methods for Comparing the Means of Two Independent Log-Normal Samples," Biometrics, Vol. 53, No. 3, 1997, pp. 1129-1135. doi:10.2307/2533570

[12] T. H. Wonnacott and R. J. Wonnacott, "Regression: A Second Course in Statistics," Wiley, New York, 1981.

[13] Y. Yurgens, "Quantifying Environmental Impact by Log-Normal Regression Modelling of Accumulated Exposure," Licentiate of Engineering thesis, Chalmers University of Technology and Göteborg University, Gothenburg, 2004.

[14] S. Johannesson, P. Gustafson, P. Molnar, L. Barregard and G. Sallsten, "Exposure to Fine Particles (PM2.5 and PM1) and Black Smoke in the General Population: Personal, Indoor, and Outdoor Levels," Journal of Exposure Science and Environmental Epidemiology, Vol. 17, No. 7, 2007, pp. 613-624. doi:10.1038/sj.jes.7500562

[15] G. J. Dollard, C. J. Dore and M. E. Jenkin, "Ambient Concentrations of 1,3-Butadiene in the UK," Chemico-Biological Interactions, Vol. 135-136, 2001, pp. 177-206. doi:10.1016/S0009-2797(01)00190-9

[16] Y. M. Kim, S. Harrad and R. M. Harrison, "Concentrations and Sources of VOCs in Urban Domestic and Public Microenvironments," Environmental Science & Technology, Vol. 35, No. 6, 2001, pp. 997-1004. doi:10.1021/es000192y

[17] P. Gustafson, L. Barregard, B. Strandberg and G. Sallsten, "The Impact of Domestic Wood Burning on Personal, Indoor and Outdoor Levels of 1,3-Butadiene, Benzene, Formaldehyde and Acetaldehyde," Journal of Environmental Monitoring, Vol. 9, No. 1, 2007, pp. 23-32. doi:10.1039/b614142k

[18] U. Bergendorf, K. Friman and H. Tinnerberg, "Cancerframkallande ämnen i tätortsluft—Personlig exponering och bakgrundsmätningar i Malmö 2008," Report to the Swedish Environmental Protection Agency, Department of Occupational and Environmental Medicine, Malmö, 2010.

[19] W. H. Greene and C. Zhang, "Econometric Analysis," Vol. 5, Prentice Hall, Upper Saddle River, 2003.

[20] J. Hass, M. D. Weir and G. B. Thomas, "University Calculus," Pearson Addison-Wesley, 2008.

[21] Y. Pawitan, "In All Likelihood: Statistical Modelling and Inference Using Likelihood," Oxford University Press, Oxford, 2001.

[22] A. A. Szpiro, C. J. Paciorek and L. Sheppard, "Does More Accurate Exposure Prediction Necessarily Improve Health Effect Estimates?" Epidemiology, Vol. 22, No. 5, 2011, pp. 680-685. doi:10.1097/EDE.0b013e3182254cc6

[23] Y. Guo and R. J. Little, "Regression Analysis with Covariates That Have Heteroscedastic Measurement Error," Statistics in Medicine, Vol. 30, No. 18, 2011, pp. 2278-2294. doi:10.1002/sim.4261

[24] H. M. Kim, D. Richardson, D. Loomis, M. Van Tongeren and I. Burstyn, "Bias in the Estimation of Exposure Effects with Individual- or Group-Based Exposure Assessment," Journal of Exposure Science and Environmental Epidemiology, Vol. 21, No. 2, 2011, pp. 212-221. doi:10.1038/jes.2009.74

[25] S. Rappaport and L. Kupper, "Quantitative Exposure Assessment," Stephen Rappaport, 2008.

[26] E. Tielemans, L. L. Kupper, H. Kromhout, D. Heederik and R. Houba, "Individual-Based and Group-Based Occupational Exposure Assessment: Some Equations to Evaluate Different Strategies," The Annals of Occupational Hygiene, Vol. 42, No. 2, 1998, pp. 115-119.

[27] K. Steenland, J. A. Deddens and S. Zhao, "Biases in Estimating the Effect of Cumulative Exposure in Log-Linear Models When Estimated Exposure Levels Are Assigned," Scandinavian Journal of Work Environment & Health, Vol. 26, No. 1, 2000, pp. 37-43. doi:10.5271/sjweh.508

[28] J. Wu, A. C. M. Wong and G. Jiang, "Likelihood-Based Confidence Intervals for a Log-Normal Mean," Statistics in Medicine, Vol. 22, No. 11, 2003, pp. 1849-1860. doi:10.1002/sim.1381

[29] J. Wu, A. C. M. Wong and W. Wei, "Interval Estimation of the Mean Response in a Log-Regression Model," Statistics in Medicine, Vol. 25, No. 12, 2006, pp. 2125-2135. doi:10.1002/sim.2329
