Journal of Geoscience and Environment Protection, 2014, 2, 1626 Published Online June 2014 in SciRes. http://www.scirp.org/journal/gep http://dx.doi.org/10.4236/gep.2014.23003 How to cite this paper: Tan, S. Q. et al. (2014). A Statistical Model for LongTerm Forecasts of Strong Sand Dust Storms. Journal of Geoscience and Environment Protection, 2, 1626. http://dx.doi.org/10.4236/gep.2014.23003 A Statistical Model for LongTerm Forecasts of Strong Sand Dust Storms Siqi Tan1, Moinak Bhaduri2, ChihHsiang Ho2* 1Covance Pharmaceutical Research & Development Co. Ltd, Beijing, China 2Department of Mathematical Sciences, University of Nevada, Las Vegas, USA Email: *chho@unlv.nevada.edu Received February 2014 Abstract Historical evidence indicates that dust storms of considerable ferocity often wreak havoc, posing a genuine threat to the climatic and societal equilibrium of a place. A systematic study, with empha sis on the modeling and forecasting aspects, thus, becomes imperative, so that efficient measures can be promptly undertaken to cushion the effect of such an unforeseen calamity. The present work intends to discover a suitable ARIMA model using dust storm data from northern China from March 1954 to April 2002, provided by Zhou and Zhang (2003), thereby extending the idea of em pirical recurrence rate (ERR) developed by Ho (2008), to model the temporal trend of such sand dust storms. In particular we show that the ERR time series is endowed with the following charac teristics: 1) it is a potent surrogate for a point process, 2) it is capable of taking advantage of the well developed and powerful time series modeling tools and 3) it can generate reliable forecasts, with which we can retrieve the corresponding mean number of strong sand dust storms. A simula tion study is conducted prior to the actual fitting, to justify the applicability of the proposed tech nique. Keywords ARIMA Model, Empirical Recurrence Rate, ERR Plot, Point Process, Time Series 1. Introduction Whimsical and extreme weather patterns have continued to baffle mankind since time immemorial. The past decade has witnessed an unprecedented loss of lives and property primarily due to our inability to adequately foresee weather demons. It’s high time for modelers to come up with new and efficient techniques replacing traditional ones, to tame the grave consequences—a need which is now felt more badly than ever. The focus of the present study would be on one such bane of human civilization: sand storms. But the methods pro posed can be easily extended to other unforeseen climatic dangers like thunderstorms or hurricanes. Lack of a general consensus among researchers in adequately defining a dust storm leads us to use the following defini tion due to Tao et al. (2002), specific to Inner Mongolia, China: *
S. Q. Tan et al. Strong dust storm: at least three stations reporting with horizontal visibility of less than 500m and an average wind speed of 17.2 to 24.4 m/s. Very strong dust storm: at least one station reporting with horizontal visibility of less than 50m and an aver age wind speed of 20.8 m/s or greater. These storms can inflict far reaching and grave consequences, including, but not limited to climatic change, chemical and biological alterations in oceans, affecting crop growth and worsening the general survival condi tions. Neutralization of acid rain is a positive effect that results. Several climatic factors such as wind, relative humidity, air temperature etc. contribute to the occurrence of these storms and this paper concentrates on storms originating from Hexi Corridor of Gansu Province and Alxa Plateau, southern rim of South Xinjiang Basin and central Inner Mongolia from 1954 to 2002. Thomas T. War ner (2004), Zhou and Zhang (2003), Yang et al. (2007), and Zhang et al. (2002) provide a gripping account of the changes in their frequency patterns over the last 50 years. By developing an empirical recurrent rate (ERR) time series, we propose a new treatment to smooth the point process. The ERR is computed sequentially and cumulatively at equidistant time intervals and then we explore the possibility of fitting an ARIMA model to develop reliable and robust forecasts. Designed simulations give valuable insights for building the final model by helping us to mimic the real data. We intend to proceed as follows: The first section lays out the new definitions we propose, and outlines the rudiments of ARIMA modeling along with rules to help us choose a good model from several competing ones. The next section implements the ideas on a small scale, using a simulated data set, to judge the merit of the pro posed methods. Encouraged by the success, we next move on to apply similar methods on the real data that we gathered, but firstly using a partitioned version that enables us to test the prediction capabilities and finally, in search of a more reliable and global model, to the entire data set. The necessity of ignoring a few values towards the beginning of the series is highlighted in the section on prediction sets. Encouraging forecasts are obtained throughout the study and finally we briefly detail how can these methods be extended to similar fields. 2. Theory and Method 2.1. Empirical Recurrence Rate Treating as the time origin and h as the time step, we generate a time series of the ERR, defined by: ( 00 no of storms in ,] l l t tlh n zlh lh + = = (1) at the equidistant time intervals 00 0 ,...,,..., (0, the present time).thtlhtNh+ ++= We observe that evolves over time and is the MLE of the mean, if the underlying process in is a homogeneous Poisson process. With a forecast horizon of , we need to predict the value of based on the finite history: . The untenable classical regression assumption of the independence of the time evolving observations necessitates the introduction of ARIMA models. 2.2. ARIMA Model A general autoregressive moving average (ARMA) model of order is built on the premise that the cur rent value of a mean zero stationary time series is a linear combination of “p” of its immediate past vales and “q” of its past noise terms and is thus a generalization, embracing the usual AR and MA models. It is thus given by: 11 pq tititjt j ij X XZZ φθ −− = = −=+ ∑∑ (2) where is a Gaussian white noise with mean zero and variance . The coefficients are the AR and MA parameters, respectively. The construction of an ARIMA model proceeds through three stages: 1) Identification: Initial estimates for q, p, d are obtained through the sample autocorrelation function (ACF) or the sample partial autocorrelation (PACF) plots or by using goodnessoffit criteria to choose the best among several competing models.
S. Q. Tan et al. 2) Estimation: We use the maximum likelihood method to estimate the unknown parameters in the model. 3) Diagnostic checking: Tests for normality using the residuals and significance of the model parameters. Honoring the principle of parsimony, we’ll choose the model that has the lowest number of parameters. 2.3. Data Pretreatment We will use the method of cross validation to partition the data set into two blocks, so that one can be used for model building purposes and the other can be used as refinement tools. In addition, transformations on the data to guarantee stationarity and zero centrality of a time series can be achieved through: 1) Box Cox Transformation: For a given and positive observations , it is defined as: ( )( ) 1 if 0 log if 0 y fy y λ λ λ λ λ −≠ = = (3) This transformation helps to stabilize variance, imperative to achieve stationarity. 2) Differencing: A time series can be detrended by considering the dth order difference, for a suitable d: (4) where is the usual backshift operator. A lag d difference operator (5) can remove the seasonality and trend of period d. A suspected nonstationarity can be tackled by the transformation and an appropriate order of differencing, which can be dictated by the graph of the ACF function. A quick decay indicates stationarity. 3) Centering: Once stationarity is achieved, we can subtract the sample mean of the transformed data from each observation to get a mean zero process. 2.4. Model Diagnostics Asymptotically, the sample autocorrelations of an independent and identically distributed sequence with finite variance are approximately iid following . Thus the usual normal bounds can be used to check whether the observed residuals are consistent with the iid noise assumption. A statistical test for the ran domness of the residuals can be done through the wellestablished LjungBox (1978) test statistic. Finally, we can choose between different competing models through the use of the AIC, AICC and BIC criteria. 2.5. Forecasting Predictions for the future will be achieved from forecasting functions of the form: (6) by forecasting the residuals and then inverting the transformations adopted, to arrive at the forecasts of the original series. Models will be put to test by comparing their generated forecasts with the prediction set and fi nally, we will combine both the training and the prediction set to obtain reliable forecasts from the chosen model. 2.6. Subset Model Checking For causal models (e.g., Brock well et al., 2002), the coefficients will be reported as the ratio of each estimate to 1.96 times its standard error. This denominator is its critical value, at 5% level. Thus, a ratio with magnitude less than unity will raise the suspicion that the corresponding coefficient might be nonsign if i c ant and we might set
S. Q. Tan et al. tle for a smaller subset model, always lucrative from an interpretative viewpoint. This simpler model, in turn will then be subjected to the usual model selection processes. 3. Simulation Study The number of monthly sandstorms from March 1954 to April 2002, (Zhou & Zhang, 2003) reviews the pres ence of too many zeroes in the time series, which questions the justifiability of treating it as a Poisson process. Ensuring stationarity, thus necessitates proper smoothing methods. Prior to working with the actual data, we intend to investigate the performance of the proposed methods based on a simulated data set, consisting of 17 repetitions of the randomly selected year 1996 from Zhou and Zhang (2003). We converted the data to an ERR time series and used the techniques detailed in the previous section based on a training sample of the first 15 years and a prediction set of the last 2 years. The raw ERR (in 1 month) series was seen to be plagued by significant nonstationarity and periodicity from its sample ACF and PACF graphs. So a BoxCox transformation (with ) and lag 12 differencing were applied and finally, the resulting series was detrended by an additional lag 1 difference. The ACF and PACF plots (Figure 1) reveal that stationarity is clearly attained. To work with a mean zero model, the sample mean of the transformed data set was subtracted from each observation and an MA(1) model was seen adequate from our initial model search. The estimated (MLE) model is given by: (7) with an estimated white noise variance of 0.049 and a standard error of the MA coefficient as 0.089. Here, represents a twicedifferenced mean corrected time series and represents a white noise process. The AIC statistic is −25.78 and the LjungBox test with a high pvalue of 0.94 indicates that the residuals are approximately white noise. That this model is successful from a prediction view point, can be observed from Figure 2. It predicts a mean number of 14 sand storms in April for the two prediction years which is fairly close to the actual number. Table 1 describes this too. Table 1. Comparison of the actual and predicted ERRs from the simulation study. Predicted month 2 1.319 1.307 0 0 3 1.311 1.294 0 0 5 1.383 1.363 14 13.890 7 1.368 1.337 0 0 10 1.347 1.300 0 0 13 1.326 1.256 0 0 15 1.312 1.220 0 0 17 1.380 1.281 14 13.321 18 1.374 1.263 0 0 20 1.360 1.227 0 0 22 1.346 1.191 0 0
S. Q. Tan et al. Figure 1. Sample ACF and PACF plots showing that stationarity is attained. This shortscale simulation study gives enough evidence of the merit of the proposed method and paves the way for a more detailed real data analysis, which we intend to carry out on an exactly similar vein.
S. Q. Tan et al. Figure 2. Comparison of the observed and predicted ERRs from the simulation study. 4. Real Data Analysis Using the data source provided by Zhou and Zhang (2003), we intend to carry out similar techniques on the 908 sand storms that occurred in 578 months, from March, 1954 to April 2002, using a time step of 1 month. 4.1. Model Fitting Attempts to model the sand storm data through ARIMA techniques, by taking the BoxCox transformation and differencing at lags 12 and 1 failed, since stationarity was difficult to achieve. An ARIMA modeling using a noncumulative ERR approach proved futile too. Here, we calculated the ERRs for each of the 55 years and were able to reduce the number of zeroes, but since they were calculated independently for each year, there was a huge fluctuation in the series and stationarity was elusive even through the usual transformations. Thus, we took recourse to the smoothing methods using ERR detailed earlier, dropping the first 13 months, to tame the otherwise widely varying series. It might be observed at this point that a valid data set is an extremely precious component in any modeler’s arsenal and deletion of a subset of it often obscures an otherwise imminent inference. But, in this case, efforts to otherwise circumnavigate this problem of a large initial variance, proved futile. Another justification for the data deletion might be from the relative position of the subset deleted. In real life studies like this, it is only natural for the series to take some time to stabilize and hence the first few errant values are not too likely to exert significant influence on our final inference. We worked with a modified data set of 565 months of sand storms (April 1955April 2002), where the first 537 formed the training sample and the last 28 played the role of a prediction set. Encouraged by the success of the model it generated, we carried out a similar exercise to the one described with the simulated data set, taking clues from the sample ACF and PACF graphs at each stage and using a BoxCox transformation (with ) and similar lags of 12 and 1, we have our ARMA (1,1) (MLE) model as: 11 0.995 0.2688 t ttt X XZZ −− = ++ (8) with an estimated white noise variance of 0.29, standard error of AR coefficients as 0.0004 and a standard error of MA coefficients as 0.05. The AIC statistic was 842.38, but the LjungBox test gave a significant pvalue of approximately zero, which indicated that the residuals cannot be treated as white noise. We explored the possibility of working with a sim pler model and found that for the ARMA (1,1) model, the ratio of the AR coefficient was 1.008 and that for the MA coefficient was −0.451 which is less than 1. So a simpler AR(1) model seemed logical and the estimated (MLE) model turned out to be: (9)
S. Q. Tan et al. with an estimated white noise variance of 0.29 and a standard error of the AR coefficient as 0.003. This time, the LjungBox test gave a nonsignificant pvalue of 0.155, which showed that the residuals are approximately white noise. Figure 3 below confirms this using the appropriate residual plots. Figure 4 shows the comparisons between the actual and predicted ERRs. Table 2 does this comparison too and shows that the ERR predictions are rather close to the actual values. 4.2. Full Data Forecasting Next, we proceeded to combine the training and the prediction set to find a good final model based on all the available 565 points from April 1955 to April 2002. Similar transformations (with the same coefficient and lag values) gave the initial estimated ARMA (1, 1) (MLE) model as: Figure 3. Diagnostics for Model 1.
S. Q. Tan et al. Figure 4. Comparison of the observed and predicted ERRs from Model 1. Table 2. Comparison of the actual and predicted ERRs from the real data analysis. Predicted month Monthly ERR Mean number Actual Prediction Actual Prediction February 1.468 1.468 0 0.122 March 1.465 1.465 0 0.190 April 1.482 1.473 11 5.367 May 1.487 1.478 4 4.404 June 1.484 1.476 0 0.373 July 1.482 1.474 0 0.441 August 1.479 1.472 0 0.504 September 1.476 1.470 0 0.572 October 1.473 1.469 0 0.629 November 1.471 1.467 0 0.698 December 1.472 1.466 2 0.761 January 2001 1.469 1.465 0 0.884 February 1.466 1.464 0 1.012 March 1.473 1.463 5 1.130 April 1.562 1.473 51 6.384 May 1.567 1.480 4 5.461 June 1.564 1.480 0 1.501 July 1.561 1.480 0 1.6 August 1.558 1.481 0 1.744 September 1.556 1.481 0 1.87 October 1.553 1.482 0 1.994 November 1.55 1.484 0 2.113 December 1.547 1.485 0 2.238 January 2002 1.544 1.487 0 2.423 February 1.542 1.489 0 2.602 March 1.551 1.491 7 2.782 April 1.569 1.503 12 8.035 11 0.99 0.63 t ttt X XZZ −− = ++ (10) with an estimated white noise variance of 0.38, the standard errors of the AR and MA coefficients being 0.003 1.4 1.42 1.44 1.46 1.48 1.5 1.52 1.54 1.56 1.58 135 7911 13 1517 1921 23 25 27
S. Q. Tan et al. and 0.033, respectively. But, once again, the LjungBox test with a significant pvalue of approximately zero re jects the hypothesis that the residuals are approximately white noise. (AIC = 1051.94). An identical search for a simpler model led us to opt for AR(1) (since the ratio of the MA coefficient in the ARMA (1,1) model was −0.44, a value less than 1) and the estimated (MLE) model was: (11) with an estimated white noise variance of 0.47 and a standard error of the AR coefficient as 0.003. A nonsignificant pvalue of 0.75 from the LjungBox test now assures that the residuals can be treated as white noise, further confirmation of which was provided by the plots of the residual ACF and PACF (AIC = 1155.07). The forecasts from the full data are shown in Table 3 below. The prediction set plays a pivotal role in model selection, because two models might fit the training data equally well, even with rather different prediction capabilities. This set lets one know the consequences of de leting the first few data points (where we had high variation), called the burnin period. To understand this, we recalculate the ERR figures based on the entire data set and delete the first one to arrive at an estimated AR(1) model given by: (12) Here, the first 549 points serve as the training set and the last 28 as the prediction set. Defining this as Model 2, we compared its forecasting ability with Model 1 (9), (which was obtained by deleting the first 13 data points), in Figure 4 and Figure 5. The forecasts from Model 1 appear to be more realistic in showing the seasonality of the sand storms in Northern China. Table 3. Predictions from the final model. Predicted month Monthly ERR Mean number Prediction Prediction (Adjusted) May2002 1.574 3.934 June 1.571 0 July 1.568 0 August 1.565 0 August 1.562 0 September 1.559 0 October 1.556 0 November 1.553 0 December 1.550 0 January2003 1.547 0 February 1.556 6.990 March 1.574 12.086 April 1.578 3.863 May 1.575 0 June 1.572 0 July 1.569 0 August 1.565 0 September 1.562 0 October 1.559 0 November 1.556 0 December 1.553 0 January2004 1.550 0 February 1.559 6.983 March 1.577 12.184
S. Q. Tan et al. Figure 5. Comparison of the observed and predicted ERRs from Model 2. 5. Conclusions This work successfully finds an acceptable ARIMA model for the sand dust storms occurring in Northern China, through a judicious use of empirical recurrence rates (ERRs). These rates were calculated at equidistant time points and created a useful connection between classical time series and a point process that enabled us to effi ciently model a random phenomenon that clearly doesn’t follow a Poisson process and hence, otherwise seemed hopelessly intractable, through traditional techniques. We hope that this paper emphasizes the fact that ERR is an effective way to handle a large class of meteorological data that exhibit similar seasonality and we strongly believe that identical ideas can be profitably extended to other fields like biology, economics and social science. It is comforting and encouraging to notice that similar techniques have been profitably used by researchers (e.g., Amei, et al., 2012; Ho, 2010) to model identical natural calamities like large scale earthquakes and volcanic eruptions for longterm predictions. Not only should these evidence provide further confirmation of the applica bility of this new technique, but should also stimulate serious research investigating fresher avenues like the one proposed. References Amei, A., Fu, W., & Ho. CH. (2012). Time Series Analysis for Predicting the Occurrences of Large Scale Earthquakes. In ternational Journal of Applied Science and Technology, 2, 6475. Box, G. E. P., & Jenkins, G. M. (1976). Time Series Analysis: Forecasting and Control. San Francisco: HoldenDay. Brockwell, P. J., & Davis, R. A. (2002). Introduction to Time Series and Forecasting (2nd ed.). New York: SpringerVerlag. http://dx.doi.org/10.1007/b97391 Cowpertwait, P. S. P., & Metcalfe, A. V. (2009). Introductory Time Series with R. New York: Springer. Crye r, J. D., & Chan, K. S. (2008). Time Series Analysis with Applications in R (2nd ed.). New York: Springer. http://dx.doi.org/10.1007/9780387759593 Goudie, A. S., & Middleton, N. J. (1992). The Changing Frequency of Dust Storms through Time. Climatic Change, 20, 197225. http://dx.doi.org/10.1007/BF00139839 Ho, C.H. (2008). Empirical Recurrent Rate Time Series for Volcanism: Application to Avachinsky Volcano, Russia. Jour nal of Volcanology and Geothermal Research, 173, 1525. Ho, C.H. (2010). Hazard Area and Recurrence Rate Time Series for Determining the Probability of Volcanic Disruption of the Proposed HighLevel Radioactive Waste Repository at Yucca Mountain, Nevada, USA. Bulletin of Volcanology, 72, 205219. http://dx.doi.org/10.1007/s0044500903093 Li, W. K. (2004). Diagnostic Checksin Time Series. Florida: Chapman& Hall/CRC. Shumway, R. H., & Stoffer, D. S. (2006). Time Series Analysis and Its Applications with R Examples. New York: Springer. Tao, G., Jingtao, L., Xiao, Y., Ling, K., Yida, F., & Yinghua, H. (2002). Objective Pattern Discrimination Model for Dust Storm Forecasting. Meteorological Applications, 9, 5562. http://dx.doi.org/10.1017/S1350482702001068 Warner, T. T. (2004). Desert Meteorology. New York: Cambridge University Press. Yang, B., Brauning, W., Zhang Z. Y., Dong Z. B., & Esper, J. (2007). Dust Storm Frequency and Its Relation to Climate Changes in Northern China during the Past 1000 Years.Atmospheric Environment, 41, 92889299.
S. Q. Tan et al. http://dx.doi.org/10.1016/j.atmosenv.2007.09.025 Zhang, Q. Y., Zhao, X. Y., Zhang, Y., & Li, L. (2002). Preliminary Study on San dDust Storm Disaster and Countermea sures in China. Chinese Geographical Science, 12, 913. http://dx.doi.org/10.1007/s1176900200642 Zhou, Z. J., & Zhang, G. C. (2003). Typical Strong Sand Storm Events in the Northern China. Chinese Science Bulletin, 48, 12241228.
