
This paper proposes new methods of estimating missing values in time series data and compares them with existing methods. The new methods are based on the row, column and overall averages of time series data arranged in a Buys-Ballot table with m rows and s columns. The methods assume that 1) only one value is missing at a time, 2) the trending curve may be linear, quadratic or exponential and 3) the decomposition method is either Additive or Multiplicative. The performance of the methods is assessed by comparing accuracy measures (MAE, MAPE and RMSE) computed from the deviations of the estimates of the missing values from the actual values used in simulation. Results show that, under the stated assumptions, estimates from the new method based on full decomposition of a series are the best (in terms of the accuracy measures) when compared with the two other new methods and the existing methods.

The analysis of time series data constitutes an important area of statistics especially in identifying the nature of the phenomenon represented by the sequence. However, missing observations in time series data are very common [

In time series analysis, a problem frequently encountered in data collection is a missing observation. Missing observations may be virtually impossible to obtain, either because of time or cost constraints. In order to obtain estimates of these observations, there are different options available to the researcher. One of the options is to replace them by the mean of the series. The missing observation may be replaced with naive forecast or with the average of the last two known observations that bound the missing observation [

Using the Bode-Shannon representation of random processes and the “state-transition” method of analysis of dynamic systems, Kalman [

As the literature reveals, missing values in time series have attracted considerable research attention. Several approaches to estimating missing values, such as the use of ARIMA models, have continued to evolve. Among them are the optimal linear combination of the forecast and back forecast method [Damsleth [

Another easily applicable spectral estimator for missing data is the method of Scargle [

Brockwell and Davis [

Yuan et al. [

Cheema [

Indeed, procedures have been developed by statisticians to mitigate problems caused by missing data and various estimation methods have been reportedly used by different researchers to replace missing values [

In time series, it is assumed that the data consist of observations made sequentially in time, comprising a systematic pattern (usually a set of identifiable components) and random noise (error). So, when some observations are missing, the condition for applying a time series model is violated. The systematic pattern includes the trend (denoted $T_t$), seasonal (denoted $S_t$) and cyclical (denoted $C_t$) components. The random noise (or error, irregular component) is denoted $I_t$ or $e_t$, where $t$ stands for the particular point in time. These four classes of time series components may or may not coexist in real-life data, and they can adopt different specific functional relationships. They can be combined in an additive (additive seasonality) or a multiplicative (multiplicative seasonality) fashion, and can also take other forms such as the pseudo-additive/mixed model, which combines elements of both the additive and multiplicative models. The Additive, Multiplicative and Pseudo-Additive/Mixed models are given in Equations (1.1)-(1.3) respectively:

$$X_t = T_t + S_t + C_t + I_t, \quad t = 1, 2, \ldots, n \tag{1.1}$$

$$X_t = T_t \times S_t \times C_t \times I_t, \quad t = 1, 2, \ldots, n \tag{1.2}$$

$$X_t = T_t \times S_t \times C_t + I_t, \quad t = 1, 2, \ldots, n \tag{1.3}$$

Cyclical variation, which refers to the long-term oscillations or swings about the trend, appears to an appreciable magnitude only in data sets covering long periods. However, if short periods of time are involved (which is true of all examples in this study), the cyclical component is superimposed into the trend [

$$X_t = M_t + S_t + I_t, \quad t = 1, 2, \ldots, n \tag{1.4}$$

$$X_t = M_t \times S_t \times I_t, \quad t = 1, 2, \ldots, n \tag{1.5}$$

and

$$X_t = M_t \times S_t + I_t, \quad t = 1, 2, \ldots, n \tag{1.6}$$

The pseudo-additive model is used when the original time series contains very small or zero values. However, this work will discuss only the additive and multiplicative models.

Missing values can lead to erroneous conclusions about data, and substituting them may introduce inaccuracies: false results and forecasts, with errors or data skews proliferating across subsequent runs and causing a larger cumulative error. Most analytical methods cannot be performed if there are missing values in the data. Furthermore, existing methods do not consider the model structure (i.e. whether the model is Additive or Multiplicative) or trending curves other than the linear (quadratic, exponential, etc.). Moreover, as can be seen from the literature, the seasonal component of the time series was not taken into consideration in developing estimation methods. Therefore, the ultimate objective of this study is to develop methods of estimating missing values which take into consideration the model structure and trending curve. The specific objectives are:

1) To review existing methods of estimating missing values.

2) To develop new methods of estimating missing values in time series.

3) To assess the performance of the methods of estimating missing values.

4) To compare results from the existing methods of estimating missing values with results from the new methods, using simulated data.

Based on the results, recommendations are made.

The rationale for this study is to fill the gap in the existing methods of estimating missing values by providing analysts with a better method that applies irrespective of model structure and functional relationship.

The new methods proposed in this study assumed that the series are arranged in a Buys Ballot

Some of the existing methods of estimating missing values in time series analysis are the Mean Imputation (MI), Series Mean (SM), Linear Interpolation (LI) and Regression Imputation (RI). Assuming an observation $X_{(i-1)s+j}$ is missing in the Buys-Ballot table at a point, say $t = (i-1)s+j$, it is estimated using the different methods listed above as follows:

1) Mean Imputations (MI)

Mean imputation entails replacing the missing value with the mean of the values before the missing position. This is achieved by summing those values and dividing by the number of observations before the missing position.

$$\mathrm{MI} = \hat{X}_{(i-1)s+j} = \frac{1}{(i-1)s+j-1}\left[X_1 + X_2 + X_3 + \cdots + X_{(i-1)s+j-1}\right] = \frac{1}{n^*}\sum_{t=1}^{n^*} X_t \tag{2.1}$$

where $n^* = (i-1)s+j-1$ is the number of observations preceding the missing observation.

2) Series Mean (SM)

Series mean estimates the missing value with the mean of the remaining series. Symbolically, the series mean is given by

$$\mathrm{SM} = \hat{X}_{(i-1)s+j} = \frac{T^*_{..}}{n-1}, \tag{2.2}$$

where $n = ms$ and

$$T^*_{..} = \left[X_1 + X_2 + \cdots + X_{(i-1)s+j-1} + X_{(i-1)s+j+1} + \cdots + X_{ms}\right] \tag{2.3}$$

3) Linear Interpolation (LI)

This method of linear interpolation for estimating missing values is given by

$$\mathrm{LI} = \hat{X}_{(i-1)s+j} = \frac{1}{2}\left(X_{(i-1)s+j-1} + X_{(i-1)s+j+1}\right) \tag{2.4}$$

4) Regression Imputation (RI)

This method estimates the missing value by the estimate of the trend at the point of the missing value. The remaining values of the series are used to determine estimates of the trend parameters, and the estimate of the missing value at $t = (i-1)s+j$ is given as:

a) For Linear Trend:

$$\mathrm{RI} = \hat{X}_{(i-1)s+j} = \hat{a} + \hat{b}\left[(i-1)s+j\right] \tag{2.5}$$

b) For Quadratic Curve:

$$\mathrm{RI} = \hat{X}_{(i-1)s+j} = \hat{a} + \hat{b}\left[(i-1)s+j\right] + \hat{c}\left[(i-1)s+j\right]^2 \tag{2.6}$$

c) For Exponential Curve:

$$\mathrm{RI} = \hat{X}_{(i-1)s+j} = \hat{b}\, e^{\hat{c}\left[(i-1)s+j\right]} \tag{2.7}$$
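The four existing methods in Equations (2.1)-(2.7) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation (the paper used MINITAB): the function names are hypothetical, positions are 0-based (so position `k` corresponds to $t = k + 1$), the trend parameters are fitted by ordinary least squares with `numpy.polyfit`, and the exponential curve is fitted on the log scale, which requires positive data.

```python
import numpy as np

def mean_imputation(x, k):
    """MI (Eq. 2.1): mean of all observations before position k (needs k >= 1)."""
    return float(np.mean(x[:k]))

def series_mean(x, k):
    """SM (Eq. 2.2): mean of the remaining n - 1 observations."""
    return float(np.mean(np.delete(x, k)))

def linear_interpolation(x, k):
    """LI (Eq. 2.4): average of the two neighbours (needs 1 <= k <= n - 2)."""
    return 0.5 * (x[k - 1] + x[k + 1])

def regression_imputation(x, k, curve="linear"):
    """RI (Eqs. 2.5-2.7): trend fitted to the remaining values, evaluated at the gap."""
    x = np.asarray(x, dtype=float)
    t = np.arange(1, len(x) + 1, dtype=float)   # time index t = 1, ..., n
    keep = np.ones(len(x), dtype=bool)
    keep[k] = False                             # drop the missing position
    t_gap = t[k]
    if curve == "linear":
        b, a = np.polyfit(t[keep], x[keep], 1)
        return float(a + b * t_gap)
    if curve == "quadratic":
        c, b, a = np.polyfit(t[keep], x[keep], 2)
        return float(a + b * t_gap + c * t_gap ** 2)
    if curve == "exponential":
        # Assumption: fit log(x) = log(b) + c*t by least squares (x must be > 0).
        c, log_b = np.polyfit(t[keep], np.log(x[keep]), 1)
        return float(np.exp(log_b) * np.exp(c * t_gap))
    raise ValueError("curve must be 'linear', 'quadratic' or 'exponential'")
```

For a noise-free linear series such as `x = 1 + 2 * np.arange(1, 13)`, LI and RI recover the removed value exactly, while MI and SM do not, because they ignore the trend.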

The new methods proposed in this work are the Row Mean Imputation, Column Mean Imputation and Decomposing Without the Missing Value. The new methods are given as follows:

1) Row Mean Imputation (RMI)

The row mean imputation method computes the missing value as the mean of the remaining observations in the row (period) containing the missing value. Thus, the missing value is estimated by

$$\mathrm{RMI} = \hat{X}_{(i-1)s+j} = \frac{1}{s-1}\left[\sum_{u=1}^{j-1} X_{(i-1)s+u} + \sum_{u=j+1}^{s} X_{(i-1)s+u}\right] \tag{2.8}$$

2) Column Mean Imputation (CMI)

The column mean imputation method computes the estimate of the missing value as the mean of the remaining observations in the column (season) containing the missing value. Thus, the missing value is estimated as:

$$\mathrm{CMI} = \hat{X}_{(i-1)s+j} = \frac{1}{m-1}\left[\sum_{u=1}^{i-1} X_{(u-1)s+j} + \sum_{u=i+1}^{m} X_{(u-1)s+j}\right] \tag{2.9}$$

3) Decomposing Without the Missing Value (DWMV)

In this method, estimates of the trend parameters and seasonal indices obtained from the remaining observations using any of the methods of time series decomposition, are substituted into the expression for the missing value. Hence, the estimates of the missing values by this method are given by:

a) For the Additive model:

$$\hat{X}_{(i-1)s+j} = \hat{M}_{(i-1)s+j} + \hat{S}_j \tag{2.10}$$

b) For the Multiplicative model:

$$\hat{X}_{(i-1)s+j} = \hat{M}_{(i-1)s+j} \times \hat{S}_j \tag{2.11}$$

The trend-cycle components of the DWMV method for the linear, quadratic and exponential curves are:

i) Linear Trend

$$\hat{M}_{(i-1)s+j} = \hat{a} + \hat{b}\left[(i-1)s+j\right] \tag{2.12}$$

ii) Quadratic Curve

$$\hat{M}_{(i-1)s+j} = \hat{a} + \hat{b}\left[(i-1)s+j\right] + \hat{c}\left[(i-1)s+j\right]^2 \tag{2.13}$$

iii) Exponential Curve

$$\hat{M}_{(i-1)s+j} = \hat{b}\, e^{\hat{c}\left[(i-1)s+j\right]} \tag{2.14}$$
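The three proposed methods can be illustrated with a short Python sketch. It is a minimal example under stated assumptions rather than the authors' procedure: the function names are hypothetical, row and column indices `i` and `j` are 0-based, and the DWMV function covers only the Additive model with a linear trend (Equations (2.10) and (2.12)). Since the paper allows "any of the methods of time series decomposition", the decomposition used here (least-squares trend, then seasonal indices taken as centred means of the detrended values) is one possible choice.

```python
import numpy as np

def buys_ballot(x, s):
    """Arrange a series of length n = m*s as an m x s table (rows = periods)."""
    return np.asarray(x, dtype=float).reshape(-1, s)

def row_mean_imputation(x, s, i, j):
    """RMI (Eq. 2.8): mean of the other s - 1 values in row i."""
    row = buys_ballot(x, s)[i]
    return float(np.delete(row, j).mean())

def column_mean_imputation(x, s, i, j):
    """CMI (Eq. 2.9): mean of the other m - 1 values in column j."""
    col = buys_ballot(x, s)[:, j]
    return float(np.delete(col, i).mean())

def dwmv_additive_linear(x, s, i, j):
    """DWMV (Eqs. 2.10 and 2.12): trend plus seasonal index, both estimated
    from the remaining observations only."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    k = i * s + j                                # 0-based position of the gap
    t = np.arange(1, n + 1, dtype=float)
    keep = np.ones(n, dtype=bool)
    keep[k] = False
    # Assumption: linear trend fitted by ordinary least squares.
    b, a = np.polyfit(t[keep], x[keep], 1)
    # Assumption: seasonal index = mean detrended value of each season,
    # centred so that the s indices sum to zero.
    resid = x - (a + b * t)
    season = np.arange(n) % s
    idx = np.array([resid[keep & (season == q)].mean() for q in range(s)])
    idx -= idx.mean()
    return float(a + b * t[k] + idx[j])
```

With `x = np.arange(1, 13)` and `s = 4`, removing the value 7 (row 1, column 2) gives an RMI estimate of 19/3 and a CMI estimate of 7. RMI averages over seasons within one period and CMI averages over periods within one season, so each ignores one of the two components that DWMV models explicitly.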

To assess the performance of our estimation methods, accuracy measures are computed from the deviations of the estimates of the missing values from the actual values. The deviation of $\hat{X}_{(i-1)s+j}$ from the actual value $X_{(i-1)s+j}$ is given as:

$$\hat{e}_{(i-1)s+j} = X_{(i-1)s+j} - \hat{X}_{(i-1)s+j} \tag{2.15}$$

The accuracy measures discussed are the Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and Root Mean Square Error (RMSE) (Makridakis and Hibon, 1995). Given a data set of size $n = ms$, $n > 1$, we considered one missing value at a time at $m_0 < n$ different positions. These accuracy measures are defined as follows:

1) Mean Absolute Error (MAE)

The MAE is defined as

$$\mathrm{MAE} = \frac{1}{m_0}\sum_{k=1}^{m_0} \left|e_k\right| \tag{2.16}$$

2) Mean Absolute Percentage Error (MAPE)

The MAPE is defined as:

$$\mathrm{MAPE} = \left[\frac{1}{m_0}\sum_{k=1}^{m_0} \left|\frac{e_k}{X_k}\right|\right] \times 100 \tag{2.17}$$

3) Root Mean Square Error (RMSE)

This is calculated as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{m_0}\sum_{k=1}^{m_0} e_k^2} \tag{2.18}$$
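As a small illustration of Equations (2.15)-(2.18), the three measures can be computed from paired actual and estimated values as follows. The function name `accuracy_measures` is hypothetical, and the sketch assumes that no actual value is zero, since MAPE divides by $X_k$.

```python
import numpy as np

def accuracy_measures(actual, estimate):
    """Return (MAE, MAPE, RMSE) over the m0 positions treated as missing."""
    actual = np.asarray(actual, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    e = actual - estimate                          # deviations, Eq. (2.15)
    mae = float(np.mean(np.abs(e)))                # Eq. (2.16)
    mape = float(np.mean(np.abs(e / actual)) * 100)  # Eq. (2.17), in percent
    rmse = float(np.sqrt(np.mean(e ** 2)))         # Eq. (2.18)
    return mae, mape, rmse
```

For example, with actual values (10, 20, 40) and estimates (12, 18, 40), the deviations are (−2, 2, 0), giving MAE = 4/3, MAPE = 10% and RMSE = √(8/3).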

This section presents some empirical examples to illustrate the application of the methods of estimating missing values discussed in Section 2. The empirical examples consist of both simulated and real-life data. The simulated series consist of 106 data sets of 120 observations each, simulated from the Additive model, $X_t = M_t + S_t + e_t$, and the Multiplicative model, $X_t = M_t \times S_t \times e_t$, using MINITAB version 16.0. The trend-cycle components $M_t$ used are 1) Linear: $M_t = a + bt$ with $a = 1$ and $b = 2.0$, 2) Quadratic: $M_t = a + bt + ct^2$ with $a = 1$, $b = 2.0$ and $c = 3$, and 3) Exponential: $M_t = b e^{ct}$ with $b = 10$ and $c = 0.02$. In the Additive model, it is assumed that $e_t \sim N(0, 1)$, while in the Multiplicative model, it is assumed that $e_t \sim N(1, \sigma^2)$. The seasonal indices $S_j$, $j = 1, 2, \ldots, 12$, are as shown in

The summary of accuracy measures for the simulated Additive and Multiplicative models shown in

j | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
S_{j} (Add.) | −0.89 | −1.22 | 0.10 | −0.15 | −0.09 | 1.16 | 2.34 | 1.95 | 0.64 | −0.73 | −2.14 | −0.97 |
S_{j} (Mult.) | 0.91 | 0.88 | 1.00 | 0.98 | 0.98 | 1.12 | 1.26 | 1.20 | 1.05 | 0.92 | 0.80 | 0.90 |

Note: S_{j} (Add.) are the seasonal indices for the Additive model and S_{j} (Mult.) are the seasonal indices for the Multiplicative model.
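The simulation design described above can be reproduced approximately in Python. This is a sketch, not the authors' MINITAB procedure: the function names, the random seed and the NumPy generator are assumptions, and because the text does not state the value of $\sigma$ for the Multiplicative model, the default `sigma=0.1` is purely illustrative.

```python
import numpy as np

# Seasonal indices from the table above.
S_ADD = [-0.89, -1.22, 0.10, -0.15, -0.09, 1.16,
         2.34, 1.95, 0.64, -0.73, -2.14, -0.97]
S_MULT = [0.91, 0.88, 1.00, 0.98, 0.98, 1.12,
          1.26, 1.20, 1.05, 0.92, 0.80, 0.90]

def trend(t, curve):
    """Trend-cycle component M_t with the parameter values given in the text."""
    if curve == "linear":
        return 1 + 2.0 * t
    if curve == "quadratic":
        return 1 + 2.0 * t + 3 * t ** 2
    if curve == "exponential":
        return 10 * np.exp(0.02 * t)
    raise ValueError("curve must be 'linear', 'quadratic' or 'exponential'")

def simulate(curve, model, n=120, sigma=0.1, seed=0):
    """Simulate one series of length n (sigma is an assumed value)."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, n + 1, dtype=float)
    season = np.arange(n) % 12                   # season j of each time point
    m_t = trend(t, curve)
    if model == "additive":
        # X_t = M_t + S_t + e_t with e_t ~ N(0, 1)
        return m_t + np.array(S_ADD)[season] + rng.normal(0.0, 1.0, n)
    if model == "multiplicative":
        # X_t = M_t * S_t * e_t with e_t ~ N(1, sigma^2)
        return m_t * np.array(S_MULT)[season] * rng.normal(1.0, sigma, n)
    raise ValueError("model must be 'additive' or 'multiplicative'")
```

The tabulated indices satisfy the usual constraints: the additive indices sum to 0 and the multiplicative indices sum to 12. Calling `simulate("linear", "additive", seed=k)` for a range of seeds `k` would give a collection of series comparable in structure to the simulated data sets described in the text.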

Trend Component | Accuracy Measure | MI | SM | LI | RI | CMI | RMI | DWMV |
---|---|---|---|---|---|---|---|---|
Linear | MAE | 67.03 | 62.29 | 7.07 | 11.26 | 63.96 | 11.89 | 2.59 |
Linear | MAPE | 48.84 | 99.28 | 5.79 | 11.39 | 101.56 | 11.71 | 2.24 |
Linear | RMSE | 76.77 | 68.53 | 9.70 | 14.59 | 73.46 | 16.58 | 3.29 |
Quadratic | MAE | 10,335.21 | 10,591.89 | 819.45 | 982.11 | 8561.86 | 851.07 | 386.83 |
Quadratic | MAPE | 66.44 | 524.74 | 5.66 | 10.75 | 437.68 | 17.55 | 2.74 |
Quadratic | RMSE | 13,185.28 | 12,226.39 | 1221.91 | 1387.31 | 10,171.44 | 1114.69 | 585.70 |
Exponential | MAE | 21.61 | 22.07 | 2.37 | 3.43 | 20.74 | 4.24 | 1.03 |
Exponential | MAPE | 39.03 | 88.17 | 5.87 | 11.20 | 75.78 | 10.62 | 2.18 |
Exponential | RMSE | 28.03 | 26.09 | 3.17 | 4.37 | 24.03 | 5.25 | 1.54 |

Trend Component | Accuracy Measure | MI | SM | LI | RI | CMI | RMI | DWMV |
---|---|---|---|---|---|---|---|---|
Linear | MAE | 61.43 | 55.92 | 1.25 | 36.17 | 63.96 | 5.18 | 1.04 |
Linear | MAPE | 48.65 | 87.54 | 1.37 | 29.48 | 105.47 | 8.79 | 1.35 |
Linear | RMSE | 69.55 | 64.69 | 1.54 | 39.91 | 74.28 | 5.87 | 1.14 |
Quadratic | MAE | 9653.18 | 10,511.27 | 3.12 | 1.47 | 18,576.20 | 851.07 | 1.07 |
Quadratic | MAPE | 66.76 | 465.10 | 0.13 | 0.07 | 414.62 | 15.81 | 0.04 |
Quadratic | RMSE | 12,547.42 | 12,120.50 | 3.48 | 1.87 | 388.72 | 1114.69 | 1.18 |
Exponential | MAE | 20.03 | 22.05 | 1.25 | 1.42 | 18.47 | 2.25 | 0.95 |
Exponential | MAPE | 36.87 | 87.12 | 4.57 | 6.67 | 71.05 | 6.18 | 3.49 |
Exponential | RMSE | 26.81 | 26.14 | 1.54 | 1.82 | 21.62 | 3.10 | 1.08 |

Trend Component | Accuracy Measure | MI | SM | LI | RI | CMI | RMI | DWMV |
---|---|---|---|---|---|---|---|---|
Linear | MAE | 83.43 | 77.63 | 15.05 | 17.92 | 72.03 | 26.75 | 7.56 |
Linear | MAPE | 29.22 | 43.30 | 6.69 | 9.18 | 31.51 | 12.95 | 4.32 |
Linear | RMSE | 96.91 | 94.21 | 17.92 | 22.34 | 67.89 | 32.69 | 10.49 |

the selected trending curves, followed by the LI. Each estimation method (compared with the others using MAE, MAPE and RMSE) was consistent in its performance, showing only minimal variation across the 106 data sets simulated for this study. This implies that the DWMV method yielded the best estimates of the missing values (in terms of the accuracy measures) among the methods investigated in this work. This may be attributable to the fact that DWMV combines the effects of both the trending curve and the seasonal component in estimating the missing values. That DWMV takes the seasonality of the missing value into account is supported by the literature. For the real-life data, the DWMV method also outperformed the other methods of estimating missing values, even though the assumption of normally distributed error terms is not met in real-life data.

The results of the analysis indicate that for all trending curves and both model structures, DWMV yielded the best estimates of the missing values (in terms of the accuracy measures) when compared with both the existing methods and the two other proposed methods (RMI and CMI). This is perhaps because DWMV, unlike the other methods, combines the effects of both the trending curve and the seasonal indices. Cheema [

In view of this, it is recommended that the DWMV method be used in estimating missing values in time series analysis when one observation is missing at a time, until further studies prove otherwise. It is also recommended that this study be extended to cases where more than one data point is missing at a time, and that the effects of different sample sizes and distributions on the estimation of missing values be examined.

Iwueze, I.S., Nwogu, E.C., Nlebedim, V.U., Nwosu, U.I. and Chinyem, U.E. (2018) Comparison of Methods of Estimating Missing Values in Time Series. Open Journal of Statistics, 8, 390-399. https://doi.org/10.4236/ojs.2018.82025