Paper Menu >>
Journal Menu >>
|  J. Intelligent Learning Systems & Applications, 2009, 1, 1-27  Published Online November 2009 in SciRes (www.SciRP.org/journal/jilsa)  Copyright © 2009 SciRes                                                                                JILSA  Towards Real-World Applications of Online Learning  Spiral Recurrent Neural Networks  Rudolf SOLLACHER1, Huaien GAO2  1Siemens AG, Corporate Technology, Munich, Germany; 2Department of Computer Science, Ludwig-Maximilians University, Mu- nich, Germany.  Email: Rudolf.Sollacher@siemens.com, gheken@hotmail.com  Received October 21st; 2008; revised January 19th, 2009; accepted January 21st, 2009.  ABSTRACT  Distributed intelligent systems like self-o rganizing wireless sensor and actuator networks are suppo sed to work mostly  autonomous even und er changing  enviro nmenta l conditio ns. This requires ro bust and efficient self-learning  capabilities  implemen table on embedded sys tems with limited memory and computational power. We present a new solutio n called  Spiral Recurrent Neural Networks (SpiralRNN) with an online learning based on an extend ed Kalman filter and gradi- ents as in Real Time Recurrent Learning. We illustrate its stability and performance using artificial and real-life time  series and compare its prediction performance to other approaches. SpiralRNNs perform very stable and show an ac- curacy which is superior or similar to other state-of-the-art approaches. In a memory capacity evaluation the number  of simultaneously memorized and accurately retrievable trajectories of fixed length was counted. This capacity turned  out to be a linear function of the size of the recurrent hidden layer, with a memory-to-size ratio of 0.64 for shorter tra- jectories and 0.31 for longer trajectories. Finally, we describe two potential applications in building automation and  logistics and report on an implementation of online learning SpiralRNN on a wireless sensor platform under the  TinyOS embedded operating system.   Keywords: recurrent neural network, online learning, prediction, sensor actuator network 1. Introduction  Visions like the “intelligent factory” [1] or “ambient in- telligence” [2] rely heavily on embedded systems with a  high degree of autonomy. Of particular interest are en- ergy autarkic self-organizing wireless sensor networks  due to their simplified installation and robustness during  operation. These consist of so called “sensor nodes”,  hardware platforms with usually a battery, a microproc- essor, a transceiver module and one or more sensors.  They are able to communicate wirelessly with each other;  suitable protocols enable them to set up a network by  themselves. If necessary, data are relayed by other nodes  to a destination. Data may also be processed locally or  inside the network.   A very useful kind of data processing is forecasting  data. Examples for applications in building automation  and logistics are given at the end of this paper. Solutions  for this task have to fulfill at least the following require- ments in order to be really useful:   R1 The limited computational power and memory of  these sensor nodes puts a limit on the computa- tional complexity of the forecasting solution.   R2 As the environmen t of the sensor networ k is usu- ally unknown in adv ance th e solution  has to learn  a forecast model online and in an efficient way.   Neural networks, and in particular recurrent neural  networks, have pro ven their suitability at least for offline  learning forecast tasks. Examples can be found in [3] or  [4]. However, traditional approaches like Elman or stan- dard recurrent neural networks (SRN) [5], time delay  neural networks (TDNN) [6], block-diagonal recurrent  neural networks (BDRNN) [7] or echo state neural net- works (ESN) [8] have deficiencies with respect to the  requirements above. For a TDNN one has to specify the  length of the history to be considered as input. If this is  chosen too small, some time series like spike time series  with large inter-spike interv als cannot be predicted relia- bly. Elman networks avoid this problem due to their re- current hidden layer; this allows them to adjust their  short term memory in a data driven way by online learn- ing. However, their basically unbounded recurrent  weight matrix sometimes leads to dynamically unstable  behavior making them less attractive for online learning   RUDOLF SOLLACHER, HUAIEN GAO  2   purposes. Echo state networks try to cure this deficiency  by choosing random but constant weights for the recur- rent weight matrix, but such that its ab solute eigenvalues  are smaller than one; only the weights of the outpu t layer  are trained. Unfortunately, with this approach the recur- rent layer has to be large enough in order to provide a  rich reservoir of responses from which to construct the  output signals. Block diagonal recurrent neural networks  try to cure the dynamical instabilities by a specific struc- ture of the recurrent layer. The corresponding weight  matrix is zero except for 2×2 diagonal block matrices.  These blocks can be trained but there is a constraint on  their maximum eigenvalue. This constraint is either im- plemented implicitly via training or explicitly via a suit- able parametrization of the weights. For online learning  the scaled orthogonal version described in [7] is more  useful. However, this structure seemed to be too con- strained leading to poor performance for some prediction  problems. This motivated us to develop a new approach  called Spiral Recurrent Neural Network (SpiralRNN)  [9].   2. Spiral Recurrent Neural Networks  (SpiralRNN)  2.1 Architecture  Like other Recurrent Neural Network (RNN) models,  SpiralRNN consists of an input layer feeding into a re- current hidden layer and an output layer which receives  input from the recurrent hidden layer. The hidden neuron  states t and the output neuron states  satisfy the  following u pdate equations:   st x 11 () thidtinth gW W  ssxb id ) ( tout tout hWxsb                 (1 )  Here,  and  are bias vectors, matrices,   and  represent synaptic connections between  layers and  hid b out W out bin W hid W () g  and  are activation functions. Us- ing the data  instead of the previous network output   as input in Equation (1) corresponds to teacher  forcing; this is used when ever data are available.   ()h 1 ˆ xt 1t x The main difference to other approaches lies in the  structure of the recurrent hidden matrix . Motivated  by the analysis underlying the development of echo state  neural networks (ESN) [8], we have chosen a constraint  on the weights of the recurrent layer in order to guaran- tee stability, respectively the “echo state” property, while  still allowing the weights to be trained. With the follow- ing structure,   hid W 11 1 1 1 1 0... 0 …0 l hid l lll W                           (2)  with l the number of hidden nodes in the recurrent layer,  tanh( )[11] kk kl    ,k  trainable parameters  and a suitable fixed , one can guarantee a  bounded eigenvalue spectrum of  and thus stability  in the sense of the “echo state” property:    R Whid 11 11 ()(1 ll kk kk tanh l      )       (3)  The structure of  is best illustrated with Figure 1,  also giving an idea for the choice of the name “Spi- ralRNN”.  hid W It turned out that a block structure in case of several  inputs is a good compromise concerning computational  complexity and performance (see Figure 2). Although  the recurrent hidden layer has this block structure, all  input neurons are coupled to all hidden neurons; the  same holds f o r  t he  out put neurons.  2.2 Online Learning  Due to the second requirement mentioned in the intro- duction the learning of the neural network weights has to  be online. We understand online learning as a continuous  update of weights controlled only by the data. Realtime  Recurrent Learning (RTRL) is based on an approximate  error gradient taking into account all previous steps; it  sometimes shows slow or poor convergence. Particle  filtering and evolutionary approaches require much memory  Figure 1. Structure of SpiralRNN recurrent hidden layer  for 1-dimensional input; only the connections rooted in one  neuron are shown  Copyright © 2009 SciRes                                                                                JILSA   RUDOLF SOLLACHER, HUAIEN GAO  Copyright © 2009 SciRes                                                                                JILSA  3 Figure 2.SpiralRNN for 3-dimensional input; the recurrent hidden matrix is split into 3 blocks but all other layer s are fully  connected   The last line in Equation (6) is the gradient calculated  in real time recurrent learning (RTRL) [10]. This ap- proximation is justified if the weight vector is changing  slowly which holds at least for the later phase of the  converging learning process.   in order to provide reasonable behavior. As explained in  [9] the optimal choice is an ex tended Kalman filter (EKF)  based on gradients calculated like in Realtime Recurrent  Learning (s. appendix).   According to this derivation, the iterative update rule  for the weight vector , representing all weight matri- ces ,  and   and all bias vectors  and  , is the following:   t w out W in Whid Whid b out b It is worth mentioning that these iterative Kalman fil- ter equations for the update of expected weight values  and their covariance matrix are derived analytically un- der three reasonable approximations: 1) the distribution  of noise on the data is Gaussian, 2) the weights change  slowly and 3) the distribution of weights representing  their uncertainty is unimodal and also Gaussian. While 1)  is a standard assumption, the other two approximations  deserve some explanation: 2) is true if the process gener- ating the data is stationary and if the model is powerful  enough to reproduce the data; only in the initial phase of  learning and for very non-stationary processes this as- sumption might be questionable. Usually, 3) is only true  for a local neighborhood around the true weight vector;  in general, the distribution will be multi-modal implying  several competing weight vectors and thus models.  However, for complexity reasons we have to focus on  one set of weights; we have no guarantee that we get the  optimal model but we will find a “locally” optimal  model in the sense, that small changes of the weights  lead to a decrease in performance. As a consequence we  will end up with a distribution of different models in  repeated experiments with different initial conditions or  noise. The reader should notice, that there will be always  several degenerate solutions, because a renumbering of  †1 †† †1 1 1 () ww( x) ˆ tt t TT tt tttt Tt tt ttt PPQ PPPPRP PR x               (4)  t x  is the output vector of the Spiral RNN and  are  the data. Pt is the covariance matrix of  representing  its uncertainty and  is the incremental increase in this  uncertainty.  is the measurement error covariance  matrix.  represents the gradient of the output   w.r.t.  (notice the indexing):   ˆt x t w t Q t R  1t t x w 11 ttt tttt dd dd      xxs x wsww 1 t          (5)  2 1 11 1 1 111                   t t t t t t t t t t t t t t dw ds s s w s dw ds s s w s dw ds             (6)   RUDOLF SOLLACHER, HUAIEN GAO  4   the hidden neurons yields exactly the same dynamical  model but the weight vector is changed; the components  have just been reshuffled. This is actually an advantage  for learning because it is sufficient to reach one of these  degenerate solutions.   2.3 Performance  Spiral RNN have been compared to other architectures  like ESN, Elman networks and BDRNN [9]. TDNN is  not included due to the need to fix the number of historic  inputs in advance. The generalization ability of these  networks was tested with three artificial time series: A  spike time series with spike intervals of 21 time steps (20  zeros followed by a one), and the chaotic Mackey-Glass  and Lorenz time series. The one-dimensional Mackey-  Glass time series is given by   () () () 1( ) c ax t x t xt      bxt           (7)  with the parameters a = 0.2, b = 0.1, c = 10 and   =17.  The differential equation was solved numerically using  Euler’s method with step size 1. All time series values  before start time were set to 0.5. The time series was  normalized by its standard deviation.  The three-dimensional Lorenz time series is defined  by the following equations:   () x sy x                 (8)  ()yrzxy                 (9)  zxybz                    (10 )  The parameters were set to s = 16, r = 40, b = 6 and  the initial values were x0 = y0 = 0.1 and z0 = −0.1. Inte- gration was done using Euler’s method with step size  0.01. The data were finally scaled down by factor of 20.   All time series are corrupted - af ter normalization - by  normally distributed noise with a standard deviation of  0.01.   As the computational complexity depends mainly on  the online learning step we compared different networks  with roughly the same number of trainable parameters.  For the activation function in the hidden layer we took   and in the output layer we took the identity map.  The blocks in the diagonal of the recurrent weight matrix  for the BDRNN were parameterized as   tanh() tanh( )sin()tanh( )cos() tanh()cos( )tanh()sin( )    with a pair of trainable parameters  and     indep-  endently for each block.   For the ESN model, the recurrent weight matrix was  initialized with random numbers uniformly distributed in  the interval [−1,1] with a sparsity degree of 95%; the  matrix was rescaled such that the maximum absolute  eigenvalue was 0.8. All other weights are initialized with  random numbers uniformly distributed in [−0.1,0.1].   For SpiralRNNs, a value 11l    guarantees that the  maximum eigenvalue of the recurrent weight matrix is  smaller than one. However, the actual eigenvalues can be  much smaller than one. Experiments have shown good  and reliable performance with 1  .   The online learning method was the same as described  above for all networks with appropriate gradients de- pending on the architecture. For the Kalman filter, the  covariance matrix Pt was initialized with the identity  matrix Id and the model noise covariance matrix  Qt=10−8Id was chosen to be constant. The measurement  covariance matrix Rt was initialized to 10−2Id and up- dated according to    1 (1)e eT tt RR     t              (12 )  with ˆ ex x t  the forecast error at time t and 001   .   During online learning we used teacher forcing. After  some specific time steps we performed multi-step fore- cast tests using the output as input for the next time step.  After each test we returned to the beginning of the test  and continued online learning with teacher forcing.   The prediction performance during the tests is meas- ured with the logarithmic normalized mean square error  (logNMSE):   0 0 2 10 11 11 log () ˆ tt d it it itt xx dt 2             (13)  Here, d is the dimension of the data, t0 is the time be- fore the test phase starts, t   is the maximum number of  test steps and    is the standard deviation of the whole  data set. This definition of an error measure takes into  account that the prediction error in the test phases in- creases with time; as we do not want the error measure to  be dominated by the large errors of the future prediction  steps we have chosen a measure corresponding to a  geometric mean instead of a traditional algebraic mean.            (11)  For the spike time series we performed tests with a  duration of t  =2000 time steps. For this time series  many forecast models tend to predict a small constant  value close to 1/21. This can give a small error as this  Copyright © 2009 SciRes                                                                                JILSA    5 RUDOLF SOLLACHER, HUAIEN GAO value is close to zero for 20 out of 21 time steps. There- fore, we evaluated the error measure for this particular  time series only with the time steps when the spikes oc- curred; this provides a much more informative measure.  Figure 3 shows the results for networks with about  100 trainable weights. Obviously, SRN and SpiralRNN  can predict the spike time series at least after 104 time  steps (recall that this corresponds to about 500 spike pe- riods). On the other hand, ESN and BDRNN fail to pre- dict the spike trains properly. Not quite unexpectedly,  SRN, having more degrees of freedom, performs best.  However, the results for SRN include only those runs  which were successful; after longer training a significant  fraction of runs failed due to dynamical instabilities. An  encouraging observation is that SpiralRNN is pretty  close in performance to the successful SRN runs, taking  into account the logarithmic error measure.  Figure 4 shows the results for the Lorenz time series  with tests consisting of 200 time steps. Here, SRN,  BDRNN and SpiralRNN show similar behavior and  good prediction capabilities at least after longer training.  ESN outperforms the other approaches in the beginning  of the training phase. This can be explained by the fact  that the training of its output weights is basically a linear  regression task and the Kalman filter learning is essen- tially the optimal method. However, ESN can hardly  improve its performance even for long training periods.  Obviously, for a good generalization it is necessary to  adapt the recurrent hidden layer weights as well, a fea- ture shared by the three other architectures. The interme- diate increase of the error for all approaches can be ex- plained with the characteristics of the time series which  obviously explores new state space areas in these phases.  Another interesting result is the test with the Mackey-  Glass time series. Figure 5 shows the result for networks  with about 100 trainable weights and prediction tests for  up to 200 time steps. Here, SpiralRNN clearly outper- forms the other approaches, even SRN. For smaller net- work size with about 50 trainable weights, SRN is  slightly better than SpiralRNN, while both approaches  show almost as good behavior as SpiralRNN with about  100 weights; however, ESN and BDRNN perform simi- larly poor. An example of autonomous output of the Spi- ralRNN model on MackeyGlass time series is shown in  Figure 6 where the difference between autonomous out- put and target data can be hardly identified until around  time step 400.  2.4 Stability  Stability of a trained SpiralRNN model has been demon- strated by autonomous tests with deviated initial input  values. Recall that, in autonomous tests, the model uses  the previous output value as the current input value. The  only exception is the initial input value which is supplied  from outside.  Figure 7 shows the autonomous outputs generated by a  trained SpiralRNN model from the simulation with data  set Spike21. Each sub-figure in Figure 7 shows the  autonomous output result from simulation with different  initial input values {±0.1, ±1, ±10,  ±100}respectively.   Without exception, all models with different initial  inputs converge after an initial transition phase to the  trained attractor regime with periodic spike output with  period 21.   Figure 3. Modified logNMSE for the spike time series  with period 21. The networks have about 100 trainable  weights. The test points correspond to the times  33.544.55 {10 ,10,10 ,10,10 } Figure 4. LogNMSE for the Lorenz time series and predic- tion tests with 200 time steps. The networks have about 100  trainable weights. The test points correspond to the times  33.253.5 4.755 {10 ,10,10,...,10,10 } Copyright © 2009 SciRes                                                                                JILSA   RUDOLF SOLLACHER, HUAIEN GAO  Copyright © 2009 SciRes                                                                                JILSA  6    Figure 6. The autonomous output of a trained SpiralRNN  model vs. corresponding target data. The solid line is auton-  omous output, and the dashed line is target data  Figure 5. LogNMSE for the Mackey-Glass time series and  prediction tests with 200 time steps. The networks have  about 100 trainable weights. The test points correspond to  the times   33.544.55 {10 ,10,10 ,10,10 } (a) -0.1                                          (b) -1  (c) -10                                         (d) -100    7 RUDOLF SOLLACHER, HUAIEN GAO (e) 0.1                                        (f) 1  (g) 10                                       (h) 100  Figure 7. Stability test of a trained SpiralRNN model from the simulation with the Spike21 data set. The sub-figures show the  results for different initial input values {−0.1, −1, −10, −100, +0.1, +1, +10, +100}respectively. The X-axis is the time in the  autonomous tests, and the Y-axis is the output  As another example, Figure 8 depicts the trajectories of  two hidden-neuron activations during the autonomous  test with a trained SpiralRNN model from a simulation  with the Lorenz data set. In the test phase, the Spi- ralRNN model, is iterated autonomously for 1000 time  steps starting with different initial input values. During  the iteration, the recurrent layer neuron states are re- corded. Sub-figures in Figure 8 show the trajectories of  two neurons. The five initial time steps of the autono- mous test are marked by black squares. Again, after  some initial transition phase the trajectories always con- verge to an attractor corresponding to the Lorenz time  series. Note that the sub-figures have different scale in  order to display the initial transition path, and that the  position of the att r a c t o r   is i n  t h e  r a n ge  [ −0.2, 0.6] for  the X-axis and [−0.2, 0.6] for the Y-axis in all sub-fig- ures.  3. Application Examples  3.1 NN5 Competition  NN5 competition1 puts emphasis on computational intel- ligence methods in data prediction. The data for this  competition correspond to the amount of daily money  withdrawal from ATM machines across England. These  data exhibit strong periodical2 (ref. Figure 9) behavior:  1http://www.neural-forecasting-competition.com/  2Note that the Easter Fridays in 1996 to 1998 should have the indices  “19”, “376” and “753” in the given data and the Saturdays before  Christmas   da y of 1996 and 1997 have the indices “283” and  “ 648”.  Copyright © 2009 SciRes                                                                                JILSA   RUDOLF SOLLACHER, HUAIEN GAO  8   1 F  Strong weekly periodic behavior dominates the frequency spectrum, usually with higher values o n Thursday and/or Friday;    2 F  Important holidays such as the Christmas holi- days (including the New Year holiday) and the Easter holidays have a visible impact on the data;   3 F  Several of the time series such as time series No. 93 and No. 89 show strong seasonal behavior, i.e. a yearly period;   4 F  Some of the time series (like No. 26 and No. 48) show a sudden change in their statistics, e.g. a shif t in the mean value.   The dynamics associated with data have deterministic  and stochastic components. In general, they will not be  stationary, as for example more tourists are visiting this  area or a new shopping mall has opened. There are in  total 111 time series in the database, with each time se- ries representing the withdrawal from one ATM machine  and each data point in a particular time series indicating  the withdrawal amount of the day from the particular  ATM machine. All 111 time series were recorded from  the 18th March 1996 until 22nd March 1998, and there- fore contain 735 data points. The task of the competition  is to predict the withdrawal of each ATM machine from  23rd March 1998 to 17th May 1998, 56 data points in  total. The evaluation of prediction performance is based  on the so-called SMAPE error value defined in Equation  (14), where t x  and ˆt x  are respectively the predicted  output and data.  (a) {−4, −4, −4}                                (b) {−4, −4, +4}  (c) {−4, +4, −4}                                (d) {−4, +4, +4}  Copyright © 2009 SciRes                                                                                JILSA    9 RUDOLF SOLLACHER, HUAIEN GAO (e) {+4, −4, −4}                                (f) {+4, −4, +4}  (g) {+4, +4, −4}                                (h) {+4, +4, +4}  Figure 8. Stability test of a trained SpiralRNN model from simulations with the Lorenz data set. Each sub-figure shows the  trajectory of two hidden neurons during autonomous tests with different initial inputs. The starting point coordinates are  given beneath each sub-figure. The X-axis and the Y-axis respectively refer to activation values of these two hidden neurons.   The first five positions of the trajectory are denoted by black square marks  1ˆ100 ()2 ˆ ntt smape ttt x x E nx x    %       (14)  The competition did not require online learning. Nev- ertheless, we used this competition to test our approach  against sophisticated offline learning methods. As prior  knowledge was available through the data sets and as  computational complexity was not limited for this appli- cation, we have extended our approach in the following  way:    (a) data were rescaled and additional artificial in- puts were added pro viding periodicity information;    (b) a “committee of experts” was employed in or- der to reduce the prediction error due to local  suboptima;    (c) the EKF learning process was modified taking  into account the SMAPE error function.   Copyright © 2009 SciRes                                                                                JILSA   RUDOLF SOLLACHER, HUAIEN GAO  10  (a) sample 1  (b) sample 2  (c) sample 3  (d) sample 4  Figure 9. Sample data from NN5 competition dataset with  cross markers indicating Saturdays  The data presented to the neural network are mapped  to a useful range by the logarithm function. In order to  avoid singularities due to original zero values, they are  replaced with small positive random values. Additional  sinusoidal inputs are provided as a representation of cal- endar information. These additional inputs include:    Weekly behavior addressing feature 1 F: Refer to  Figure 10(a); the period is equal to 7.    Christmas and seasonal behavior addressing feature  2 F and 3 F: It is often observed from the dataset  that, right after the Christmas holiday, withdrawal of  money is low and then increases during the year, fi- nally reaching its summit value right before Christ- mas. Seasonal features do not prevail in the dataset,  but they do exist in several of them, e.g. time series  9, 88. As both are regular features with a yearly pe- riod, it makes sense to provide an additional input as  shown in Figure 10(b) which has period 365.    Easter holiday bump addressing feature 2 F: The  Easter holidays did not have as much impact on the  data dynamics as the Christmas holidays did, but it  shows an effect on the usage of ATM in some areas  (shown in some time series). Furthermore, as the  58-step prediction interval includes the Easter holi- days of year 19 98, the pr ediction over the holiday   Copyright © 2009 SciRes                                                                                JILSA    11 RUDOLF SOLLACHER, HUAIEN GAO  can be improved when the related data behavior is  learnt. This additional input uses the Gaus- sian-distribution-shape curve to emulate the Easter  holiday bump as in Figure 10(c).   (a) Weekly-input  (b) Christmas-input  (b) Easter-input  Figure 10. Additional inputs of neural networks. On the  X-axis is the time steps, and Y-axis is the additional input  value  Figure 11. Online training of SpiralRNN in NN5 competi- tion with additional inputs   Supplied with additional inputs, SpiralRNNs were  trained online (see Figure 11) such that data were fed-in  the network one-by-one and network parameters are  trained before the next data is fed-in.  Although SpiralRNN is capable of learning time series  prediction, the learned weights usually correspond to  local minima of the error landscape as mentioned in [11].  As computational complexity is not an issue for this  competition, a committee of experts ansatz is applied.   The committee of experts consists of several Spi- ralRNN models with identical structure but different  initialization of parameter values. Each SpiralRNN  model operates in parallel without any interference to the  others. They are trained online using teacher forcing. The  importance of an expert’s output is determined by the  average SMAPE value over the last 56 steps. Therefore,  at time step t=6791, each model k produces a prediction  xk,t for the next 56 steps using its output as the input for  next time step. After that, online learning continues until  the end of the training data. Then the SMAPE error value  for the last 56 time steps is calculated according to Equa- tion (15), where Es refers to the SMAPE error.   56 679 1( ˆ 56 kt kskt t Ex ) x                 (15)  1The value is calculated via: t = 795-56 = 679.  Copyright © 2009 SciRes                                                                                JILSA   RUDOLF SOLLACHER, HUAIEN GAO  12  The final prediction for the 56 time steps following t =  735 consists of a weighted average of the committee  members’ autonomous predictions according to   2 1 n k k                    (16)   2 1 1n ttk k k x x                  (17)  The whole procedure is listed in Table 1, and a sche- matic diagram Figure 12 depicts the organization of the  committee.  As a last step we have to adapt the calculation of the  gradient matrix  in Equation (5) to the error func- tion    Table 1. Committee of experts  0. Initialize the n experts;  1.  For a SpiralRNN model k, implement online  training with the data and make a 56-step  prediction Uk at time step t = 679. The pre- diction value Uk will be compared to the data  in order to obtain the average SMAPE error  value according to Equation (15).  2.  After the prediction at time step t = 679,  continue the online training till the end, and  produce another 56-step autonomous predic- tion Vk.  3. Based on their k  values, combine the pre- diction Pk according to Equation (17).  Figure 12. Committee of SpiralRNN experts where their  weights in committee are determined by their SMAPE val- ues on testing dataset  Figure 13. Comparison between result and data showing  weekly behavior for time series 35. Dashed line with circles  is the data and solid line with squares is the prediction. The  x-axis shows the time in days  in Equation (14) according to Equation (20) with  () t ylnx t   being the output and ˆ() ˆt tln y x  the data in  logarithmic scale:   ˆ exp()exp( ) tt Sy y                  (18 )  ˆ exp( )exp( )Dy tt y                 (19 )   exp( )() tt t t yy D sign D SS        w       (20)  For data sets with missing values we replace these  with predicted values and simultaneously skip the weight  update while still aggregating the gradient.   We have evaluated our approach using the last 56 data  of the available data while the remaining data were used  for online-training. In Figure 13, prediction and data are  compared for time series 35 which has a pronounced  weekly periodicity. Obviously, this periodicity can be  reproduced pretty well, even details like the small bump.  Seasonal behavior of data can also be learned as  shown in Figure 14. The curves in both sub-plots begin  with the values in Christmas holidays (with time indices  around 280 and 650). The rise and fall of the data during  the first about 100 days and a subsequent rise in Figure  14(a) can also be seen one year later in Figure 14(b).  Obviously, the model is able to capture this behaviour  pretty well.  Copyright © 2009 SciRes                                                                                JILSA    13 RUDOLF SOLLACHER, HUAIEN GAO Easter holidays can also be recognized by the trained  model as shown in Figure 15 for time series 110. Since  the expert-weight in the committee is determined by a   (a) seasonal-data previous year  (b) seasonal-data and prediction  Figure 14. Comparison between prediction and data for  time series 9 showing seasonal behavior. Dashed curve is  the data and solid curve is the prediction. The x-axis de- notes time measured in days. The upper figure shows the  same period one year before   Table 2. Statistic results. The committee-averaged   SMAPE error value of the expert committee on all 111 time  series with varying number of experts  Figure 15. Prediction result (solid curve) and data (dashed  curve) for time series 110 showing a peak around Easter.  The x-axis displays time in days  period not containing Easter holidays, the prediction for  them might be expected to be less accurate. Neverthe-  less, as shown in Figure 15, the corresponding bump in  the data has been learned well. The Easter holidays in  1996 to 1998 occur at times around 20, 375 and 755. In  Figure 15, the prediction for the Easter holidays 1998  shows a similar behaviour as the Easter holidays in 199 6  and 1997 with a pronounced peak.  Table 2 shows the SMAPE errors on the test set (i.e.  the data from the last 56 time steps) for a varying num- ber of experts. Obviously, the number of experts has no  significant influence on the average result. This indicates  that the learning process is quite reliable; in addition this  allows saving computational effort using a small com- mittee. Figure 16 shows the histogram of committee-  averaged SMAPE values over the 111 time series in the  dataset. The majority of the results have a SMAPE value  around 20.   Our approach has won the time series forecasting  competition at WCCI20081 for both the full data set and  a reduced data set consisting of 11 time series. Among  the competitors was also commercial software currently  implemented in the ATMs. Although a few sophisticated  statistical and neural n etwork  appro aches wh ich were no t  taken into account for the awards performed better, the  final results for the top 10 performing approaches are  quite close. The most important result for our approach is  that even with a remarkably simple extension our online  learning approach is competitive to world class statistical  and neural network solutions.  3.2 Mouse Tracking — A Toy Example  Besides using artificial data we also made experiments  1http://www.wcci2008.org / . Copyright © 2009 SciRes                                                                                JILSA   RUDOLF SOLLACHER, HUAIEN GAO  Copyright © 2009 SciRes                                                                                JILSA  14  with real-life data. For this purpose we recorded the po- sition of a computer mouse moved along a periodic tra- jectory by a person and trained a SpiralRNN online with  these data. The background for this experiment is e.g.  applications in logistics and factory automation where  the motion of transported goods or robot arms has to be  tracked and analyzed.   For the mouse tracking we did some preprocessing on  the data in order to become independent of the velocity  of the mouse. For th is pur pose we took  th e path- leng th of  the trajectory as the “time”-variable and interpolated the  position at equidistant length intervals. These positions -  after a suitable rescaling - are used as input for the Spiral  RNN during online training, trying to learn to predict the  next position on this discretized trajectory. Figure 17  illustrates such discretization. Figure 18 shows one par- ticular trial. The blue line is generated by a user moving  the mouse along a lying figure 8. After about thirteen  iterations of this pattern the online training is stopped  and the Spiral RNN is fed with its previous output thus  generating  the red trajectory. Obviously, th e Spiral RNN  has learned to reproduce the full trajectory. The mean  square error (MSE) is normalized to the variance of the  time series. The reader may notice that this pattern is  nontrivial in the sense that a certain amount of short-term  memory is necessary in order to select the correct direc- tion at the crossing.  Such a pattern produces 2-dimensional periodical time  series as targets for the neural networks. For this exam- ple we have also analyzed the frequency spec t rum of the  data and the predicted outputs of different neural net- work models. The result is given in Figure 19 with blue  dashed and green solid lines for the two time series re- spectively. SpiralRNN produces (refer to Figure 19(b))  autonomous output trajectories with a frequency spec- trum similar to the frequency spectrum of the target data  Figure 16. Histogram (over 111 time series) of committee-  average SMAPE error  Figure 17. Segmentation by boundary points in MouseTrac k ing . The hollowed arrow indicates the movement direction, the  dashed line is the movement path of the cursor, circles represent the position measured by computer in every 1/100 second,  and black dots refer to boundary points that segment the move ment path of cursor. The identical length between boundary  points is “0.02 ”. The positions of these boundary points  ar e take n as the training data at each “time” step    15 RUDOLF SOLLACHER, HUAIEN GAO Figure 18. Mouse tracking example: The noisy line in the upper figure is the trajectory generated by the user; the smooth line  is the test with the Spiral RNN trained online and iterated with its own output as next input. The lower figure displays the  logarithm of the normalized one-step prediction mean square error which decays along training  (refer to Figure 19(a)), while the other models cannot.  If the training time was too short, the network usually  runs into the closest of a set of fixed points; two of them  are usually located within the two loops of the Figure  eight. This experience is confirmed with more complex  pattern where the required training time increases corre- spondingly. One reason is the usually large amount of  noise due to manually moving the mouse. This noise  prevents establishing a large enough short-term memory  which may be required for correct prediction and favors  the development of fixed-point attractors.   Sharp turns in a trajectory are also difficult to repro- duce. Typical examples are triangles or squares. Even if  the network has learned to reproduce such a pattern, it  frequency analysis of original trajectory                         frequency analysis of spiralRNN  (a) Target                                 (b) SpiralRNN  Copyright © 2009 SciRes                                                                                JILSA   RUDOLF SOLLACHER, HUAIEN GAO  16  fre quency analysis of ESN                               frequency analysis of BDRNN  (c) ESN                                      (d) BDRNN  Figure 19. Frequency analysis of the output trajectory after 5000 training steps of different neural network models with  around 100 parameters. Note that the vector value at frequency “zero” is trivial and is therefore omitted in diagrams, and  that the period length of data is 130 therefore the dominant frequency is 008.0 130 1  always produces smooth turns and it takes much more  additional training to sharpen these turns.   3.3 Duty Cycle Adaptation in Wireless Sensor  Networks  In energy autarkic wireless sensor networks, achieving a  long lifetime requires the wireless sensors to switch off  individual components as often as possible. Most energy  is used by the transceiver in transmit mode as well as in  receive mode. Therefore, sensor nodes have organized  their activity in duty cycles consisting of an active period  and a passive period with most components switched off.  The longer the second passive period is, the smaller is  the duty cycle and the longer is the lifetime of the sensor  node.   One way of reducing the duty cycle effectively is to  skip listening to neighboring sensor nodes for their data  and instead use predicted values. For this reason, a sen- sor node does not have to transmit its own sensor values  to its neighbors as well. This implies that sensor nodes  can increase their inactivity period thus reducing the duty  cycle. Of course, the prediction error will increase also  with decreasing duty cycles. Therefore, there will be an  optimal choice concerning prediction error and energy  saving.   We analyzed this application example using a simula- tion of a room equipped with 10 temperature sensor  nodes and 4 heat sources as shown in Figure 20. The  varying strength of the heat sources was modeled by a  4-dim. chaotic Lorenz time series with 6000 time steps; a  short interval is shown in Figure 21. The diffusion of  heat from the heat sources into the room was modeled by  a simple diffusion model, discretized in space and time,  with random additive noise. An example for the resulting  sensor values is shown in Figure 22.  Each of the 10 sensor nodes is equipped with a Spi- ralRNN supposed to predict the sensor node’s measured  temperature value and those of the neighboring nodes.  The SpiralRNN had 100 trainable weights for this  evaluation. The neighbor nodes are those nodes which  have a direct communication link to the sensor node (see  Figure 20). In regular intervals, the sensor nodes broad- cast their current sensor value to their neighbors. One  complete exchange involving all 10 sensor nodes corre- sponds to one time step in the simulation.   In order to reduce the duty cycle, the sensor nodes do  not listen to their neighbors every second time step. This  saves energy for communication: with a normal duty  cycle, a sensor node with k neighbors has to run its trans-  ceiver 2*(k+1) times in two intervals; for a reduced duty  cycle it is 2+k times in two intervals, a reduction by a fa-  ctor of 2 22 k k  . With just one neighbor this is a factor of  Copyright © 2009 SciRes                                                                                JILSA    17 RUDOLF SOLLACHER, HUAIEN GAO Figure 20. Simulation environment with sensors (numbered circles) and heat sources (stars). Also shown are the communica- tion links between sensor node s  Figure 21. Example of time series of heat sources. Only one period is shown. Similar pattern occur in other periods  Figure 22. Example of measured temperature values at the different sensors. Only one period is shown. Similar pattern occur  in other periods.  Copyright © 2009 SciRes                                                                                JILSA   RUDOLF SOLLACHER, HUAIEN GAO  18  Figure 23. Comparison of relative prediction error for normal duty cycle (continuous line, circles) and reduced duty cycle  (dotted line, squares, listening only every second interval). Shown are mean, minimum and maximum of the rms error of 30  runs  0.75, with 4 neighbors it is a factor of 0.6, with 9  neighbors already 0.55 and quite close to the large-k  limit of 0.5.   However, the picture is not complete without looking  at the accuracy of the predicted instead of measured  sensor values. We performed 30 independent runs with  randomly chosen locations of sensors and heat sources,  random scaling of each of the heat sources with a factor  in the range [0,1] and additive noise of amplitude 0.01 at  each update of the heat diffusion on every grid point.  The result for the relative error - normalized for each  sensor to the variance of its sensor value time series - is  shown in Figure 23: For up to 5 pred iction steps the per- formance with the reduced duty cycle is almost the same  as for the normal duty cycle. The outliers with a normal- ized error larger than one are due to sensors with an al- most constant sensor value whose variation is mainly due  to noise. This result indicates that there is even more  room for duty cycle reduction for this example: Listening  to the neighbors only every fourth interval yields a duty  cycle reduction of 4 44 k k  ; with 4 neighbors this gives a  reduction factor of 0.4, with 9 neighbors it gives 0.325  and the large-k limit is 0.25.  4. Conditional Prediction  So far, SpiralRNN turned out to be excellent in online  learning the dynamics of stationary processes. However,  in reality we often have to deal with non-stationary  processes. A typical example are cars approaching a  crossing: without knowledge about their destination one  usually can predict their approaching the crossing and  their leaving again, but not whether and into which di- rection they turn at the crossing. Even our daily activities  consist of sequences of typical pattern like walking,  standing, talking, eating, and drinking and so on. Again,  without additional knowledge about the motivation of a  person usually it is impossible to predict the sequence of  these patterns, which can themselves be predicted very  well.   4.1 Application Example: Warehouse Logistics  We consider an application example from the logistics  domain: A warehouse stores goods of different type in  different areas (refer to Figure 24). Newly delivered  goods have to be transported to these areas by trucks; we  assume that these trucks always use the same trajectory  to a specific destination. Each truck is equipped with a  SpiralRNN learning the different trajectories. However,  different from previous examples we assume that there is  additional information available about the type of prod- uct; for example, this could be provided by a RFID  reader at the entrance registering newly delivered goods  carrying RFID tags with product information. After hav- ing learned the trajectories accompanied with this addi- tional information, the Spiral RNN just receives this ad- ditional information and has to recall the trajectory. Th is  can be used for diagnosis purposes (e.g. is it the correct  trajectory?) or for autonomous truck driving.   Copyright © 2009 SciRes                                                                                JILSA    19 RUDOLF SOLLACHER, HUAIEN GAO In addition to the current position coordinates, the in- put vector of the Spiral RNN now also stores an addi- tional trigger signal [12]. The network output is trained  to predict the changes in this vector. At occurrence of  each trigger signal, the hidden state of the network is  reset to zero and the input vector consists of some arbi- trary, but otherwise fixed position vector and of the trig- ger signal; the latter is kept at this initial value until an- other trigger signal indicates completion of the trajectory  or start of a new trajectory. At subsequent steps, the  network - while in online learning mode - learns to pre- dict the changes in position and change in the trigger  signal; the latter is actually zero. In recall mode, the pre- vious output added to the previous input together repre- senting the new position and trigger signal is fed back as  input for the next time step.   Figure 24.Application example “warehouse”: Depe nding on their type,  newly delivered goods have to be stored in different  areas labeled by A-F. The trajectories used by trucks are shown as well  Figure 25. Conditional prediction of trajectory to area A (circles). The starting point of the trajectory is at [0.4,0].  Copyright © 2009 SciRes                                                                                JILSA   RUDOLF SOLLACHER, HUAIEN GAO  20  Figure 26. Conditional prediction of trajectory to area B (circles). The starting point of the trajectory is at [0.4,0].  Figure 27. Conditional prediction of trajectory to area C (circles). The starting point of the trajectory is at [0.4,0].  Figure 28. Conditional prediction of trajectory to area D (circles). The starting point of the trajectory is at [0.4,0].  Copyright © 2009 SciRes                                                                                JILSA   RUDOLF SOLLACHER, HUAIEN GAO  Copyright © 2009 SciRes                                                                                JILSA  21 Figures 25-28 show the performance of a trained Spi- ralRNN on reproducing  the trajectories to the areas A, B,  C and D given just the initial trigger signal. The network  had 200 trainable weights and each trajectory was pre- sented 20 times until the root mean square error (RMSE)  for each trajectory was below a threshold of 0.05, i.e.  half grid spacing. Obviously, the trajectories have been  learned pretty well.  4.2 Storage Capacity  Encouraged by the good performance of this approach,  we investigated the storage capacity of SpiralRNN in  dependence on the number of hidden neurons and the  length of the pattern time series. For this purpose we  created random two-dimensional trajectories with each a  different but randomly chosen trigger signal value. We  considered trajectories of 6 and 11 steps. The perform- ance of a trajectory prediction is evaluated as above as  the MSE over the whole trajectory. We consider a num- ber of trajectories trainable if within at most 30 trials  with each consisting of 50 presentations of each trajec- tory the RMSE over all trajectories is at least once below  the threshold of 0.05. The whole process contains 5 steps,  as the following:   1) The experiment starts with small networks (The  value of nh starts from 6, such that the network  comprises 3 hidden units and each hidden unit has  2 hidden nodes); Training data is the combination  of np = 2 trajectory patterns.   2) With each pair of {nh ,np}, at most 30 simulations  are performed, each of which contains 50 training  rounds. In each  training  round, th e learning  model  is trained on-line with the presentation of these np  trajectory patterns, in a shuffled sequence but  each pattern only once. The hidden-state vector of  network is reset before the training with any tra- jectory pattern starts. After every 10 training  rounds, a testing round starts, trying to reproduce  all np trajectories in an autonomous manner sepa- rately.  3) The autonomous output results are evaluated. Let  lp be the length of trajectory patterns (the lp value  is identical to all pattern), xi,k be the ith-step ahead  prediction over the kth pattern, and ˆik x   be the  respective target, the evaluation error ε is calcu- lated as:   2 11 11 ˆ pp ln ik ik ik pp x x nl                   (21)  4) If it satisfies 2 005   , this particular Spi- ralRNN network with nh hidden nodes is claimed  to be capable of reproducing np number of trajec- tory patterns of length lp, and thus the associative  memory satisfies Ma ≥np. The model (with the  same structure but re- initialized values) will then  be assessed with a training data with more pat- terns, i.e. higher np value.  If 2 005   is not satisfied at least once within 30  simulations, it is claimed that SpiralRNN with nh  hidden nodes is not able to reproduce np number  of trajectory patterns with length lp. New experi- ment will start with same value in np but a larger  SpiralRNN model by increasing the nh value by 3.  Note that the threshold value is set to 0.05, half  the grid distance 0.1, so that the prediction can be  rounded to the near est grid point if the error is less  than 0.05.  5) The experiment are continued until simulations  with nh = 60 are finished.  Figures 29 and 30 show the dependence of the number  of trainable pattern on the number of hidden neurons.  Obviously, the graphs show an almost linear dependence.  For short pattern of length 6 we get a relation np / nh =  0.64 ± 0.04 and for long pattern of length 11 we get np  / nh = 0.31 ± 0.02.   5. Tiny OS Implementation  Finally we report on an implementation of SpiralRNN in  a typical wireless sensor platform. One example of such  a sensor platform is the TelosB[13] sensor mote from the  University of California at Berkley. It provides sensors  for temperature, light and humidity. A microprocessor  (Texas Instruments, MSP430, 16 bit, 8 MHz, 48kB ROM,  10kB RAM) allows to run software for communication,  data processing and energy management. A wireless  transceiver (Chipcon 2420, IEEE802.15.4, 2.4 GHz, 250  kbps) provides communication links to other sensor  motes or to a gateway.    RUDOLF SOLLACHER, HUAIEN GAO  22  Figure 29. Pattern storage capacity of SpiralRNN with dif- ferent number of hidden nodes and with patterns of length 6  Figure 30. Pattern storage capacity of SpiralRNN with di-  fferent number of hidden nodes and with patterns of length 11  Figure 31. Structure of a typical smart sensor mote. TinyOS is the specific operating system for embedded system such as  smart sensor and provides interfaces among hardwares and softwares. NesC language, as a compiler, compiles and builds  softwares for different applications   We have deployed TinyOS, a small footprint operating  system developed at the University Berkeley, on such  motes. Programs developed for TinyOS are written in the  programming language NesC [13]. Figure 31 shows the  typical software architecture of a sensor mote. It allows  implementing even advanced functionalities like analysis  and learning on such embedded devices.  In order to accelerate the task execution, we replaced  the floating point operations by suitable fix-point opera- tions. Thus we achieved a runtime for one online learn- ing iteration of the SpiralRNN with a single input/output  node and 6 hidden nodes (totally 24 parameters in the  neural network) of roughly 300 milliseconds, a reduction  of about a factor 8 compared to a floating point arith- metics implementation. Figure 32 shows a comparison of  the floating-point model and the fixed-point model with  respect to processing time and memory requirement  (RAM) for a TelosB sensor mote (both models have sin- Copyright © 2009 SciRes                                                                                JILSA    23 RUDOLF SOLLACHER, HUAIEN GAO gle input and output neuron). The limited memory ca- pacity (10 Kbytes) of such a sensor mote prevents de- ployment of floating-point model images with more than  36 parameters or running it with more than 32 parame- ters. In contrast, the fixed-point model takes less than  one second for each online learning step for networks  with up to 40 weight parameters (i.e. 10 hidden nodes).  Similarly, the fixed-point model requires less RAM al- lowing working with more complex neural networks.   It is also interesting to compare the processing time of  the pure feed-forward step with the complete online  learning step. Figure 33 compares two fixed-point mod- els with feed-forwarding (FFD) only and with full online  learning (FULL) including feed-forwarding, gradient  calculation and extended Kalman filter update. Obvi- ously, the full online learning step requires much more  time; while a pure feed-forward step requires only 11ms  for 48 weight parameters, the full online learning step  takes about 1500ms for the same network. As a conse- quence, one can reduce the number of learning steps  when the neural network is well trained and thus save  quite some energy.   We close this section with a comparison of a theoreti- cal estimate of the computational complexity with the  measured times for one update step. Measured in multi- ples of the summed effort for an average elementary  fix-point or floating point operation consisting of multi- plication, addition and potentially some copying, the  feed-forward iteration cost is proportional to 2 h n   with nh the number of hidden neurons  and d the dimension of the input and output. This can be  expressed in terms of nw=(2d+2)nh+d-1, the number of  weights, leading to nh=(nw-d+1)/(2 d+2). The effort for  the derivative calculation is proportional to   and the effort for the extended  Kalman filter updates is proportional to 23 2w dddn  (21) h dn 2 ( hh nn d dn ) w 2 w Although the complex of the matrix inversions in  Eq full online learning pared with the measured proc- es h dn 22 352 www ndndnn . ity uation (4) scales as 3 d, in practice, this is not the  dominant contribution asw nd≫ can be assumed. The  dominant contribution for tl online learning step  comes from the term 23 2 (2 2) hw w nn nd  of the gra- dient calculation of the Kal pure feed-forward step the leading term is  22 2 (2 2) hw nn d , i.e. a factor nw smaller than for the   step.   These estimates are com he ful man filter update. For the  sing times in Figure 34. Here, the processing times are  plotted against the estimated effort for d=1 and varying  nw. The linear relationships and the very similar propor- tionality coefficients confirm the theoretical estimates.  As a by-product we get a typical processing time per  elementary optimized floating-point operation of about  0.063ms.  (a) Processing time  (b) RAM requirement  Figure 32. Comparisons of processing time and RAM Re- amounts parameter in model  quirement between floating-point models and fixed-point  models that installed in TelosB sensor motes with different  Copyright © 2009 SciRes                                                                                JILSA   RUDOLF SOLLACHER, HUAIEN GAO  24  Figure 33. Comparison of processing time between models with feed-forward functionality only and with full function- hown excellent prediction perform- ance and stability during online learning making them a  rediction. Here, different temporal pattern are  ac ction tasks. It turned out that the  nu tation of SpiralRNN on a typical wireless sen- sor platform. A significant reduction of the runtime has  be ne learning rule can be derived from a Bayesian   the probability density  ality (including feed-forwarding, gradient calculation, ex- tended Kalman filter learning)  6. Conclusions  Spiral RNN have s good candidate for application in autonomous systems. A  first example are wireless sensor networks where Spi- ralRNN can be used to reduce duty cycles and thus en- ergy consumption leading to an increased battery life- time.   Another interesting and not so rare use case is condi- tional p companied by an additional constant signal which can  be derived from an initial trigger signal. This approach  allows training different pattern reliably provided a trig- ger signal is available. Such trigger signals can also be  used to suspend online training, e.g. in periods with just  noise on the input.   We have analyzed the storage capacity of SpiralRNN  in conditional predi mber of successfully trained temporal pattern is pro- portional to the number of hidden neurons. This result is  qualitatively similar to the memory capacity of Hopfield  networks [15]. As expected, the storage capacity for  longer pattern is smaller than the one for shorter pattern.  Further investigations might clarify whether there is a  typical storage capacity per hidden node and pattern  length.   Finally, we have reported about an embedded system  implemen en achieved by applying fix-point arithmetics. This  allows developing solutions based on online learning of  dynamics with reasonably complex models. A typical  example is duty cycle adaptation in wireless sensor net- works.  Appendix  The onli approach forˆ tt f wX of  the  weights w given all previous data 10tt t   xx x X.   ˆ{} ˆˆ ˆ (a) Feedforward  (b) Full online learning step  Figure 34. Processing time as a function of the theoretical  estimate of computational complexity for fixed-point im- plementation: fnline learning  step (bottom). The linear behavilmost identical  eedforward (top) and full o or and the a proportionality factors of 0.063 ms/operation indicate the  validity of the theoretical estimate  Copyright © 2009 SciRes                                                                                JILSA    25 RUDOLF SOLLACHER, HUAIEN GAO We start with a generic dynamic system:    1 1ˆt ttt   ssw x H             (22 )   1 ˆˆ tt tt t   sw xx G            (23)  t1tt   ww    Equation (22) describes the dynam system by a (hidden) state vector  of depends in some, usuallynlinear way on the  pr              (24)  ic evolution of the  t s at time t. The value  t s  at  no mo evious hidden state vector 1t s, on the previous ob- servation vector 1 ˆt x and on the del parameters t w.  Equion (23) describes the measurement process: ˆt x is  the vector of observations (sensor data) at time  t, (. .  . )G is the el estimation of these observations  based on t s, t w and 1 ˆt x. The variable t mod   the  measurement noise vector, assumed to be normally dis- trith a probability distribution function (p.d.f.)  is  ibuted w ( tt 0 ) t f R   NFinallyuation (24) descris the e-  volution of the model parameters: that the dynamics of  the environment within a reasonable time window are  . , Eqbe assumed to be stationary, i.e. the model parameters are  static, up to additive random fluctuations t  ; these are  assumed to be normally distributed with zero mean, co- variance matrix t Q and thus with a corresponding p.d.f.  (0) ttt f Q   N.   Using the Bayes rule, the chain rule and the Chap- man-KolmogorovEquation (16), the conditional p.d.f.   ˆ tt f wXfor the moderameters at time t given the current  an l pa d all previous observations 10 ˆ{} ˆˆ ˆ tt t  xx x X is  given by following equations:   1 11 ˆˆ ˆ ˆ ˆ t tt tt tt 1 ˆ ˆ ˆ t tt tt f w ff ff     x X X            (25)      wx ww x XX X 11 ˆˆˆ ˆˆ tt ttttt tt1t t f ffd  s      (26)     wswsw xx XXX 11 111 1 ˆˆˆ tttt ttt t fffd      wwww w XXX      (27)  Equation (25) is the famous Bayes rule, Equation (26)  introduces the hidden-state vector  into the “evi- dence” p.d.f. and Equation (27) is the “prio the model parameters . The last equation th t s r” p.d.f. for   introduces  t w e conditional p.d.f. 11 ˆ tt f wX which allows to interpret  Equation (25) to Equation (27) as iterative update equa- tions for ˆ tt f wX.   Before the on-line training rules from these update  equations are derivedas to introduce some as- sumptions and approximations for the specification of  each p.d.f , one h .:   111 ˆ11 () () ttt att f P     w Sw w XNis assumed to be  normally distributed with mean 1t  w and co- variance matrix 1t P. In general this p.d.f. will  be multi-modal; however, if the environment  doesn’t change its dynamics he m eco  too fast and if the  model is powerful enough, then todel pa- rameters should bme static and this assump- tion is justified.   11 ˆ1 ( ) tt tttt () b f Q      ww ww XN according to the a-  ssumption made in Equation (24).   S t with  ious  121 ˆ ()(( (...))) ˆˆ tt ttt ctt f      sw Ssww xx XHH the dots indicating an iterat e prev hidden states  ion of the function  ()H which further represents th   12... t   s according to uation (22). The Dirac delta-function   eq-  ()  corresponds to a normal distribution with in- simally small covariance matrix and re- flects the ass any uncertainty  comes from the measurements and from ( dom) changes of model parameters.   1 ˆ ˆtttt f finite umption that ran- () d  sw xX in Equation (26) is normal distributed  with mean value 1 () ˆt tt swx G and covariance  matrix t R according to Equation (23) S , i.e.     11 () ˆˆ tt ˆ ˆttt ttt t f R      sw xx NG .   the specifications  sw xX   Sa  b, Equation (27)  olution  normal distributions and thus f  S 1 ˆ tt  wX  With and  is a conv of can be easily calculated:   1 † 1 ˆ †1 ttt tt tt t f P PPQ         www XN         (28)  Copyright © 2009 SciRes                                                                                JILSA   RUDOLF SOLLACHER, HUAIEN GAO  26  Given the   -function lik.f. 1 f in specifica- tion   c S and the p.d.f. of 1 ˆ ˆttt t f sw xX in specification   d S, one finds:  e p.dˆ tt t swX   11 ˆ ˆˆˆ ttttt tt t f R            w xsw xx XNG       (29)    21 ˆˆ tt tt   ade in s t sww    xx HH       (30) The same assumption as mpecification   a S  11 ˆ tt f wX - namfor ely the slow change of the m ram when the model has been well trained -  ju er approximation: the function odel pa- eters t w stifies anot h ()  G can  be rized with respect to their dependence mters . In particular,   linea odel param on the  et w 11 ()() ˆtt tttt g swxww x        where the output vector t x and the gradient matrix  are respectively defined as:   (31)  †1 1and ˆt t t tt d     x xs wx G   (32)  1t d w Finally, the definition of is given here. pa † t s  tio  According  to Equation (30) and Equan (31), † t s should be the  hidden state vector calculated with the previous model  rameters 1t w. However, this requires a recalculation  of the whole history wheer the model parameters  change and therefore is not a viable slution fo learning in distributive sep Instead, the  app th nev nsor a 3)  he  o plication. r on-line  roximation in Equation (3is applied, which implies  at the model takes † t s as tflawless hidden value.     †21 21 ˆˆ tt tt t   sww xx HH       (33)  With the approximations in Equation (31), Equation  (32) and Equation (33), the calculation of 1 ˆ ˆttt X f  w x in  Equation (29) is rewritten as:    1 ˆ ˆ †1 1 1 (( ) ˆˆ () ttt tt t t t tt )R          x sw xx ww X NG ), the p. fw      (34)  Combining Equation (28) and equation (34d.f.  ˆ tt f wX  in Equation (25) ends up with a normal distribu-  Equation (35). Note that is computed n (28) and is the gradient mtrix.   tion in Equatio † t P  a  in    ˆ() ttt tt f NP   www X          35)       (  1ˆt tt tt t PR  x ww x 1T          (36)  on † 1 †† † tt TT tttt PP PPRP             (37)  Equation (35) is consistent with specificati    a S  whilst Equation (36) and Equation (37) constitute t extended Kalman filter (EKF) [16,17,18]).  REFERENCES  [1] T. Schott, “Global megatrends and their effects on production of the future,” in Conference Proceedings, 5th  wikipedia.org/wiki/Ambient_intelligence.  [3] X. Cai, N. Zhang, Grthy, and D. C. I.  Wunsch, “Timh recurrent neural  networks using a hybrid pso-ea algorithm,” in Proceed- f IEEE International  orks IJCNN ’05, R.  d   pp. 815–   analyzing and  nt  he   (for more details see   the  International Conf ere nce on Ind ustrial Informatics, Vienna,  Austria, Vol. 1, pp. 18, July 23-27, 2007, http://www.  indin2007.org/keynote_schott.php.  [2] Wikipedia, Ambient intelligence.   http://en. . Venayagamoo e series prediction wit ings of IEEE International Joint Conference on Neural  Networks, N. Zhang, Ed., Vol. 2, pp. 1647–1652, 2004.  [4] H-G. Zimmermann, H-G. Zimmermann, R. Grothmann,  A. Schafer, and C. Tietz, “Dynamical consistent recurrent  neural networks,” in Proceedings o Joint Conference on Neural Netw Grothmann, Ed., Vol. 3, pp. 1537–1541, 2005.  [5] J. L. Elman, “Finding structure in time,” Cognitive Sci- ence, Vol. 14, No. 2, pp. 179–211, 1990.  [6] A. Waibel, T. Hanazawa, G. Hinton, and K. Shikano,  “Phoneme recognition using time-delay neural net- works,” IEEE Transactions on Acoustics, Speech an Signal Processing, Vol. 37, No. 3, pp. 328–339, 1989.  [7] P. Mastorocostas and J. Theocharis, “On stable learning  algorithm for block-diagonal recurrent neural networks,  part 1: The RENNCOM algorithm,” IEEE International  Joint Conference on Neural Networks, Vol. 2, 820, 2004.  [8] H. Jäger, “The ‘echo state’ approach to training recurrent neural networks,” German National  Research Center for Information Technology, Technical  Report GMD 148, 2001.  [9] H. Gao, R. Sollacher, and H-P. Kriegel, “Spiral recurre neural network for online learning,” in Proceedings of the  15th European Symposium on Artificial Neural Networks,  Copyright © 2009 SciRes                                                                                JILSA   RUDOLF SOLLACHER, HUAIEN GAO  Copyright © 2009 SciRes                                                                                JILSA  27 ms and D. Zipser, “A learning algorithm for  networks,” in Proceedings of nal Intelligence and rna- d S. S.  ESANN’2007, April 25-27, 2007, Bruges, Belgium, M.  Verleysen, Ed., pp. 483–488, 2007.  [10] R. J. Willia continually running fully recurrent neural networks,”  Neural Computation, Vol. 1, pp. 270–280, 1989.  [11] R. Sollacher and H. Gao, “Efficient online learning with  spiral recurrent neural   Venk World Conference on Computational Intelligence WCCI ’08,  2008.  [12] H. Gao and R. Sollacher, “Conditional prediction of time  series using spiral recurrent neural network,” in Proceed- ings of the European Symposium on Artificial Neural  Networks-Advances in Computatio  stoc Learning, Bruges , Belg ium, April 23 -2 5 , 2008.  [13] J. Polastre, R. Szewczyk, and D. Culler, “Telos: Enabling  ultra-low power wireless research,” in The Fourth Inte tional Conference on Information Processing in Sensor Net- works: Special track on Platform Tools and Design Methods  for Network  Embedd ed Sens ors, p p. 364– 369, 2005 .  [14] D. Gay, P. Levis, R. von Behren, M. Welsh, E. Brewer,  and D. Culler, “The nesC language: A holistic approach to   networked embedded systems,” ACM SIGPLAN Notices,  Vol. 38, No. 5, pp. 1–11, 2003. http://www.tinyos.net/.  [15] R. J. McEliece, E. C. Posner, E. R. Rodemich, an atesh, “The capacity of the Hopfield associative  memory,” IEEE Tra nsact ions on Information Theory, Vol.  33, No. 4, pp. 461–482, 1987.  [16] F. Lewis, “Optimal estimation: With an introduction to hastic control theory,” A Wiley-Interscience Publica- tion, ISBN: 0-471-83741-5, 1986.  [17] R. Kalman, “A new approach to linear filtering and pre- diction problems,” Transactions of the ASME–Journal of  Basic Engineering, Vol. 82, pp. 35–45, 1960.  [18] G. Welch and G. Bishop, “An introduction to the Kalman  filter,” University of North Carolina at Chapel Hill, De- partment of Computer Science, Technical Report, pp.  95-041, 2002.  | 

