J. Intelligent Learning Systems & Applications, 2009, 1, 1-27
Published Online November 2009 in SciRes (www.SciRP.org/journal/jilsa)
Copyright © 2009 SciRes JILSA
Towards Real-World Applications of Online Learning
Spiral Recurrent Neural Networks
Rudolf SOLLACHER1, Huaien GAO2
1Siemens AG, Corporate Technology, Munich, Germany; 2Department of Computer Science, Ludwig-Maximilians University, Munich, Germany.
Email: Rudolf.Sollacher@siemens.com, gheken@hotmail.com
Received October 21st, 2008; revised January 19th, 2009; accepted January 21st, 2009.
ABSTRACT
Distributed intelligent systems like self-organizing wireless sensor and actuator networks are supposed to work mostly autonomously even under changing environmental conditions. This requires robust and efficient self-learning capabilities implementable on embedded systems with limited memory and computational power. We present a new solution called Spiral Recurrent Neural Networks (SpiralRNN) with an online learning scheme based on an extended Kalman filter and gradients as in Real Time Recurrent Learning. We illustrate its stability and performance using artificial and real-life time series and compare its prediction performance to other approaches. SpiralRNNs perform very stably and show an accuracy which is superior or similar to other state-of-the-art approaches. In a memory capacity evaluation the number of simultaneously memorized and accurately retrievable trajectories of fixed length was counted. This capacity turned out to be a linear function of the size of the recurrent hidden layer, with a memory-to-size ratio of 0.64 for shorter trajectories and 0.31 for longer trajectories. Finally, we describe two potential applications in building automation and logistics and report on an implementation of online learning SpiralRNNs on a wireless sensor platform under the TinyOS embedded operating system.
Keywords: recurrent neural network, online learning, prediction, sensor actuator network
1. Introduction
Visions like the “intelligent factory” [1] or “ambient in-
telligence” [2] rely heavily on embedded systems with a
high degree of autonomy. Of particular interest are en-
ergy autarkic self-organizing wireless sensor networks
due to their simplified installation and robustness during
operation. These consist of so called “sensor nodes”,
hardware platforms with usually a battery, a microproc-
essor, a transceiver module and one or more sensors.
They are able to communicate wirelessly with each other;
suitable protocols enable them to set up a network by
themselves. If necessary, data are relayed by other nodes
to a destination. Data may also be processed locally or
inside the network.
A very useful kind of data processing is the forecasting of data. Examples of applications in building automation and logistics are given at the end of this paper. Solutions for this task have to fulfill at least the following requirements in order to be really useful:
R1 The limited computational power and memory of
these sensor nodes puts a limit on the computa-
tional complexity of the forecasting solution.
R2 As the environment of the sensor network is usually unknown in advance, the solution has to learn a forecast model online and in an efficient way.
Neural networks, and in particular recurrent neural networks, have proven their suitability at least for offline learning forecast tasks. Examples can be found in [3] or
[4]. However, traditional approaches like Elman or stan-
dard recurrent neural networks (SRN) [5], time delay
neural networks (TDNN) [6], block-diagonal recurrent
neural networks (BDRNN) [7] or echo state neural net-
works (ESN) [8] have deficiencies with respect to the
requirements above. For a TDNN one has to specify the
length of the history to be considered as input. If this is
chosen too small, some time series like spike time series with large inter-spike intervals cannot be predicted reliably. Elman networks avoid this problem due to their recurrent hidden layer; this allows them to adjust their
short term memory in a data driven way by online learn-
ing. However, their basically unbounded recurrent
weight matrix sometimes leads to dynamically unstable
behavior making them less attractive for online learning
purposes. Echo state networks try to cure this deficiency
by choosing random but constant weights for the recurrent weight matrix, but such that its absolute eigenvalues are smaller than one; only the weights of the output layer are trained. Unfortunately, with this approach the recurrent layer has to be large enough in order to provide a rich reservoir of responses from which to construct the output signals. Block diagonal recurrent neural networks
try to cure the dynamical instabilities by a specific struc-
ture of the recurrent layer. The corresponding weight
matrix is zero except for 2×2 diagonal block matrices.
These blocks can be trained but there is a constraint on
their maximum eigenvalue. This constraint is either im-
plemented implicitly via training or explicitly via a suit-
able parametrization of the weights. For online learning
the scaled orthogonal version described in [7] is more
useful. However, this structure seemed to be too con-
strained leading to poor performance for some prediction
problems. This motivated us to develop a new approach
called Spiral Recurrent Neural Network (SpiralRNN)
[9].
2. Spiral Recurrent Neural Networks
(SpiralRNN)
2.1 Architecture
Like other Recurrent Neural Network (RNN) models,
SpiralRNN consists of an input layer feeding into a re-
current hidden layer and an output layer which receives
input from the recurrent hidden layer. The hidden neuron
states $s_t$ and the output neuron states $x_t$ satisfy the following update equations:

$$s_t = g\left(W_{hid}\, s_{t-1} + W_{in}\, \hat{x}_{t-1} + b_{hid}\right)$$
$$x_t = h\left(W_{out}\, s_t + b_{out}\right) \quad (1)$$

Here, $b_{hid}$ and $b_{out}$ are bias vectors, the matrices $W_{in}$, $W_{hid}$ and $W_{out}$ represent synaptic connections between layers, and $g(\cdot)$ and $h(\cdot)$ are activation functions. Using the data $\hat{x}_{t-1}$ instead of the previous network output $x_{t-1}$ as input in Equation (1) corresponds to teacher forcing; this is used whenever data are available.
The main difference to other approaches lies in the structure of the recurrent hidden matrix $W_{hid}$. Motivated by the analysis underlying the development of echo state neural networks (ESN) [8], we have chosen a constraint on the weights of the recurrent layer in order to guarantee stability, respectively the "echo state" property, while still allowing the weights to be trained. With the following structure,

$$W_{hid} = \beta \begin{pmatrix} 0 & \alpha_1 & \alpha_2 & \cdots & \alpha_{l-1} \\ \alpha_{l-1} & 0 & \alpha_1 & \cdots & \alpha_{l-2} \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ \alpha_1 & \alpha_2 & \cdots & \alpha_{l-1} & 0 \end{pmatrix} \quad (2)$$

with $l$ the number of hidden nodes in the recurrent layer, $\alpha_k = \tanh(\nu_k)$, $k \in \{1, \ldots, l-1\}$, the $\nu_k$ trainable parameters and a suitable fixed $\beta \in \mathbb{R}$, one can guarantee a bounded eigenvalue spectrum of $W_{hid}$ and thus stability in the sense of the "echo state" property:

$$\left|\lambda(W_{hid})\right| \le \beta \sum_{k=1}^{l-1} \left|\tanh(\nu_k)\right| < \beta\,(l-1) \quad (3)$$

The structure of $W_{hid}$ is best illustrated with Figure 1, also giving an idea for the choice of the name "SpiralRNN".
It turned out that a block structure in case of several
inputs is a good compromise concerning computational
complexity and performance (see Figure 2). Although
the recurrent hidden layer has this block structure, all
input neurons are coupled to all hidden neurons; the
same holds for the output neurons.
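To make the construction concrete, the following minimal NumPy sketch builds the spiral weight matrix of Equation (2), checks the eigenvalue bound of Equation (3) and performs one state update of Equation (1). The variable names, network size and initialization values are ours, chosen for illustration only:

```python
import numpy as np

def spiral_whid(nu, beta):
    """Build the spiral recurrent weight matrix of Equation (2) from
    trainable parameters nu (length l-1) and a fixed scale beta."""
    l = len(nu) + 1
    alpha = np.tanh(nu)            # bounded entries |alpha_k| < 1
    W = np.zeros((l, l))
    for i in range(l):
        for k in range(1, l):      # neuron i connects to neuron i+k (mod l)
            W[i, (i + k) % l] = alpha[k - 1]
    return beta * W

rng = np.random.default_rng(0)
l = 8
nu = rng.normal(size=l - 1)
beta = 1.0 / (l - 1)               # keeps all eigenvalues below one
W = spiral_whid(nu, beta)

# Eigenvalue bound of Equation (3):
assert np.max(np.abs(np.linalg.eigvals(W))) < 1.0

# One hidden-state update of Equation (1) with illustrative
# input weights and biases (teacher-forced input x_hat):
W_in = rng.normal(scale=0.1, size=(l, 1))
b_hid = np.zeros(l)
s = np.zeros(l)
x_hat = np.array([0.5])
s = np.tanh(W @ s + W_in @ x_hat + b_hid)
```

The zero diagonal and the shared parameters along the "spiral" connections are what keep the eigenvalue bound independent of the individual weight values.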
2.2 Online Learning
Due to the second requirement mentioned in the intro-
duction the learning of the neural network weights has to
be online. We understand online learning as a continuous
update of weights controlled only by the data. Realtime
Recurrent Learning (RTRL) is based on an approximate
error gradient taking into account all previous steps; it
sometimes shows slow or poor convergence. Particle filtering and evolutionary approaches require much memory in order to provide reasonable behavior. As explained in [9] the optimal choice is an extended Kalman filter (EKF) based on gradients calculated as in Realtime Recurrent Learning (see appendix).

Figure 1. Structure of SpiralRNN recurrent hidden layer for 1-dimensional input; only the connections rooted in one neuron are shown

Figure 2. SpiralRNN for 3-dimensional input; the recurrent hidden matrix is split into 3 blocks but all other layers are fully connected

According to this derivation, the iterative update rule for the weight vector $w_t$, representing all weight matrices $W_{in}$, $W_{hid}$ and $W_{out}$ and all bias vectors $b_{hid}$ and $b_{out}$, is the following:

$$P_t^{\dagger} = P_{t-1} + Q_t$$
$$P_t = P_t^{\dagger} - P_t^{\dagger} G_t \left( G_t^T P_t^{\dagger} G_t + R_t \right)^{-1} G_t^T P_t^{\dagger}$$
$$w_t = w_{t-1} + P_t G_t R_t^{-1} \left( \hat{x}_t - x_t \right) \quad (4)$$

Here $x_t$ is the output vector of the SpiralRNN and $\hat{x}_t$ are the data. $P_t$ is the covariance matrix of $w_t$ representing its uncertainty and $Q_t$ is the incremental increase in this uncertainty. $R_t$ is the measurement error covariance matrix. $G_t = \mathrm{d}x_t / \mathrm{d}w_{t-1}$ represents the gradient of the output w.r.t. $w_{t-1}$ (notice the indexing):

$$\frac{\mathrm{d}x_t}{\mathrm{d}w_{t-1}} = \frac{\partial x_t}{\partial w_{t-1}} + \frac{\partial x_t}{\partial s_t}\,\frac{\mathrm{d}s_t}{\mathrm{d}w_{t-1}} \quad (5)$$

$$\frac{\mathrm{d}s_t}{\mathrm{d}w_{t-1}} = \frac{\partial s_t}{\partial w_{t-1}} + \frac{\partial s_t}{\partial s_{t-1}}\,\frac{\mathrm{d}s_{t-1}}{\mathrm{d}w_{t-1}} \approx \frac{\partial s_t}{\partial w_{t-1}} + \frac{\partial s_t}{\partial s_{t-1}}\,\frac{\mathrm{d}s_{t-1}}{\mathrm{d}w_{t-2}} \quad (6)$$

The last line in Equation (6) is the gradient calculated in real time recurrent learning (RTRL) [10]. This approximation is justified if the weight vector is changing slowly, which holds at least for the later phase of the converging learning process.

It is worth mentioning that these iterative Kalman filter equations for the update of expected weight values and their covariance matrix are derived analytically under three reasonable approximations: 1) the distribution of noise on the data is Gaussian, 2) the weights change slowly and 3) the distribution of weights representing their uncertainty is unimodal and also Gaussian. While 1) is a standard assumption, the other two approximations deserve some explanation: 2) is true if the process generating the data is stationary and if the model is powerful enough to reproduce the data; only in the initial phase of learning and for very non-stationary processes might this assumption be questionable. Usually, 3) is only true for a local neighborhood around the true weight vector; in general, the distribution will be multi-modal, implying several competing weight vectors and thus models. However, for complexity reasons we have to focus on one set of weights; we have no guarantee that we get the optimal model, but we will find a "locally" optimal model in the sense that small changes of the weights lead to a decrease in performance. As a consequence we will end up with a distribution of different models in repeated experiments with different initial conditions or noise. The reader should notice that there will always be several degenerate solutions, because a renumbering of
the hidden neurons yields exactly the same dynamical
model but the weight vector is changed; the components
have just been reshuffled. This is actually an advantage
for learning because it is sufficient to reach one of these
degenerate solutions.
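The EKF weight update of Equation (4) can be sketched as follows. This is a simplified dense-matrix illustration with our own variable names and toy dimensions, not the authors' embedded implementation:

```python
import numpy as np

def ekf_step(w, P, G, x, x_hat, Q, R):
    """One EKF weight update as in Equation (4).

    w: weight vector (n,), P: weight covariance (n, n),
    G: gradient d x_t / d w_{t-1}, shape (n, d),
    x: model output (d,), x_hat: observed data (d,),
    Q: process noise (n, n), R: measurement noise (d, d)."""
    P_dag = P + Q                                # P†_t = P_{t-1} + Q_t
    S = G.T @ P_dag @ G + R                      # innovation covariance
    P_new = P_dag - P_dag @ G @ np.linalg.solve(S, G.T @ P_dag)
    w_new = w + P_new @ G @ np.linalg.solve(R, x_hat - x)
    return w_new, P_new

# Tiny illustration: 3 weights, 1-dimensional output
n, d = 3, 1
w = np.zeros(n)
P = np.eye(n)
G = np.array([[0.5], [0.1], [-0.2]])
x, x_hat = np.array([0.0]), np.array([1.0])
Q, R = 1e-8 * np.eye(n), 1e-2 * np.eye(d)
w, P = ekf_step(w, P, G, x, x_hat, Q, R)
```

Note that the covariance update shrinks $P$ in the directions where the gradient provides information, which is what makes the filter converge faster than a plain gradient step.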
2.3 Performance
SpiralRNNs have been compared to other architectures
like ESN, Elman networks and BDRNN [9]. TDNN is
not included due to the need to fix the number of historic
inputs in advance. The generalization ability of these
networks was tested with three artificial time series: A
spike time series with spike intervals of 21 time steps (20
zeros followed by a one), and the chaotic Mackey-Glass
and Lorenz time series. The one-dimensional Mackey-Glass time series is given by

$$\dot{x}(t) = \frac{a\, x(t-\tau)}{1 + x(t-\tau)^c} - b\, x(t) \quad (7)$$

with the parameters a = 0.2, b = 0.1, c = 10 and τ = 17.
The differential equation was solved numerically using
Euler’s method with step size 1. All time series values
before start time were set to 0.5. The time series was
normalized by its standard deviation.
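The generation procedure just described can be sketched as follows; the history-buffer handling for the delay τ is our own reconstruction of the stated setup (Euler's method with step size 1, pre-start values 0.5, normalization by the standard deviation):

```python
import numpy as np

def mackey_glass(n, a=0.2, b=0.1, c=10, tau=17, x0=0.5):
    """Euler integration of Equation (7) with step size 1; all
    values before the start time are set to x0."""
    x = np.full(n + tau, x0)                    # tau-step history buffer
    for t in range(tau, n + tau - 1):
        x_delayed = x[t - tau]
        x[t + 1] = x[t] + a * x_delayed / (1 + x_delayed**c) - b * x[t]
    series = x[tau:]
    return series / series.std()                # normalize by std deviation

series = mackey_glass(2000)
```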
The three-dimensional Lorenz time series is defined
by the following equations:
$$\dot{x} = s\,(y - x) \quad (8)$$
$$\dot{y} = (r - z)\,x - y \quad (9)$$
$$\dot{z} = x\,y - b\,z \quad (10)$$
The parameters were set to s = 16, r = 40, b = 6 and
the initial values were x0 = y0 = 0.1 and z0 = 0.1. Inte-
gration was done using Euler’s method with step size
0.01. The data were finally scaled down by a factor of 20.
All time series are corrupted, after normalization, by
normally distributed noise with a standard deviation of
0.01.
As the computational complexity depends mainly on
the online learning step we compared different networks
with roughly the same number of trainable parameters.
For the activation function in the hidden layer we took $\tanh(\cdot)$ and in the output layer we took the identity map. The blocks in the diagonal of the recurrent weight matrix for the BDRNN were parameterized as

$$\tanh(\rho)\begin{pmatrix} \sin(\phi) & \cos(\phi) \\ \cos(\phi) & -\sin(\phi) \end{pmatrix} \quad (11)$$

with a pair of trainable parameters $\rho$ and $\phi$ independently for each block.
For the ESN model, the recurrent weight matrix was initialized with random numbers uniformly distributed in the interval $[-1, 1]$ with a sparsity degree of 95%; the matrix was rescaled such that the maximum absolute eigenvalue was 0.8. All other weights are initialized with random numbers uniformly distributed in $[-0.1, 0.1]$.
For SpiralRNNs, a value $\beta \le 1/(l-1)$ guarantees that the maximum eigenvalue of the recurrent weight matrix is smaller than one. However, the actual eigenvalues can be much smaller than one. Experiments have shown good and reliable performance with $\beta = 1/(l-1)$.
The online learning method was the same as described above for all networks, with appropriate gradients depending on the architecture. For the Kalman filter, the covariance matrix $P_t$ was initialized with the identity matrix $Id$ and the model noise covariance matrix $Q_t = 10^{-8}\, Id$ was chosen to be constant. The measurement covariance matrix $R_t$ was initialized to $10^{-2}\, Id$ and updated according to

$$R_t = (1 - \zeta)\, R_{t-1} + \zeta\, e_t e_t^T \quad (12)$$

with $e_t = \hat{x}_t - x_t$ the forecast error at time t and $\zeta = 0.01$.
During online learning we used teacher forcing. After
some specific time steps we performed multi-step fore-
cast tests using the output as input for the next time step.
After each test we returned to the beginning of the test
and continued online learning with teacher forcing.
The prediction performance during the tests is measured with the logarithmic normalized mean square error (logNMSE):

$$E = \frac{1}{d}\sum_{i=1}^{d}\frac{1}{t_\Delta}\sum_{t=t_0+1}^{t_0+t_\Delta} \log_{10}\!\left(\frac{\left(x_{i,t}-\hat{x}_{i,t}\right)^2}{\sigma^2}\right) \quad (13)$$

Here, $d$ is the dimension of the data, $t_0$ is the time before the test phase starts, $t_\Delta$ is the maximum number of test steps and $\sigma$ is the standard deviation of the whole data set. This definition of an error measure takes into account that the prediction error in the test phases increases with time; as we do not want the error measure to be dominated by the large errors of the future prediction steps, we have chosen a measure corresponding to a geometric mean instead of a traditional algebraic mean.
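As an illustration, Equation (13) amounts to averaging the base-10 logarithm of the normalized squared errors over dimensions and test steps; a small sketch with our own variable names:

```python
import numpy as np

def log_nmse(x, x_hat, sigma):
    """logNMSE of Equation (13) for predictions x and data x_hat,
    both of shape (d, t_delta); sigma is the data's std deviation."""
    return np.mean(np.log10((x - x_hat) ** 2 / sigma ** 2))

# Example: two-dimensional prediction over 4 test steps with a
# constant error of 0.01 per component
x = np.array([[0.1, 0.2, 0.3, 0.4], [1.0, 1.1, 1.2, 1.3]])
x_hat = x + 0.01
print(log_nmse(x, x_hat, sigma=1.0))   # ≈ -4.0, since (0.01)^2 = 1e-4
```

Averaging logarithms rather than the errors themselves is exactly what makes this a geometric-mean measure.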
For the spike time series we performed tests with a duration of $t_\Delta = 2000$ time steps. For this time series
many forecast models tend to predict a small constant
value close to 1/21. This can give a small error as this
value is close to zero for 20 out of 21 time steps. There-
fore, we evaluated the error measure for this particular
time series only with the time steps when the spikes oc-
curred; this provides a much more informative measure.
Figure 3 shows the results for networks with about
100 trainable weights. Obviously, SRN and SpiralRNN
can predict the spike time series at least after $10^4$ time steps (recall that this corresponds to about 500 spike periods). On the other hand, ESN and BDRNN fail to predict the spike trains properly. Not quite unexpectedly,
SRN, having more degrees of freedom, performs best.
However, the results for SRN include only those runs
which were successful; after longer training a significant
fraction of runs failed due to dynamical instabilities. An
encouraging observation is that SpiralRNN is pretty
close in performance to the successful SRN runs, taking
into account the logarithmic error measure.
Figure 4 shows the results for the Lorenz time series
with tests consisting of 200 time steps. Here, SRN,
BDRNN and SpiralRNN show similar behavior and
good prediction capabilities at least after longer training.
ESN outperforms the other approaches in the beginning
of the training phase. This can be explained by the fact
that the training of its output weights is basically a linear
regression task and the Kalman filter learning is essen-
tially the optimal method. However, ESN can hardly
improve its performance even for long training periods.
Obviously, for a good generalization it is necessary to
adapt the recurrent hidden layer weights as well, a fea-
ture shared by the three other architectures. The interme-
diate increase of the error for all approaches can be ex-
plained with the characteristics of the time series which
obviously explores new state space areas in these phases.
Another interesting result is the test with the Mackey-
Glass time series. Figure 5 shows the result for networks
with about 100 trainable weights and prediction tests for
up to 200 time steps. Here, SpiralRNN clearly outper-
forms the other approaches, even SRN. For smaller net-
work size with about 50 trainable weights, SRN is
slightly better than SpiralRNN, while both approaches
show almost as good behavior as SpiralRNN with about
100 weights; however, ESN and BDRNN perform similarly poorly. An example of the autonomous output of the SpiralRNN model on the Mackey-Glass time series is shown in Figure 6, where the difference between autonomous output and target data can hardly be identified until around time step 400.
2.4 Stability
Stability of a trained SpiralRNN model has been demon-
strated by autonomous tests with deviated initial input
values. Recall that, in autonomous tests, the model uses
the previous output value as the current input value. The
only exception is the initial input value which is supplied
from outside.
Figure 7 shows the autonomous outputs generated by a
trained SpiralRNN model from the simulation with data
set Spike21. Each sub-figure in Figure 7 shows the
autonomous output from a simulation with a different initial input value: ±0.1, ±1, ±10, ±100, respectively.
Without exception, all models with different initial
inputs converge after an initial transition phase to the
trained attractor regime with periodic spike output with
period 21.
Figure 3. Modified logNMSE for the spike time series with period 21. The networks have about 100 trainable weights. The test points correspond to the times $\{10^3, 10^{3.5}, 10^4, 10^{4.5}, 10^5\}$
Figure 4. LogNMSE for the Lorenz time series and prediction tests with 200 time steps. The networks have about 100 trainable weights. The test points correspond to the times $\{10^3, 10^{3.25}, 10^{3.5}, \ldots, 10^{4.75}, 10^5\}$
Figure 6. The autonomous output of a trained SpiralRNN model vs. corresponding target data. The solid line is autonomous output, and the dashed line is target data
Figure 5. LogNMSE for the Mackey-Glass time series and prediction tests with 200 time steps. The networks have about 100 trainable weights. The test points correspond to the times $\{10^3, 10^{3.5}, 10^4, 10^{4.5}, 10^5\}$
(a) -0.1 (b) -1
(c) -10 (d) -100
(e) +0.1 (f) +1
(g) +10 (h) +100
Figure 7. Stability test of a trained SpiralRNN model from the simulation with the Spike21 data set. The sub-figures show the results for different initial input values −0.1, −1, −10, −100, +0.1, +1, +10, +100, respectively. The X-axis is the time in the autonomous tests, and the Y-axis is the output
As another example, Figure 8 depicts the trajectories of
two hidden-neuron activations during the autonomous
test with a trained SpiralRNN model from a simulation
with the Lorenz data set. In the test phase, the SpiralRNN model is iterated autonomously for 1000 time
steps starting with different initial input values. During
the iteration, the recurrent layer neuron states are re-
corded. Sub-figures in Figure 8 show the trajectories of
two neurons. The five initial time steps of the autono-
mous test are marked by black squares. Again, after
some initial transition phase the trajectories always con-
verge to an attractor corresponding to the Lorenz time
series. Note that the sub-figures have different scale in
order to display the initial transition path, and that the
position of the att r a c t o r is i n t h e r a n ge [ 0.2, 0.6] for
the X-axis and [0.2, 0.6] for the Y-axis in all sub-fig-
ures.
3. Application Examples
3.1 NN5 Competition
The NN5 competition1 puts emphasis on computational intelligence methods in data prediction. The data for this competition correspond to the amount of daily money withdrawal from ATM machines across England. These data exhibit strongly periodic behavior2 (see Figure 9):
1http://www.neural-forecasting-competition.com/
2Note that the Easter Fridays in 1996 to 1998 should have the indices "19", "376" and "753" in the given data and the Saturdays before Christmas day of 1996 and 1997 have the indices "283" and "648".
F1 Strong weekly periodic behavior dominates the frequency spectrum, usually with higher values on Thursday and/or Friday;
F2 Important holidays such as the Christmas holidays (including the New Year holiday) and the Easter holidays have a visible impact on the data;
F3 Several of the time series, such as time series No. 93 and No. 89, show strong seasonal behavior, i.e. a yearly period;
F4 Some of the time series (like No. 26 and No. 48) show a sudden change in their statistics, e.g. a shift in the mean value.
The dynamics associated with data have deterministic
and stochastic components. In general, they will not be
stationary, as for example more tourists are visiting this
area or a new shopping mall has opened. There are in
total 111 time series in the database, with each time se-
ries representing the withdrawal from one ATM machine
and each data point in a particular time series indicating
the withdrawal amount of the day from the particular
ATM machine. All 111 time series were recorded from
the 18th March 1996 until 22nd March 1998, and there-
fore contain 735 data points. The task of the competition
is to predict the withdrawal of each ATM machine from
23rd March 1998 to 17th May 1998, 56 data points in
total. The evaluation of prediction performance is based
on the so-called SMAPE error value defined in Equation (14), where $x_t$ and $\hat{x}_t$ are respectively the predicted output and data.
(a) −4, −4, −4 (b) −4, −4, +4
(c) −4, +4, −4 (d) −4, +4, +4
(e) +4, −4, −4 (f) +4, −4, +4
(g) +4, +4, −4 (h) +4, +4, +4
Figure 8. Stability test of a trained SpiralRNN model from simulations with the Lorenz data set. Each sub-figure shows the
trajectory of two hidden neurons during autonomous tests with different initial inputs. The starting point coordinates are
given beneath each sub-figure. The X-axis and the Y-axis respectively refer to activation values of these two hidden neurons.
The first five positions of the trajectory are denoted by black square marks
$$E_{smape} = \frac{1}{n}\sum_{t=1}^{n} \frac{\left|x_t - \hat{x}_t\right|}{\left(x_t + \hat{x}_t\right)/2} \cdot 100\% \quad (14)$$
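A direct transcription of the SMAPE error of Equation (14), in percent (variable names are ours):

```python
import numpy as np

def smape(x, x_hat):
    """SMAPE error of Equation (14), in percent: the absolute error
    normalized by the mean of prediction and data at each step."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    return np.mean(np.abs(x - x_hat) / ((x + x_hat) / 2)) * 100.0

print(smape([10.0, 20.0], [10.0, 10.0]))   # ≈ 33.33
```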
The competition did not require online learning. Nevertheless, we used this competition to test our approach against sophisticated offline learning methods. As prior knowledge was available through the data sets and as computational complexity was not limited for this application, we have extended our approach in the following way:
(a) data were rescaled and additional artificial inputs were added providing periodicity information;
(b) a "committee of experts" was employed in order to reduce the prediction error due to local suboptima;
(c) the EKF learning process was modified taking into account the SMAPE error function.
Figure 9. Sample data from the NN5 competition dataset (four sample time series) with cross markers indicating Saturdays
The data presented to the neural network are mapped
to a useful range by the logarithm function. In order to
avoid singularities due to original zero values, they are
replaced with small positive random values. Additional
sinusoidal inputs are provided as a representation of cal-
endar information. These additional inputs include:
Weekly behavior addressing feature F1: Refer to Figure 10(a); the period is equal to 7.
Christmas and seasonal behavior addressing features F2 and F3: It is often observed from the dataset that, right after the Christmas holiday, withdrawal of money is low and then increases during the year, finally reaching its summit value right before Christmas. Seasonal features do not prevail in the dataset, but they do exist in several of them, e.g. time series 9, 88. As both are regular features with a yearly period, it makes sense to provide an additional input as shown in Figure 10(b) which has period 365.
Easter holiday bump addressing feature F2: The Easter holidays did not have as much impact on the data dynamics as the Christmas holidays did, but it shows an effect on the usage of ATMs in some areas (shown in some time series). Furthermore, as the 56-step prediction interval includes the Easter holidays of year 1998, the prediction over the holiday can be improved when the related data behavior is learnt. This additional input uses a Gaussian-distribution-shaped curve to emulate the Easter holiday bump as in Figure 10(c).
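The three auxiliary inputs described above can be sketched as follows. The Easter-bump position and width used here are illustrative assumptions, not the values used in the competition entry:

```python
import numpy as np

def calendar_inputs(t, easter_index=755, easter_width=7.0):
    """Auxiliary calendar inputs (our reconstruction of Figure 10):
    a weekly sinusoid (period 7), a yearly sinusoid (period 365)
    and a Gaussian-shaped Easter bump."""
    t = np.asarray(t, float)
    weekly = np.sin(2 * np.pi * t / 7)
    yearly = np.sin(2 * np.pi * t / 365)
    easter = np.exp(-0.5 * ((t - easter_index) / easter_width) ** 2)
    return np.stack([weekly, yearly, easter], axis=-1)

# One auxiliary-input vector per day of the 735-day training range
aux = calendar_inputs(np.arange(735))
```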
(a) Weekly-input
(b) Christmas-input
(c) Easter-input
Figure 10. Additional inputs of neural networks. The X-axis shows the time steps, and the Y-axis the additional input value
Figure 11. Online training of SpiralRNN in NN5 competi-
tion with additional inputs
Supplied with additional inputs, SpiralRNNs were trained online (see Figure 11) such that the data were fed to the network one by one and the network parameters were updated before the next data point was presented.
Although SpiralRNN is capable of learning time series
prediction, the learned weights usually correspond to
local minima of the error landscape as mentioned in [11].
As computational complexity is not an issue for this
competition, a committee of experts ansatz is applied.
The committee of experts consists of several Spi-
ralRNN models with identical structure but different
initialization of parameter values. Each SpiralRNN
model operates in parallel without any interference to the
others. They are trained online using teacher forcing. The
importance of an expert's output is determined by the average SMAPE value over the last 56 steps. Therefore, at time step t = 679¹, each model k produces a prediction $x_{k,t}$ for the next 56 steps using its output as the input for the next time step. After that, online learning continues until the end of the training data. Then the SMAPE error value for the last 56 time steps is calculated according to Equation (15), where $E_s$ refers to the SMAPE error:

$$E_k = \frac{1}{56}\sum_{t=680}^{735} E_s\!\left(x_{k,t}, \hat{x}_t\right) \quad (15)$$
1The value is calculated via: t = 735 − 56 = 679.
The final prediction for the 56 time steps following t = 735 consists of a weighted average of the committee members' autonomous predictions according to

$$\lambda = \sum_{k=1}^{n} E_k^{-2} \quad (16)$$

$$x_t = \frac{1}{\lambda}\sum_{k=1}^{n} E_k^{-2}\, x_{k,t} \quad (17)$$
The whole procedure is listed in Table 1, and a sche-
matic diagram Figure 12 depicts the organization of the
committee.
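The committee combination of Equations (16) and (17), i.e. weighting each expert's prediction by the inverse square of its SMAPE error, can be sketched as follows (illustrative values, variable names are ours):

```python
import numpy as np

def committee_prediction(errors, predictions):
    """Weighted committee average of Equations (16)-(17): each expert's
    prediction is weighted by the inverse square of its SMAPE error."""
    weights = 1.0 / np.asarray(errors, float) ** 2
    weights /= weights.sum()                     # normalize by lambda
    return np.tensordot(weights, np.asarray(predictions, float), axes=1)

# Three experts with 56-step predictions (illustrative values)
errors = [10.0, 20.0, 40.0]                      # SMAPE per expert
predictions = np.ones((3, 56)) * np.array([[1.0], [2.0], [4.0]])
final = committee_prediction(errors, predictions)
```

With these values the best expert (error 10) dominates the average, as intended by the inverse-square weighting.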
Table 1. Committee of experts

0. Initialize the n experts;
1. For a SpiralRNN model k, implement online training with the data and make a 56-step prediction Uk at time step t = 679. The prediction value Uk will be compared to the data in order to obtain the average SMAPE error value according to Equation (15).
2. After the prediction at time step t = 679, continue the online training till the end, and produce another 56-step autonomous prediction Vk.
3. Based on their $E_k$ values, combine the predictions Pk according to Equation (17).

Figure 12. Committee of SpiralRNN experts, where their weights in the committee are determined by their SMAPE values on the testing dataset

Figure 13. Comparison between result and data showing weekly behavior for time series 35. Dashed line with circles is the data and solid line with squares is the prediction. The x-axis shows the time in days

As a last step we have to adapt the calculation of the gradient matrix $\partial x_t / \partial w$ in Equation (5) to the error function in Equation (14) according to Equation (20), with $y_t = \ln(x_t)$ being the output and $\hat{y}_t = \ln(\hat{x}_t)$ the data in logarithmic scale:

$$S = \exp(y_t) + \exp(\hat{y}_t) \quad (18)$$

$$D = \exp(y_t) - \exp(\hat{y}_t) \quad (19)$$

$$\frac{\partial E_t}{\partial w} = \mathrm{sign}(D)\,\frac{\exp(y_t)}{S}\left(1 - \frac{D}{S}\right)\frac{\partial y_t}{\partial w} \quad (20)$$
For data sets with missing values we replace these
with predicted values and simultaneously skip the weight
update while still aggregating the gradient.
We have evaluated our approach using the last 56 data
of the available data while the remaining data were used
for online-training. In Figure 13, prediction and data are
compared for time series 35 which has a pronounced
weekly periodicity. Obviously, this periodicity can be
reproduced pretty well, even details like the small bump.
Seasonal behavior of data can also be learned as
shown in Figure 14. The curves in both sub-plots begin
with the values in Christmas holidays (with time indices
around 280 and 650). The rise and fall of the data during
the first about 100 days and a subsequent rise in Figure
14(a) can also be seen one year later in Figure 14(b).
Obviously, the model is able to capture this behaviour
pretty well.
Easter holidays can also be recognized by the trained model, as shown in Figure 15 for time series 110. Since the expert weight in the committee is determined by a period not containing Easter holidays, the prediction for them might be expected to be less accurate. Nevertheless, as shown in Figure 15, the corresponding bump in the data has been learned well. The Easter holidays in 1996 to 1998 occur at times around 20, 375 and 755. In Figure 15, the prediction for the Easter holidays 1998 shows a similar behaviour as the Easter holidays in 1996 and 1997 with a pronounced peak.

(a) seasonal data, previous year (b) seasonal data and prediction

Figure 14. Comparison between prediction and data for time series 9 showing seasonal behavior. Dashed curve is the data and solid curve is the prediction. The x-axis denotes time measured in days. The upper figure shows the same period one year before

Table 2. Statistical results: the committee-averaged SMAPE error value of the expert committee on all 111 time series with varying number of experts

Figure 15. Prediction result (solid curve) and data (dashed curve) for time series 110 showing a peak around Easter. The x-axis displays time in days
Table 2 shows the SMAPE errors on the test set (i.e.
the data from the last 56 time steps) for a varying num-
ber of experts. Obviously, the number of experts has no
significant influence on the average result. This indicates
that the learning process is quite reliable; in addition this
allows saving computational effort using a small com-
mittee. Figure 16 shows the histogram of committee-
averaged SMAPE values over the 111 time series in the
dataset. The majority of the results have a SMAPE value
around 20.
Our approach has won the time series forecasting competition at WCCI 2008¹ for both the full data set and a reduced data set consisting of 11 time series. Among the competitors was also commercial software currently implemented in the ATMs. Although a few sophisticated statistical and neural network approaches, which were not taken into account for the awards, performed better, the final results for the top 10 performing approaches are quite close. The most important result for our approach is that even with a remarkably simple extension our online learning approach is competitive with world class statistical and neural network solutions.
3.2 Mouse Tracking — A Toy Example
Besides using artificial data we also made experiments
1http://www.wcci2008.org/
with real-life data. For this purpose we recorded the position of a computer mouse moved along a periodic trajectory by a person and trained a SpiralRNN online with these data. The background for this experiment is, for example, applications in logistics and factory automation where the motion of transported goods or robot arms has to be tracked and analyzed.
For the mouse tracking we did some preprocessing of the data in order to become independent of the velocity of the mouse. For this purpose we took the path length of the trajectory as the "time" variable and interpolated the position at equidistant length intervals. These positions, after a suitable rescaling, are used as input for the SpiralRNN during online training, which tries to learn to predict the next position on this discretized trajectory. Figure 17 illustrates this discretization. Figure 18 shows one particular trial. The blue line is generated by a user moving the mouse along a lying figure eight. After about thirteen iterations of this pattern the online training is stopped and the SpiralRNN is fed with its previous output, thus generating the red trajectory. Obviously, the SpiralRNN has learned to reproduce the full trajectory. The mean square error (MSE) is normalized to the variance of the time series. The reader may notice that this pattern is nontrivial in the sense that a certain amount of short-term memory is necessary in order to select the correct direction at the crossing.
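The path-length preprocessing described above can be sketched as follows. The function name and the linear interpolation between recorded samples are illustrative choices, not the authors' code:

```python
import math

def resample_by_arclength(points, step):
    """Re-parameterize a 2-D trajectory by path length: return points
    interpolated at equidistant length intervals of size `step`, so the
    result is independent of the speed at which the mouse was moved."""
    # cumulative path length at each recorded point
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    out, target, i = [points[0]], step, 0
    while target <= cum[-1]:
        # advance to the segment containing the target path length
        while cum[i + 1] < target:
            i += 1
        seg = cum[i + 1] - cum[i]
        t = (target - cum[i]) / seg if seg > 0 else 0.0
        (x0, y0), (x1, y1) = points[i], points[i + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
        target += step
    return out
```

For the experiment in the text the interval would be the constant boundary-point spacing of 0.02 shown in Figure 17.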
Such a pattern produces 2-dimensional periodical time
series as targets for the neural networks. For this example we have also analyzed the frequency spectrum of the data and of the predicted outputs of different neural network models. The result is given in Figure 19, with blue dashed and green solid lines for the two time series, respectively. The SpiralRNN produces autonomous output trajectories with a frequency spectrum similar to that of the target data (compare Figures 19(b) and 19(a)), while the other models cannot.
Figure 16. Histogram (over 111 time series) of the committee-averaged SMAPE error
Figure 17. Segmentation by boundary points in mouse tracking. The hollow arrow indicates the movement direction, the dashed line is the movement path of the cursor, circles represent the positions measured by the computer every 1/100 second, and black dots mark the boundary points that segment the movement path of the cursor. The length between consecutive boundary points is a constant 0.02. The positions of these boundary points are taken as the training data at each "time" step
Figure 18. Mouse tracking example: the noisy line in the upper figure is the trajectory generated by the user; the smooth line is the test with the SpiralRNN trained online and iterated with its own output as next input. The lower figure displays the logarithm of the normalized one-step prediction mean square error, which decays during training
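The frequency-spectrum check described above can be sketched with a plain discrete Fourier transform. The helper below is illustrative: it reports the strongest nonzero frequency of one scalar component of a trajectory, which can be compared between target data and autonomous model output:

```python
import math, cmath

def dominant_frequency(signal):
    """Return the nonzero frequency (in cycles per step) with the
    largest DFT magnitude; the zero-frequency (mean) component is
    skipped, as in the diagrams of Figure 19."""
    n = len(signal)
    best_k, best_mag = 1, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k / n

# small demo with period 13; the data in the text has period 130,
# giving a dominant frequency of 1/130 = 0.008
demo = [math.sin(2 * math.pi * t / 13) for t in range(39)]
print(dominant_frequency(demo))  # → 3/39 = 1/13 ≈ 0.0769
```

A model that has captured the dynamics should reproduce the dominant frequency of the target; a model stuck in a fixed point shows no pronounced nonzero peak.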
If the training time was too short, the network usually runs into the closest of a set of fixed points; two of them are usually located within the two loops of the figure eight. This experience is confirmed with more complex patterns, where the required training time increases correspondingly. One reason is the usually large amount of noise due to manually moving the mouse. This noise prevents establishing a large enough short-term memory, which may be required for correct prediction, and favors the development of fixed-point attractors.
Sharp turns in a trajectory are also difficult to reproduce; typical examples are triangles or squares. Even if the network has learned to reproduce such a pattern, it always produces smooth turns, and it takes much more additional training to sharpen these turns.
Figure 19. Frequency analysis of the output trajectories after 5000 training steps of different neural network models with around 100 parameters each: (a) target, (b) SpiralRNN, (c) ESN, (d) BDRNN. The value at frequency zero is trivial and is therefore omitted from the diagrams; the period length of the data is 130, so the dominant frequency is 1/130 ≈ 0.008
3.3 Duty Cycle Adaptation in Wireless Sensor
Networks
In energy-autarkic wireless sensor networks, achieving a long lifetime requires the wireless sensors to switch off individual components as often as possible. Most energy is used by the transceiver, in transmit mode as well as in receive mode. Therefore, sensor nodes organize their activity in duty cycles consisting of an active period and a passive period with most components switched off. The longer the passive period, the smaller the duty cycle and the longer the lifetime of the sensor node.
One way of reducing the duty cycle effectively is to skip listening to neighboring sensor nodes for their data and to use predicted values instead. Likewise, a sensor node does not have to transmit its own sensor values to its neighbors in every interval. This implies that sensor nodes can increase their inactivity period, thus reducing the duty cycle. Of course, the prediction error will also increase with decreasing duty cycle. Therefore, there will be an optimal choice balancing prediction error and energy saving.
We analyzed this application example using a simula-
tion of a room equipped with 10 temperature sensor
nodes and 4 heat sources as shown in Figure 20. The
varying strength of the heat sources was modeled by a
4-dim. chaotic Lorenz time series with 6000 time steps; a
short interval is shown in Figure 21. The diffusion of
heat from the heat sources into the room was modeled by
a simple diffusion model, discretized in space and time,
with random additive noise. An example for the resulting
sensor values is shown in Figure 22.
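The chaotic driving signal can be sketched as follows. The paper's exact 4-dimensional construction is not specified, so this illustration integrates the classic 3-variable Lorenz system with explicit Euler steps; step size, initial condition and parameters are conventional choices, and the heat-source strengths would be obtained by rescaling components of this trajectory:

```python
def lorenz_series(n_steps, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Integrate the classic Lorenz system with explicit Euler steps
    and return the trajectory as a list of (x, y, z) tuples."""
    x, y, z = 1.0, 1.0, 1.0
    out = []
    for _ in range(n_steps):
        dx = sigma * (y - x)
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        out.append((x, y, z))
    return out

series = lorenz_series(6000)  # same length as the simulation in the text
```

Despite being deterministic, such a series never repeats exactly, which makes it a demanding test signal for online prediction.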
Each of the 10 sensor nodes is equipped with a Spi-
ralRNN supposed to predict the sensor node’s measured
temperature value and those of the neighboring nodes.
The SpiralRNN had 100 trainable weights for this
evaluation. The neighbor nodes are those nodes which
have a direct communication link to the sensor node (see
Figure 20). In regular intervals, the sensor nodes broad-
cast their current sensor value to their neighbors. One
complete exchange involving all 10 sensor nodes corre-
sponds to one time step in the simulation.
In order to reduce the duty cycle, the sensor nodes do not listen to their neighbors in every second time step. This saves energy for communication: with a normal duty cycle, a sensor node with k neighbors has to run its transceiver 2(k+1) times in two intervals; with the reduced duty cycle it runs 2+k times in two intervals, a reduction by a factor of (k+2)/(2k+2). With just one neighbor this is a factor of 0.75, with 4 neighbors it is 0.6, with 9 neighbors already 0.55, quite close to the large-k limit of 0.5.
Figure 20. Simulation environment with sensors (numbered circles) and heat sources (stars). Also shown are the communication links between sensor nodes
Figure 21. Example time series of the heat sources. Only one period is shown; similar patterns occur in other periods
Figure 22. Example of measured temperature values at the different sensors. Only one period is shown; similar patterns occur in other periods
Figure 23. Comparison of the relative prediction error for the normal duty cycle (continuous line, circles) and the reduced duty cycle (dotted line, squares; listening only every second interval). Shown are the mean, minimum and maximum of the rms error over 30 runs
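The reduction factors quoted above follow from a simple count of transceiver activations. The small helper below generalizes this count to listening only every m-th interval, which also covers the every-fourth-interval case discussed later:

```python
def reduction_factor(k, m):
    """Fraction of transceiver activations left when a node with k
    neighbors listens only every m-th interval: over m intervals it
    transmits m times and receives its k neighbors' values once,
    versus m*(k+1) activations with the normal duty cycle."""
    return (m + k) / (m * (k + 1))

# values quoted in the text
print(reduction_factor(1, 2))  # 0.75
print(reduction_factor(4, 2))  # 0.6
print(reduction_factor(9, 2))  # 0.55
print(reduction_factor(4, 4))  # 0.4
print(reduction_factor(9, 4))  # 0.325
```

The large-k limit of the factor is 1/m, i.e. 0.5 for listening every second interval and 0.25 for every fourth.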
However, the picture is not complete without looking
at the accuracy of the predicted instead of measured
sensor values. We performed 30 independent runs with
randomly chosen locations of sensors and heat sources,
random scaling of each of the heat sources with a factor
in the range [0,1] and additive noise of amplitude 0.01 at
each update of the heat diffusion on every grid point.
The result for the relative error, normalized for each sensor to the variance of its sensor value time series, is shown in Figure 23: for up to 5 prediction steps the performance with the reduced duty cycle is almost the same as with the normal duty cycle. The outliers with a normalized error larger than one are due to sensors with an almost constant sensor value whose variation is mainly due
to noise. This result indicates that there is even more room for duty cycle reduction in this example: listening to the neighbors only every fourth interval yields a duty cycle reduction factor of (k+4)/(4k+4); with 4 neighbors this gives a reduction factor of 0.4, with 9 neighbors it gives 0.325, and the large-k limit is 0.25.
4. Conditional Prediction
So far, the SpiralRNN has turned out to be excellent at online learning of the dynamics of stationary processes. However, in reality we often have to deal with non-stationary
processes. A typical example is cars approaching a crossing: without knowledge of their destination, one can usually predict their approach to the crossing and their departure from it, but not whether and in which direction they turn at the crossing. Even our daily activities consist of sequences of typical patterns such as walking, standing, talking, eating and drinking. Again, without additional knowledge about the motivation of a person it is usually impossible to predict the sequence of these patterns, even though each pattern itself can be predicted very well.
4.1 Application Example: Warehouse Logistics
We consider an application example from the logistics
domain: A warehouse stores goods of different type in
different areas (refer to Figure 24). Newly delivered
goods have to be transported to these areas by trucks; we
assume that these trucks always use the same trajectory
to a specific destination. Each truck is equipped with a
SpiralRNN learning the different trajectories. However,
different from the previous examples, we assume that additional information is available about the type of product; for example, this could be provided by an RFID reader at the entrance, registering newly delivered goods
carrying RFID tags with product information. After having learned the trajectories together with this additional information, the SpiralRNN receives only this additional information and has to recall the trajectory. This can be used for diagnosis purposes (e.g. is it the correct trajectory?) or for autonomous truck driving.
In addition to the current position coordinates, the input vector of the SpiralRNN now also contains an additional trigger signal [12]. The network output is trained to predict the changes in this vector. At the occurrence of a trigger signal, the hidden state of the network is reset to zero and the input vector consists of an arbitrary but otherwise fixed position vector and of the trigger signal; the latter is kept at this initial value until another trigger signal indicates completion of the trajectory or the start of a new one. At subsequent steps, the network, while in online learning mode, learns to predict the change in position and the change in the trigger signal; the latter is actually zero. In recall mode, the previous output added to the previous input, together representing the new position and trigger signal, is fed back as input for the next time step.
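The input/target encoding described above can be sketched as follows. The function and the default start position are illustrative; the paper only requires the start vector to be arbitrary but fixed, and the trigger value to stay constant along one trajectory:

```python
def make_training_sequence(trajectory, trigger_value, start_pos=(0.0, 0.0)):
    """Build (input, target) pairs for one trajectory: the first input
    holds the fixed arbitrary start position plus the trigger value,
    and each target is the change of the input vector; the trigger
    change is zero, since the trigger stays constant."""
    inputs, targets = [], []
    prev = (start_pos[0], start_pos[1], trigger_value)
    for (x, y) in trajectory:
        inputs.append(prev)
        targets.append((x - prev[0], y - prev[1], 0.0))
        prev = (x, y, trigger_value)
    return inputs, targets
```

In recall mode the same arithmetic runs in reverse: adding the network's predicted change to the previous input yields the next input, so the trajectory unrolls from the trigger signal alone.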
Figure 24. Application example "warehouse": depending on their type, newly delivered goods have to be stored in different areas labeled A-F. The trajectories used by the trucks are shown as well
Figure 25. Conditional prediction of the trajectory to area A (circles). The starting point of the trajectory is at [0.4, 0]
Figure 26. Conditional prediction of trajectory to area B (circles). The starting point of the trajectory is at [0.4,0].
Figure 27. Conditional prediction of trajectory to area C (circles). The starting point of the trajectory is at [0.4,0].
Figure 28. Conditional prediction of trajectory to area D (circles). The starting point of the trajectory is at [0.4,0].
Figures 25-28 show the performance of a trained SpiralRNN in reproducing the trajectories to the areas A, B, C and D given just the initial trigger signal. The network had 200 trainable weights, and each trajectory was presented 20 times, until the root mean square error (RMSE) for each trajectory was below a threshold of 0.05, i.e. half a grid spacing. Obviously, the trajectories have been learned quite well.
4.2 Storage Capacity
Encouraged by the good performance of this approach,
we investigated the storage capacity of SpiralRNN in
dependence on the number of hidden neurons and the
length of the pattern time series. For this purpose we created random two-dimensional trajectories, each with a different, randomly chosen trigger signal value. We considered trajectories of 6 and 11 steps. The performance of a trajectory prediction is evaluated, as above, as the MSE over the whole trajectory. We consider a number of trajectories trainable if, within at most 30 trials each consisting of 50 presentations of each trajectory, the RMSE over all trajectories is at least once below the threshold of 0.05. The whole process consists of the following 5 steps:
1) The experiment starts with small networks (the value of nh starts from 6, such that the network comprises 3 hidden units and each hidden unit has 2 hidden nodes); the training data is the combination of np = 2 trajectory patterns.
2) For each pair (nh, np), at most 30 simulations are performed, each of which contains 50 training rounds. In each training round, the learning model is trained online with a presentation of these np trajectory patterns in a shuffled sequence, each pattern only once. The hidden-state vector of the network is reset before the training with any trajectory pattern starts. After every 10 training rounds, a testing round tries to reproduce all np trajectories separately in an autonomous manner.
3) The autonomous outputs are evaluated. Let lp be the length of the trajectory patterns (the lp value is identical for all patterns), xi,k the ith-step-ahead prediction for the kth pattern, and x̂i,k the respective target; the evaluation error ε is calculated as:

ε = 1/(np·lp) · Σ_{i=1..lp} Σ_{k=1..np} (xi,k − x̂i,k)²   (21)

4) If ε ≤ 0.05² is satisfied, this particular SpiralRNN with nh hidden nodes is claimed to be capable of reproducing np trajectory patterns of length lp, and thus the associative memory capacity satisfies Ma ≥ np. The model (with the same structure but re-initialized values) is then assessed with training data containing more patterns, i.e. a higher np value. If ε ≤ 0.05² is not satisfied at least once within the 30 simulations, it is claimed that a SpiralRNN with nh hidden nodes is not able to reproduce np trajectory patterns of length lp. A new experiment then starts with the same np value but a larger SpiralRNN model, obtained by increasing nh by 3. Note that the threshold value is set to 0.05, half the grid distance of 0.1, so that the prediction can be rounded to the nearest grid point if the error is less than 0.05.
5) The experiments continue until the simulations with nh = 60 are finished.
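The five steps can be sketched as a search loop. Here `train_and_test` is a placeholder for the actual SpiralRNN training, treated as a black box: one call stands for one full simulation (50 training rounds of a freshly initialized network with nh hidden nodes on np shuffled patterns) and returns the smallest evaluation error ε of Equation (21) seen in its testing rounds:

```python
def capacity_curve(train_and_test, max_trials=30, threshold=0.05 ** 2):
    """Sketch of the capacity-search protocol (steps 1-5): for each
    network size nh, increase the number of patterns np as long as
    some trial reaches eps <= 0.05**2; otherwise grow the network."""
    results = {}        # nh -> largest trainable np (memory capacity Ma)
    nh, np_pat = 6, 2   # step 1: smallest network, two patterns
    while nh <= 60:     # step 5: stop after nh = 60
        # steps 2 and 4: up to max_trials independent simulations
        trainable = any(train_and_test(nh, np_pat) <= threshold
                        for _ in range(max_trials))
        if trainable:
            results[nh] = np_pat   # Ma >= np for this network size
            np_pat += 1            # retry with one more pattern
        else:
            nh += 3                # step 4: same np, larger network
    return results
```

Plugging a synthetic oracle with capacity 0.5·nh into this loop reproduces the expected linear nh-to-np relationship, which is how Figures 29 and 30 are read.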
Figures 29 and 30 show the dependence of the number of trainable patterns on the number of hidden neurons. The graphs show an almost linear dependence: for short patterns of length 6 we obtain a ratio np/nh = 0.64 ± 0.04, and for long patterns of length 11 we obtain np/nh = 0.31 ± 0.02.
5. TinyOS Implementation
Finally, we report on an implementation of SpiralRNN on a typical wireless sensor platform. One example of such a platform is the TelosB [13] sensor mote from the University of California at Berkeley. It provides sensors for temperature, light and humidity. A microprocessor (Texas Instruments MSP430, 16 bit, 8 MHz, 48 kB ROM, 10 kB RAM) runs software for communication, data processing and energy management. A wireless transceiver (Chipcon CC2420, IEEE 802.15.4, 2.4 GHz, 250 kbps) provides communication links to other sensor motes or to a gateway.
Figure 29. Pattern storage capacity of SpiralRNN with different numbers of hidden nodes, for patterns of length 6
Figure 30. Pattern storage capacity of SpiralRNN with different numbers of hidden nodes, for patterns of length 11
Figure 31. Structure of a typical smart sensor mote. TinyOS is an operating system for embedded systems such as smart sensor motes and provides interfaces between hardware and software; applications are written in the NesC language and compiled against TinyOS
We have deployed TinyOS, a small-footprint operating system developed at the University of California at Berkeley, on such motes. Programs developed for TinyOS are written in the programming language NesC [14]. Figure 31 shows the typical software architecture of a sensor mote. It allows implementing even advanced functionality like analysis and learning on such embedded devices.
In order to accelerate task execution, we replaced the floating-point operations by suitable fixed-point operations. Thus we achieved a runtime for one online learning iteration of the SpiralRNN with a single input/output node and 6 hidden nodes (24 parameters in total) of roughly 300 milliseconds, a reduction by about a factor of 8 compared to a floating-point implementation. Figure 32 shows a comparison of the floating-point model and the fixed-point model with respect to processing time and memory requirement (RAM) on a TelosB sensor mote (both models have a single input and output neuron). The limited memory capacity (10 kB) of such a sensor mote prevents deployment of floating-point model images with more than 36 parameters, or running them with more than 32 parameters. In contrast, the fixed-point model takes less than one second per online learning step for networks with up to 40 weight parameters (i.e. 10 hidden nodes). Similarly, the fixed-point model requires less RAM, allowing work with more complex neural networks.
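The idea behind the fixed-point replacement can be illustrated with a Q16.16 format; the number of fractional bits actually used on the mote is not stated in the text, and on the 16-bit MSP430 the values would live in fixed-width integers rather than Python's arbitrary-precision ones:

```python
Q = 16  # fractional bits; an illustrative choice, not the mote's format

def to_fix(x):
    """Encode a float as a Q16.16 fixed-point integer."""
    return int(round(x * (1 << Q)))

def fix_mul(a, b):
    """Multiply two fixed-point numbers; the raw product carries 2*Q
    fractional bits, so it is shifted back by Q. (Python's >> floors,
    which introduces a small bias for negative operands.)"""
    return (a * b) >> Q

def to_float(a):
    """Decode a Q16.16 fixed-point integer back to a float."""
    return a / (1 << Q)

# example: 1.5 * 2.25 = 3.375, exactly representable in Q16.16
print(to_float(fix_mul(to_fix(1.5), to_fix(2.25))))  # → 3.375
```

Replacing each floating-point multiply-add by an integer multiply, shift and add is what yields the roughly eightfold speedup on a microcontroller without a floating-point unit.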
It is also interesting to compare the processing time of
the pure feed-forward step with the complete online
learning step. Figure 33 compares two fixed-point mod-
els with feed-forwarding (FFD) only and with full online
learning (FULL) including feed-forwarding, gradient
calculation and extended Kalman filter update. Obvi-
ously, the full online learning step requires much more
time; while a pure feed-forward step requires only 11ms
for 48 weight parameters, the full online learning step
takes about 1500ms for the same network. As a conse-
quence, one can reduce the number of learning steps
when the neural network is well trained and thus save
quite some energy.
We close this section with a comparison of a theoretical estimate of the computational complexity with the measured times for one update step. Measured in multiples of the summed effort for an average elementary fixed-point or floating-point operation, consisting of a multiplication, an addition and potentially some copying, the cost of one feed-forward iteration is proportional to nh² + (2d+1)nh, with nh the number of hidden neurons and d the dimension of the input and output. This can be expressed in terms of nw = (2d+2)nh + d − 1, the number of weights, leading to nh = (nw − d + 1)/(2d+2). The effort for the derivative calculation is proportional to nh(nh + d)nw, and the effort for the extended Kalman filter updates is proportional to d·nw² + d²·nw + d³. Although the complexity of the matrix inversion in Equation (4) scales as d³, in practice this is not the dominant contribution, as d ≪ nw can be assumed. The dominant contribution to the full online learning step comes from the term nh²·nw ≈ nw³/(2d+2)² of the gradient calculation and the Kalman filter update. For the pure feed-forward step the leading term is nh² ≈ nw²/(2d+2)², i.e. a factor nw smaller than for the full online learning step.
These estimates are compared with the measured processing times in Figure 34. Here, the processing times are plotted against the estimated effort for d = 1 and varying nw. The linear relationships and the very similar proportionality coefficients confirm the theoretical estimates. As a by-product we obtain a typical processing time per elementary operation of about 0.063 ms.
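The leading-order estimates can be turned into a small calculator. The polynomial coefficients below follow the estimates in the text and are approximations, not exact operation counts:

```python
def weights(nh, d):
    """Number of trainable weights: nw = (2d+2)*nh + d - 1."""
    return (2 * d + 2) * nh + d - 1

def feedforward_ops(nh, d):
    """Leading-order elementary operations of one feed-forward step."""
    return nh * nh + (2 * d + 1) * nh

def full_step_ops(nh, d):
    """Leading-order operations of a full online learning step
    (gradient calculation plus EKF update); dominant term nh**2 * nw."""
    nw = weights(nh, d)
    return nh * nh * nw + d * nw * nw

MS_PER_OP = 0.063  # measured time per elementary operation (see above)

print(weights(6, 1))   # → 24, the parameter count quoted for 6 hidden nodes
print(feedforward_ops(12, 1) * MS_PER_OP)   # ≈ 11.3 ms; cf. the 11 ms
                                            # measured for nw = 48
print(full_step_ops(12, 1) / feedforward_ops(12, 1))  # ≈ 51, roughly the
                                                      # factor nw = 48
```

The ratio of full-step to feed-forward cost growing like nw is what motivates skipping learning steps once the network is well trained.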
(a) Processing time
(b) RAM requirement
Figure 32. Comparison of processing time and RAM requirement between floating-point and fixed-point models installed on TelosB sensor motes, for different numbers of model parameters
Figure 33. Comparison of processing time between models with feed-forward functionality only and with full functionality (including feed-forwarding, gradient calculation and extended Kalman filter learning)
6. Conclusions
SpiralRNNs have shown excellent prediction performance and stability during online learning, making them a good candidate for application in autonomous systems. A first example is wireless sensor networks, where SpiralRNNs can be used to reduce duty cycles and thus energy consumption, leading to an increased battery lifetime.
Another interesting and not so rare use case is conditional prediction. Here, different temporal patterns are accompanied by an additional constant signal which can be derived from an initial trigger signal. This approach allows training different patterns reliably, provided a trigger signal is available. Such trigger signals can also be used to suspend online training, e.g. in periods with just noise on the input.
We have analyzed the storage capacity of SpiralRNN in conditional prediction tasks. It turned out that the number of successfully trained temporal patterns is proportional to the number of hidden neurons. This result is qualitatively similar to the memory capacity of Hopfield networks [15]. As expected, the storage capacity for longer patterns is smaller than for shorter patterns. Further investigations might clarify whether there is a typical storage capacity per hidden node and pattern length.
Finally, we have reported on an embedded system implementation of SpiralRNN on a typical wireless sensor platform. A significant reduction of the runtime has been achieved by applying fixed-point arithmetics. This allows developing solutions based on online learning of dynamics with reasonably complex models. A typical example is duty cycle adaptation in wireless sensor networks.
(a) Feedforward
(b) Full online learning step
Figure 34. Processing time as a function of the theoretical estimate of computational complexity for the fixed-point implementation: feedforward (top) and full online learning step (bottom). The linear behavior and the almost identical proportionality factors of 0.063 ms/operation indicate the validity of the theoretical estimate
Appendix
The online learning rule can be derived from a Bayesian approach for the probability density f(w_t | X̂_t) of the weights w given all previous data X̂_t = {x̂_t, x̂_{t−1}, ..., x̂_0}.
We start with a generic dynamic system:

s_t = H(s_{t−1}, x̂_{t−1}, w_t)   (22)
x̂_t = G(s_t, w_t, x̂_{t−1}) + ν_t   (23)
w_t = w_{t−1} + η_t   (24)

Equation (22) describes the dynamic evolution of the system by a (hidden) state vector s_t at time t. The value of s_t depends in some, usually nonlinear, way on the previous hidden state vector s_{t−1}, on the previous observation vector x̂_{t−1} and on the model parameters w_t. Equation (23) describes the measurement process: x̂_t is the vector of observations (sensor data) at time t, and G(...) is the model estimate of these observations based on s_t, w_t and x̂_{t−1}. The variable ν_t is the measurement noise vector, assumed to be normally distributed with probability distribution function (p.d.f.) f(ν_t) = N_{ν_t}(0, R_t). Finally, Equation (24) describes the evolution of the model parameters: the dynamics of the environment within a reasonable time window are assumed to be stationary, i.e. the model parameters are static up to additive random fluctuations η_t; these are assumed to be normally distributed with zero mean and covariance matrix Q_t, and thus with corresponding p.d.f. f(η_t) = N_{η_t}(0, Q_t).
Using the Bayes rule, the chain rule and the Chapman-Kolmogorov Equation (16), the conditional p.d.f. f(w_t | X̂_t) for the model parameters at time t, given the current and all previous observations X̂_t = {x̂_t, x̂_{t−1}, ..., x̂_0}, is given by the following equations:

f(w_t | X̂_t) = f(x̂_t | w_t, X̂_{t−1}) f(w_t | X̂_{t−1}) / f(x̂_t | X̂_{t−1})   (25)
f(x̂_t | w_t, X̂_{t−1}) = ∫ f(x̂_t | s_t, w_t, X̂_{t−1}) f(s_t | w_t, X̂_{t−1}) ds_t   (26)
f(w_t | X̂_{t−1}) = ∫ f(w_t | w_{t−1}, X̂_{t−1}) f(w_{t−1} | X̂_{t−1}) dw_{t−1}   (27)

Equation (25) is the famous Bayes rule, Equation (26) introduces the hidden-state vector s_t into the "evidence" p.d.f., and Equation (27) is the "prior" p.d.f. for the model parameters w_t. The last equation introduces the conditional p.d.f. f(w_{t−1} | X̂_{t−1}), which allows interpreting Equations (25) to (27) as iterative update equations for f(w_t | X̂_t).
Before the online training rules are derived from these update equations, one has to introduce some assumptions and approximations for the specification of each p.d.f.:

(Sa) f(w_{t−1} | X̂_{t−1}) = N_{w_{t−1}}(w̄_{t−1}, P_{t−1}) is assumed to be normally distributed with mean w̄_{t−1} and covariance matrix P_{t−1}. In general this p.d.f. will be multi-modal; however, if the environment does not change its dynamics too fast and if the model is powerful enough, then the model parameters should become static and this assumption is justified.
(Sb) f(w_t | w_{t−1}, X̂_{t−1}) = N_{w_t}(w_{t−1}, Q_t), according to the assumption made in Equation (24).
(Sc) f(s_t | w_t, X̂_{t−1}) = δ(s_t − H(H(..., x̂_{t−2}, w_t), x̂_{t−1}, w_t)), with the dots indicating an iteration of the function H(.) over the previous hidden states s_{t−1}, s_{t−2}, ... according to Equation (22). The Dirac delta function δ(.) corresponds to a normal distribution with infinitesimally small covariance matrix and reflects the assumption that any uncertainty comes from the measurements and from the (random) changes of the model parameters.
(Sd) f(x̂_t | s_t, w_t, X̂_{t−1}) in Equation (26) is normally distributed with mean value G(s_t, w_t, x̂_{t−1}) and covariance matrix R_t according to Equation (23), i.e. f(x̂_t | s_t, w_t, X̂_{t−1}) = N_{x̂_t}(G(s_t, w_t, x̂_{t−1}), R_t).

With the specifications (Sa) and (Sb), Equation (27) is a convolution of normal distributions, and thus f(w_t | X̂_{t−1}) can be easily calculated:

f(w_t | X̂_{t−1}) = N_{w_t}(w̄_{t−1}, P̄_t),   P̄_t = P_{t−1} + Q_t   (28)

Given the delta-function-like p.d.f. f(s_t | w_t, X̂_{t−1}) in specification (Sc) and the p.d.f. f(x̂_t | s_t, w_t, X̂_{t−1}) in specification (Sd), one finds:

f(x̂_t | w_t, X̂_{t−1}) = N_{x̂_t}(G(s_t(w_t), w_t, x̂_{t−1}), R_t)   (29)
s_t(w_t) = H(H(..., x̂_{t−2}, w_t), x̂_{t−1}, w_t)   (30)

The same assumption as made in specification (Sa) for f(w_{t−1} | X̂_{t−1}), namely the slow change of the model parameters w_t once the model has been well trained, justifies another approximation: the function G(.) can be linearized with respect to its dependence on the model parameters w_t. In particular,

G(s_t, w_t, x̂_{t−1}) ≈ x̄_t + g_t (w_t − w̄_{t−1})   (31)

where the output vector x̄_t and the gradient matrix g_t are respectively defined as:

x̄_t = G(s̄_t, w̄_{t−1}, x̂_{t−1})   and   g_t = dG/dw_t |_{w_t = w̄_{t−1}}   (32)

Finally, the definition of s̄_t is given here. According to Equation (30) and Equation (31), s̄_t should be the hidden state vector calculated with the previous model parameters w̄_{t−1}. However, this would require a recalculation of the whole history whenever the model parameters change, and is therefore not a viable solution for online learning in a distributed sensor application. Instead, the approximation in Equation (33) is applied, which implies that the model takes s̄_t as the flawless hidden value:

s̄_t = H(H(..., x̂_{t−2}, w̄_{t−2}), x̂_{t−1}, w̄_{t−1})   (33)

With the approximations in Equations (31), (32) and (33), the calculation of f(x̂_t | w_t, X̂_{t−1}) in Equation (29) can be rewritten as:

f(x̂_t | w_t, X̂_{t−1}) ≈ N_{x̂_t}(x̄_t + g_t (w_t − w̄_{t−1}), R_t)   (34)

Combining Equation (28) and Equation (34), the p.d.f. f(w_t | X̂_t) in Equation (25) ends up a normal distribution, Equation (35). Note that P̄_t is computed in Equation (28) and g_t is the gradient matrix:

f(w_t | X̂_t) = N_{w_t}(w̄_t, P_t)   (35)
w̄_t = w̄_{t−1} + P_t g_t^T R_t^{−1} (x̂_t − x̄_t)   (36)
P_t = P̄_t − P̄_t g_t^T (g_t P̄_t g_t^T + R_t)^{−1} g_t P̄_t   (37)

Equation (35) is consistent with specification (Sa), whilst Equation (36) and Equation (37) constitute the extended Kalman filter (EKF) (for more details see [16,17,18]).
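Equations (28), (36) and (37) can be sketched for the simplest case of a scalar observation (d = 1), where the matrix inversion in Equation (37) reduces to a scalar division. The isotropic treatment of Q_t (the same variance added to every diagonal element of P) is a simplifying assumption of this sketch:

```python
def ekf_step(w, P, g, x_obs, x_pred, R, Q):
    """One extended Kalman filter weight update for a scalar
    observation: w is the weight vector (list), P its covariance
    (list of lists), g the gradient row vector dG/dw, and R, Q
    scalar noise variances."""
    n = len(w)
    # Equation (28): P_bar = P + Q (Q assumed isotropic)
    Pb = [[P[i][j] + (Q if i == j else 0.0) for j in range(n)]
          for i in range(n)]
    # innovation covariance s = g P_bar g^T + R (a scalar for d = 1)
    Pg = [sum(Pb[i][j] * g[j] for j in range(n)) for i in range(n)]
    s = sum(g[i] * Pg[i] for i in range(n)) + R
    # Equation (37): P = P_bar - P_bar g^T s^{-1} g P_bar
    P_new = [[Pb[i][j] - Pg[i] * Pg[j] / s for j in range(n)]
             for i in range(n)]
    # Equation (36): w = w + P g^T R^{-1} (x_obs - x_pred)
    PNg = [sum(P_new[i][j] * g[j] for j in range(n)) for i in range(n)]
    w_new = [w[i] + PNg[i] * (x_obs - x_pred) / R for i in range(n)]
    return w_new, P_new

# one update with a single weight: the Kalman gain of 0.5 halves the error
w1, P1 = ekf_step([0.0], [[1.0]], [1.0], x_obs=1.0, x_pred=0.0, R=1.0, Q=0.0)
print(w1, P1)  # → [0.5] [[0.5]]
```

Note that P g^T R^{−1} in Equation (36) equals the usual Kalman gain P̄ g^T s^{−1}, so this form is equivalent to the textbook update.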
REFERENCES
[1] T. Schott, "Global megatrends and their effects on production of the future," in Conference Proceedings, 5th International Conference on Industrial Informatics, Vienna, Austria, Vol. 1, pp. 18, July 23-27, 2007. http://www.indin2007.org/keynote_schott.php
[2] Wikipedia, Ambient intelligence. http://en.wikipedia.org/wiki/Ambient_intelligence
[3] X. Cai, N. Zhang, G. Venayagamoorthy, and D. C. Wunsch II, "Time series prediction with recurrent neural networks using a hybrid PSO-EA algorithm," in Proceedings of IEEE International Joint Conference on Neural Networks, N. Zhang, Ed., Vol. 2, pp. 1647-1652, 2004.
[4] H.-G. Zimmermann, R. Grothmann, A. Schäfer, and C. Tietz, "Dynamical consistent recurrent neural networks," in Proceedings of IEEE International Joint Conference on Neural Networks IJCNN '05, R. Grothmann, Ed., Vol. 3, pp. 1537-1541, 2005.
[5] J. L. Elman, "Finding structure in time," Cognitive Science, Vol. 14, No. 2, pp. 179-211, 1990.
[6] A. Waibel, T. Hanazawa, G. Hinton, and K. Shikano, "Phoneme recognition using time-delay neural networks," IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 37, No. 3, pp. 328-339, 1989.
[7] P. Mastorocostas and J. Theocharis, "On stable learning algorithm for block-diagonal recurrent neural networks, part 1: The RENNCOM algorithm," IEEE International Joint Conference on Neural Networks, Vol. 2, pp. 815-820, 2004.
[8] H. Jäger, "The 'echo state' approach to analyzing and training recurrent neural networks," German National Research Center for Information Technology, Technical Report GMD 148, 2001.
[9] H. Gao, R. Sollacher, and H.-P. Kriegel, "Spiral recurrent neural network for online learning," in Proceedings of the 15th European Symposium on Artificial Neural Networks, ESANN'2007, Bruges, Belgium, April 25-27, 2007, M. Verleysen, Ed., pp. 483-488, 2007.
[10] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, Vol. 1, pp. 270-280, 1989.
[11] R. Sollacher and H. Gao, "Efficient online learning with spiral recurrent neural networks," in Proceedings of the World Conference on Computational Intelligence WCCI '08, 2008.
[12] H. Gao and R. Sollacher, "Conditional prediction of time series using spiral recurrent neural network," in Proceedings of the European Symposium on Artificial Neural Networks - Advances in Computational Intelligence and Learning, Bruges, Belgium, April 23-25, 2008.
[13] J. Polastre, R. Szewczyk, and D. Culler, "Telos: Enabling ultra-low power wireless research," in The Fourth International Conference on Information Processing in Sensor Networks: Special track on Platform Tools and Design Methods for Network Embedded Sensors, pp. 364-369, 2005.
[14] D. Gay, P. Levis, R. von Behren, M. Welsh, E. Brewer, and D. Culler, "The nesC language: A holistic approach to networked embedded systems," ACM SIGPLAN Notices, Vol. 38, No. 5, pp. 1-11, 2003. http://www.tinyos.net/
[15] R. J. McEliece, E. C. Posner, E. R. Rodemich, and S. S. Venkatesh, "The capacity of the Hopfield associative memory," IEEE Transactions on Information Theory, Vol. 33, No. 4, pp. 461-482, 1987.
[16] F. Lewis, "Optimal estimation: With an introduction to stochastic control theory," A Wiley-Interscience Publication, ISBN 0-471-83741-5, 1986.
[17] R. Kalman, "A new approach to linear filtering and prediction problems," Transactions of the ASME - Journal of Basic Engineering, Vol. 82, pp. 35-45, 1960.
[18] G. Welch and G. Bishop, "An introduction to the Kalman filter," University of North Carolina at Chapel Hill, Department of Computer Science, Technical Report TR 95-041, 2002.