J. Intelligent Learning Systems & Applications, 2009, 1, 1-27
Published Online November 2009 in SciRes (www.SciRP.org/journal/jilsa)
Copyright © 2009 SciRes JILSA
Towards Real-World Applications of Online Learning
Spiral Recurrent Neural Networks
Rudolf SOLLACHER1, Huaien GAO2
1Siemens AG, Corporate Technology, Munich, Germany; 2Department of Computer Science, Ludwig-Maximilians University, Munich, Germany.
Email: Rudolf.Sollacher@siemens.com, gheken@hotmail.com
Received October 21st, 2008; revised January 19th, 2009; accepted January 21st, 2009.
ABSTRACT
Distributed intelligent systems like self-organizing wireless sensor and actuator networks are supposed to work mostly autonomously even under changing environmental conditions. This requires robust and efficient self-learning capabilities implementable on embedded systems with limited memory and computational power. We present a new solution called Spiral Recurrent Neural Networks (SpiralRNN) with an online learning scheme based on an extended Kalman filter and gradients as in Real Time Recurrent Learning. We illustrate its stability and performance using artificial and real-life time series and compare its prediction performance to other approaches. SpiralRNNs perform very stably and show an accuracy which is superior or similar to other state-of-the-art approaches. In a memory capacity evaluation the number of simultaneously memorized and accurately retrievable trajectories of fixed length was counted. This capacity turned out to be a linear function of the size of the recurrent hidden layer, with a memory-to-size ratio of 0.64 for shorter trajectories and 0.31 for longer trajectories. Finally, we describe two potential applications in building automation and logistics and report on an implementation of online learning SpiralRNNs on a wireless sensor platform under the TinyOS embedded operating system.
Keywords: recurrent neural network, online learning, prediction, sensor actuator network
1. Introduction
Visions like the “intelligent factory” [1] or “ambient in-
telligence” [2] rely heavily on embedded systems with a
high degree of autonomy. Of particular interest are en-
ergy autarkic self-organizing wireless sensor networks
due to their simplified installation and robustness during
operation. These consist of so called “sensor nodes”,
hardware platforms with usually a battery, a microproc-
essor, a transceiver module and one or more sensors.
They are able to communicate wirelessly with each other;
suitable protocols enable them to set up a network by
themselves. If necessary, data are relayed by other nodes
to a destination. Data may also be processed locally or
inside the network.
A very useful kind of data processing is the forecasting of data. Examples of applications in building automation and logistics are given at the end of this paper. Solutions for this task have to fulfill at least the following requirements in order to be really useful:
R1 The limited computational power and memory of
these sensor nodes puts a limit on the computa-
tional complexity of the forecasting solution.
R2 As the environment of the sensor network is usually unknown in advance, the solution has to learn a forecast model online and in an efficient way.
Neural networks, and in particular recurrent neural networks, have proven their suitability at least for offline learning forecast tasks. Examples can be found in [3] or
[4]. However, traditional approaches like Elman or stan-
dard recurrent neural networks (SRN) [5], time delay
neural networks (TDNN) [6], block-diagonal recurrent
neural networks (BDRNN) [7] or echo state neural net-
works (ESN) [8] have deficiencies with respect to the
requirements above. For a TDNN one has to specify the
length of the history to be considered as input. If this is
chosen too small, some time series like spike time series with large inter-spike intervals cannot be predicted reliably. Elman networks avoid this problem due to their recurrent hidden layer; this allows them to adjust their
short term memory in a data driven way by online learn-
ing. However, their basically unbounded recurrent
weight matrix sometimes leads to dynamically unstable
behavior making them less attractive for online learning
purposes. Echo state networks try to cure this deficiency
by choosing random but constant weights for the recurrent weight matrix, but such that its absolute eigenvalues are smaller than one; only the weights of the output layer are trained. Unfortunately, with this approach the recurrent layer has to be large enough in order to provide a rich reservoir of responses from which to construct the output signals. Block diagonal recurrent neural networks
try to cure the dynamical instabilities by a specific struc-
ture of the recurrent layer. The corresponding weight
matrix is zero except for 2×2 diagonal block matrices.
These blocks can be trained but there is a constraint on
their maximum eigenvalue. This constraint is either im-
plemented implicitly via training or explicitly via a suit-
able parametrization of the weights. For online learning
the scaled orthogonal version described in [7] is more
useful. However, this structure seemed to be too con-
strained leading to poor performance for some prediction
problems. This motivated us to develop a new approach
called Spiral Recurrent Neural Network (SpiralRNN)
[9].
2. Spiral Recurrent Neural Networks
(SpiralRNN)
2.1 Architecture
Like other Recurrent Neural Network (RNN) models,
SpiralRNN consists of an input layer feeding into a re-
current hidden layer and an output layer which receives
input from the recurrent hidden layer. The hidden neuron
states $s_t$ and the output neuron states $x_t$ satisfy the following update equations:

$$s_t = g\left(W_{hid}\, s_{t-1} + W_{in}\, \hat{x}_{t-1} + b_{hid}\right)$$
$$x_t = h\left(W_{out}\, s_t + b_{out}\right) \quad (1)$$

Here, $b_{hid}$ and $b_{out}$ are bias vectors, the matrices $W_{in}$, $W_{hid}$ and $W_{out}$ represent synaptic connections between layers, and $g(\cdot)$ and $h(\cdot)$ are activation functions. Using the data $\hat{x}_{t-1}$ instead of the previous network output $x_{t-1}$ as input in Equation (1) corresponds to teacher forcing; this is used whenever data are available.
The main difference to other approaches lies in the structure of the recurrent hidden matrix $W_{hid}$. Motivated by the analysis underlying the development of echo state neural networks (ESN) [8], we have chosen a constraint on the weights of the recurrent layer in order to guarantee stability, respectively the "echo state" property, while still allowing the weights to be trained. With the following structure,

$$W_{hid} = \beta \begin{pmatrix} 0 & \alpha_1 & \alpha_2 & \cdots & \alpha_{l-1} \\ \alpha_{l-1} & 0 & \alpha_1 & \cdots & \alpha_{l-2} \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ \alpha_1 & \alpha_2 & \cdots & \alpha_{l-1} & 0 \end{pmatrix} \quad (2)$$

with $l$ the number of hidden nodes in the recurrent layer, $\alpha_k = \tanh(\nu_k)$, $k \in \{1, \ldots, l-1\}$, the $\nu_k$ trainable parameters and a suitable fixed $\beta \in \mathbb{R}$, one can guarantee a bounded eigenvalue spectrum of $W_{hid}$ and thus stability in the sense of the "echo state" property:

$$\left|\lambda(W_{hid})\right| \le \beta \sum_{k=1}^{l-1} \left|\tanh(\nu_k)\right| < \beta\,(l-1) \quad (3)$$

The structure of $W_{hid}$ is best illustrated with Figure 1, also giving an idea for the choice of the name "SpiralRNN".
It turned out that a block structure in case of several
inputs is a good compromise concerning computational
complexity and performance (see Figure 2). Although
the recurrent hidden layer has this block structure, all
input neurons are coupled to all hidden neurons; the
same holds for the output neurons.
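To make the construction concrete, the following minimal NumPy sketch builds the spiral weight matrix of Equation (2), checks the eigenvalue bound of Equation (3) and performs one state update of Equation (1). The variable names, network size and initialization values are ours, chosen for illustration only:

```python
import numpy as np

def spiral_whid(nu, beta):
    """Build the spiral recurrent weight matrix of Equation (2) from
    trainable parameters nu (length l-1) and a fixed scale beta."""
    l = len(nu) + 1
    alpha = np.tanh(nu)            # bounded entries |alpha_k| < 1
    W = np.zeros((l, l))
    for i in range(l):
        for k in range(1, l):      # neuron i connects to neuron i+k (mod l)
            W[i, (i + k) % l] = alpha[k - 1]
    return beta * W

rng = np.random.default_rng(0)
l = 8
nu = rng.normal(size=l - 1)
beta = 1.0 / (l - 1)               # keeps all eigenvalues below one
W = spiral_whid(nu, beta)

# Eigenvalue bound of Equation (3):
assert np.max(np.abs(np.linalg.eigvals(W))) < 1.0

# One hidden-state update of Equation (1) with illustrative
# input weights and biases (teacher-forced input x_hat):
W_in = rng.normal(scale=0.1, size=(l, 1))
b_hid = np.zeros(l)
s = np.zeros(l)
x_hat = np.array([0.5])
s = np.tanh(W @ s + W_in @ x_hat + b_hid)
```

The zero diagonal and the shared parameters along the "spiral" connections are what keep the eigenvalue bound independent of the individual weight values.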
2.2 Online Learning
Due to the second requirement mentioned in the intro-
duction the learning of the neural network weights has to
be online. We understand online learning as a continuous
update of weights controlled only by the data. Realtime
Recurrent Learning (RTRL) is based on an approximate
error gradient taking into account all previous steps; it
sometimes shows slow or poor convergence. Particle filtering and evolutionary approaches require much memory in order to provide reasonable behavior. As explained in [9] the optimal choice is an extended Kalman filter (EKF) based on gradients calculated as in Realtime Recurrent Learning (see appendix).

Figure 1. Structure of SpiralRNN recurrent hidden layer for 1-dimensional input; only the connections rooted in one neuron are shown

Figure 2. SpiralRNN for 3-dimensional input; the recurrent hidden matrix is split into 3 blocks but all other layers are fully connected

According to this derivation, the iterative update rule for the weight vector $w_t$, representing all weight matrices $W_{in}$, $W_{hid}$ and $W_{out}$ and all bias vectors $b_{hid}$ and $b_{out}$, is the following:

$$P_t^{\dagger} = P_{t-1} + Q_t$$
$$P_t = P_t^{\dagger} - P_t^{\dagger} G_t \left( G_t^T P_t^{\dagger} G_t + R_t \right)^{-1} G_t^T P_t^{\dagger}$$
$$w_t = w_{t-1} + P_t G_t R_t^{-1} \left( \hat{x}_t - x_t \right) \quad (4)$$

Here $x_t$ is the output vector of the SpiralRNN and $\hat{x}_t$ are the data. $P_t$ is the covariance matrix of $w_t$ representing its uncertainty and $Q_t$ is the incremental increase in this uncertainty. $R_t$ is the measurement error covariance matrix. $G_t = \mathrm{d}x_t / \mathrm{d}w_{t-1}$ represents the gradient of the output w.r.t. $w_{t-1}$ (notice the indexing):

$$\frac{\mathrm{d}x_t}{\mathrm{d}w_{t-1}} = \frac{\partial x_t}{\partial w_{t-1}} + \frac{\partial x_t}{\partial s_t}\,\frac{\mathrm{d}s_t}{\mathrm{d}w_{t-1}} \quad (5)$$

$$\frac{\mathrm{d}s_t}{\mathrm{d}w_{t-1}} = \frac{\partial s_t}{\partial w_{t-1}} + \frac{\partial s_t}{\partial s_{t-1}}\,\frac{\mathrm{d}s_{t-1}}{\mathrm{d}w_{t-1}} \approx \frac{\partial s_t}{\partial w_{t-1}} + \frac{\partial s_t}{\partial s_{t-1}}\,\frac{\mathrm{d}s_{t-1}}{\mathrm{d}w_{t-2}} \quad (6)$$

The last line in Equation (6) is the gradient calculated in real time recurrent learning (RTRL) [10]. This approximation is justified if the weight vector is changing slowly, which holds at least for the later phase of the converging learning process.

It is worth mentioning that these iterative Kalman filter equations for the update of expected weight values and their covariance matrix are derived analytically under three reasonable approximations: 1) the distribution of noise on the data is Gaussian, 2) the weights change slowly and 3) the distribution of weights representing their uncertainty is unimodal and also Gaussian. While 1) is a standard assumption, the other two approximations deserve some explanation: 2) is true if the process generating the data is stationary and if the model is powerful enough to reproduce the data; only in the initial phase of learning and for very non-stationary processes might this assumption be questionable. Usually, 3) is only true for a local neighborhood around the true weight vector; in general, the distribution will be multi-modal, implying several competing weight vectors and thus models. However, for complexity reasons we have to focus on one set of weights; we have no guarantee that we get the optimal model, but we will find a "locally" optimal model in the sense that small changes of the weights lead to a decrease in performance. As a consequence we will end up with a distribution of different models in repeated experiments with different initial conditions or noise. The reader should notice that there will always be several degenerate solutions, because a renumbering of
the hidden neurons yields exactly the same dynamical
model but the weight vector is changed; the components
have just been reshuffled. This is actually an advantage
for learning because it is sufficient to reach one of these
degenerate solutions.
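The EKF weight update of Equation (4) can be sketched as follows. This is a simplified dense-matrix illustration with our own variable names and toy dimensions, not the authors' embedded implementation:

```python
import numpy as np

def ekf_step(w, P, G, x, x_hat, Q, R):
    """One EKF weight update as in Equation (4).

    w: weight vector (n,), P: weight covariance (n, n),
    G: gradient d x_t / d w_{t-1}, shape (n, d),
    x: model output (d,), x_hat: observed data (d,),
    Q: process noise (n, n), R: measurement noise (d, d)."""
    P_dag = P + Q                                # P†_t = P_{t-1} + Q_t
    S = G.T @ P_dag @ G + R                      # innovation covariance
    P_new = P_dag - P_dag @ G @ np.linalg.solve(S, G.T @ P_dag)
    w_new = w + P_new @ G @ np.linalg.solve(R, x_hat - x)
    return w_new, P_new

# Tiny illustration: 3 weights, 1-dimensional output
n, d = 3, 1
w = np.zeros(n)
P = np.eye(n)
G = np.array([[0.5], [0.1], [-0.2]])
x, x_hat = np.array([0.0]), np.array([1.0])
Q, R = 1e-8 * np.eye(n), 1e-2 * np.eye(d)
w, P = ekf_step(w, P, G, x, x_hat, Q, R)
```

Note that the covariance update shrinks $P$ in the directions where the gradient provides information, which is what makes the filter converge faster than a plain gradient step.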
2.3 Performance
SpiralRNNs have been compared to other architectures
like ESN, Elman networks and BDRNN [9]. TDNN is
not included due to the need to fix the number of historic
inputs in advance. The generalization ability of these
networks was tested with three artificial time series: A
spike time series with spike intervals of 21 time steps (20
zeros followed by a one), and the chaotic Mackey-Glass
and Lorenz time series. The one-dimensional Mackey-Glass time series is given by

$$\dot{x}(t) = \frac{a\, x(t-\tau)}{1 + x(t-\tau)^c} - b\, x(t) \quad (7)$$

with the parameters a = 0.2, b = 0.1, c = 10 and τ = 17.
The differential equation was solved numerically using
Euler’s method with step size 1. All time series values
before start time were set to 0.5. The time series was
normalized by its standard deviation.
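The generation procedure just described can be sketched as follows; the history-buffer handling for the delay τ is our own reconstruction of the stated setup (Euler's method with step size 1, pre-start values 0.5, normalization by the standard deviation):

```python
import numpy as np

def mackey_glass(n, a=0.2, b=0.1, c=10, tau=17, x0=0.5):
    """Euler integration of Equation (7) with step size 1; all
    values before the start time are set to x0."""
    x = np.full(n + tau, x0)                    # tau-step history buffer
    for t in range(tau, n + tau - 1):
        x_delayed = x[t - tau]
        x[t + 1] = x[t] + a * x_delayed / (1 + x_delayed**c) - b * x[t]
    series = x[tau:]
    return series / series.std()                # normalize by std deviation

series = mackey_glass(2000)
```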
The three-dimensional Lorenz time series is defined
by the following equations:
$$\dot{x} = s\,(y - x) \quad (8)$$
$$\dot{y} = (r - z)\,x - y \quad (9)$$
$$\dot{z} = x\,y - b\,z \quad (10)$$
The parameters were set to s = 16, r = 40, b = 6 and
the initial values were x0 = y0 = 0.1 and z0 = 0.1. Inte-
gration was done using Euler’s method with step size
0.01. The data were finally scaled down by a factor of 20.
All time series are corrupted, after normalization, by
normally distributed noise with a standard deviation of
0.01.
As the computational complexity depends mainly on
the online learning step we compared different networks
with roughly the same number of trainable parameters.
For the activation function in the hidden layer we took $\tanh(\cdot)$ and in the output layer we took the identity map. The blocks in the diagonal of the recurrent weight matrix for the BDRNN were parameterized as

$$\tanh(\rho)\begin{pmatrix} \sin(\phi) & \cos(\phi) \\ \cos(\phi) & -\sin(\phi) \end{pmatrix} \quad (11)$$

with a pair of trainable parameters $\rho$ and $\phi$ independently for each block.
For the ESN model, the recurrent weight matrix was initialized with random numbers uniformly distributed in the interval $[-1, 1]$ with a sparsity degree of 95%; the matrix was rescaled such that the maximum absolute eigenvalue was 0.8. All other weights are initialized with random numbers uniformly distributed in $[-0.1, 0.1]$.
For SpiralRNNs, a value $\beta \le 1/(l-1)$ guarantees that the maximum eigenvalue of the recurrent weight matrix is smaller than one. However, the actual eigenvalues can be much smaller than one. Experiments have shown good and reliable performance with $\beta = 1/(l-1)$.
The online learning method was the same as described above for all networks, with appropriate gradients depending on the architecture. For the Kalman filter, the covariance matrix $P_t$ was initialized with the identity matrix $Id$ and the model noise covariance matrix $Q_t = 10^{-8}\, Id$ was chosen to be constant. The measurement covariance matrix $R_t$ was initialized to $10^{-2}\, Id$ and updated according to

$$R_t = (1 - \zeta)\, R_{t-1} + \zeta\, e_t e_t^T \quad (12)$$

with $e_t = \hat{x}_t - x_t$ the forecast error at time t and $\zeta = 0.01$.
During online learning we used teacher forcing. After
some specific time steps we performed multi-step fore-
cast tests using the output as input for the next time step.
After each test we returned to the beginning of the test
and continued online learning with teacher forcing.
The prediction performance during the tests is measured with the logarithmic normalized mean square error (logNMSE):

$$E = \frac{1}{d}\sum_{i=1}^{d}\frac{1}{t_\Delta}\sum_{t=t_0+1}^{t_0+t_\Delta} \log_{10}\!\left(\frac{\left(x_{i,t}-\hat{x}_{i,t}\right)^2}{\sigma^2}\right) \quad (13)$$

Here, $d$ is the dimension of the data, $t_0$ is the time before the test phase starts, $t_\Delta$ is the maximum number of test steps and $\sigma$ is the standard deviation of the whole data set. This definition of an error measure takes into account that the prediction error in the test phases increases with time; as we do not want the error measure to be dominated by the large errors of the future prediction steps, we have chosen a measure corresponding to a geometric mean instead of a traditional algebraic mean.
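As an illustration, Equation (13) amounts to averaging the base-10 logarithm of the normalized squared errors over dimensions and test steps; a small sketch with our own variable names:

```python
import numpy as np

def log_nmse(x, x_hat, sigma):
    """logNMSE of Equation (13) for predictions x and data x_hat,
    both of shape (d, t_delta); sigma is the data's std deviation."""
    return np.mean(np.log10((x - x_hat) ** 2 / sigma ** 2))

# Example: two-dimensional prediction over 4 test steps with a
# constant error of 0.01 per component
x = np.array([[0.1, 0.2, 0.3, 0.4], [1.0, 1.1, 1.2, 1.3]])
x_hat = x + 0.01
print(log_nmse(x, x_hat, sigma=1.0))   # ≈ -4.0, since (0.01)^2 = 1e-4
```

Averaging logarithms rather than the errors themselves is exactly what makes this a geometric-mean measure.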
For the spike time series we performed tests with a duration of $t_\Delta = 2000$ time steps. For this time series
many forecast models tend to predict a small constant
value close to 1/21. This can give a small error as this
value is close to zero for 20 out of 21 time steps. There-
fore, we evaluated the error measure for this particular
time series only with the time steps when the spikes oc-
curred; this provides a much more informative measure.
Figure 3 shows the results for networks with about
100 trainable weights. Obviously, SRN and SpiralRNN
can predict the spike time series at least after $10^4$ time steps (recall that this corresponds to about 500 spike periods). On the other hand, ESN and BDRNN fail to predict the spike trains properly. Not quite unexpectedly,
SRN, having more degrees of freedom, performs best.
However, the results for SRN include only those runs
which were successful; after longer training a significant
fraction of runs failed due to dynamical instabilities. An
encouraging observation is that SpiralRNN is pretty
close in performance to the successful SRN runs, taking
into account the logarithmic error measure.
Figure 4 shows the results for the Lorenz time series
with tests consisting of 200 time steps. Here, SRN,
BDRNN and SpiralRNN show similar behavior and
good prediction capabilities at least after longer training.
ESN outperforms the other approaches in the beginning
of the training phase. This can be explained by the fact
that the training of its output weights is basically a linear
regression task and the Kalman filter learning is essen-
tially the optimal method. However, ESN can hardly
improve its performance even for long training periods.
Obviously, for a good generalization it is necessary to
adapt the recurrent hidden layer weights as well, a fea-
ture shared by the three other architectures. The interme-
diate increase of the error for all approaches can be ex-
plained with the characteristics of the time series which
obviously explores new state space areas in these phases.
Another interesting result is the test with the Mackey-
Glass time series. Figure 5 shows the result for networks
with about 100 trainable weights and prediction tests for
up to 200 time steps. Here, SpiralRNN clearly outper-
forms the other approaches, even SRN. For smaller net-
work size with about 50 trainable weights, SRN is
slightly better than SpiralRNN, while both approaches
show almost as good behavior as SpiralRNN with about
100 weights; however, ESN and BDRNN perform similarly poorly. An example of the autonomous output of the SpiralRNN model on the Mackey-Glass time series is shown in Figure 6, where the difference between autonomous output and target data can hardly be identified until around time step 400.
2.4 Stability
Stability of a trained SpiralRNN model has been demon-
strated by autonomous tests with deviated initial input
values. Recall that, in autonomous tests, the model uses
the previous output value as the current input value. The
only exception is the initial input value which is supplied
from outside.
Figure 7 shows the autonomous outputs generated by a
trained SpiralRNN model from the simulation with data
set Spike21. Each sub-figure in Figure 7 shows the
autonomous output from a simulation with a different initial input value: ±0.1, ±1, ±10, ±100, respectively.
Without exception, all models with different initial
inputs converge after an initial transition phase to the
trained attractor regime with periodic spike output with
period 21.
Figure 3. Modified logNMSE for the spike time series with period 21. The networks have about 100 trainable weights. The test points correspond to the times $\{10^3, 10^{3.5}, 10^4, 10^{4.5}, 10^5\}$
Figure 4. LogNMSE for the Lorenz time series and prediction tests with 200 time steps. The networks have about 100 trainable weights. The test points correspond to the times $\{10^3, 10^{3.25}, 10^{3.5}, \ldots, 10^{4.75}, 10^5\}$
Figure 6. The autonomous output of a trained SpiralRNN model vs. corresponding target data. The solid line is autonomous output, and the dashed line is target data
Figure 5. LogNMSE for the Mackey-Glass time series and prediction tests with 200 time steps. The networks have about 100 trainable weights. The test points correspond to the times $\{10^3, 10^{3.5}, 10^4, 10^{4.5}, 10^5\}$
(a) -0.1 (b) -1
(c) -10 (d) -100
(e) +0.1 (f) +1
(g) +10 (h) +100
Figure 7. Stability test of a trained SpiralRNN model from the simulation with the Spike21 data set. The sub-figures show the results for different initial input values −0.1, −1, −10, −100, +0.1, +1, +10, +100, respectively. The X-axis is the time in the autonomous tests, and the Y-axis is the output
As another example, Figure 8 depicts the trajectories of
two hidden-neuron activations during the autonomous
test with a trained SpiralRNN model from a simulation
with the Lorenz data set. In the test phase, the SpiralRNN model is iterated autonomously for 1000 time
steps starting with different initial input values. During
the iteration, the recurrent layer neuron states are re-
corded. Sub-figures in Figure 8 show the trajectories of
two neurons. The five initial time steps of the autono-
mous test are marked by black squares. Again, after
some initial transition phase the trajectories always con-
verge to an attractor corresponding to the Lorenz time
series. Note that the sub-figures have different scale in
order to display the initial transition path, and that the
position of the att r a c t o r is i n t h e r a n ge [ 0.2, 0.6] for
the X-axis and [0.2, 0.6] for the Y-axis in all sub-fig-
ures.
3. Application Examples
3.1 NN5 Competition
The NN5 competition1 puts emphasis on computational intelligence methods in data prediction. The data for this competition correspond to the amount of daily money withdrawal from ATM machines across England. These data exhibit strongly periodic behavior2 (see Figure 9):
1http://www.neural-forecasting-competition.com/
2Note that the Easter Fridays in 1996 to 1998 should have the indices "19", "376" and "753" in the given data and the Saturdays before Christmas day of 1996 and 1997 have the indices "283" and "648".
F1 Strong weekly periodic behavior dominates the frequency spectrum, usually with higher values on Thursday and/or Friday;
F2 Important holidays such as the Christmas holidays (including the New Year holiday) and the Easter holidays have a visible impact on the data;
F3 Several of the time series, such as time series No. 93 and No. 89, show strong seasonal behavior, i.e. a yearly period;
F4 Some of the time series (like No. 26 and No. 48) show a sudden change in their statistics, e.g. a shift in the mean value.
The dynamics associated with data have deterministic
and stochastic components. In general, they will not be
stationary, as for example more tourists are visiting this
area or a new shopping mall has opened. There are in
total 111 time series in the database, with each time se-
ries representing the withdrawal from one ATM machine
and each data point in a particular time series indicating
the withdrawal amount of the day from the particular
ATM machine. All 111 time series were recorded from
the 18th March 1996 until 22nd March 1998, and there-
fore contain 735 data points. The task of the competition
is to predict the withdrawal of each ATM machine from
23rd March 1998 to 17th May 1998, 56 data points in
total. The evaluation of prediction performance is based
on the so-called SMAPE error value defined in Equation (14), where $x_t$ and $\hat{x}_t$ are respectively the predicted output and data.
(a) −4, −4, −4 (b) −4, −4, +4
(c) −4, +4, −4 (d) −4, +4, +4
(e) +4, −4, −4 (f) +4, −4, +4
(g) +4, +4, −4 (h) +4, +4, +4
Figure 8. Stability test of a trained SpiralRNN model from simulations with the Lorenz data set. Each sub-figure shows the
trajectory of two hidden neurons during autonomous tests with different initial inputs. The starting point coordinates are
given beneath each sub-figure. The X-axis and the Y-axis respectively refer to activation values of these two hidden neurons.
The first five positions of the trajectory are denoted by black square marks
$$E_{smape} = \frac{1}{n}\sum_{t=1}^{n} \frac{\left|x_t - \hat{x}_t\right|}{\left(x_t + \hat{x}_t\right)/2} \cdot 100\% \quad (14)$$
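A direct transcription of the SMAPE error of Equation (14), in percent (variable names are ours):

```python
import numpy as np

def smape(x, x_hat):
    """SMAPE error of Equation (14), in percent: the absolute error
    normalized by the mean of prediction and data at each step."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    return np.mean(np.abs(x - x_hat) / ((x + x_hat) / 2)) * 100.0

print(smape([10.0, 20.0], [10.0, 10.0]))   # ≈ 33.33
```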
The competition did not require online learning. Nevertheless, we used this competition to test our approach against sophisticated offline learning methods. As prior knowledge was available through the data sets and as computational complexity was not limited for this application, we have extended our approach in the following way:
(a) data were rescaled and additional artificial inputs were added providing periodicity information;
(b) a "committee of experts" was employed in order to reduce the prediction error due to local suboptima;
(c) the EKF learning process was modified taking into account the SMAPE error function.
Figure 9. Sample data from the NN5 competition dataset (four sample time series) with cross markers indicating Saturdays
The data presented to the neural network are mapped
to a useful range by the logarithm function. In order to
avoid singularities due to original zero values, they are
replaced with small positive random values. Additional
sinusoidal inputs are provided as a representation of cal-
endar information. These additional inputs include:
Weekly behavior addressing feature F1: Refer to Figure 10(a); the period is equal to 7.
Christmas and seasonal behavior addressing features F2 and F3: It is often observed from the dataset that, right after the Christmas holiday, withdrawal of money is low and then increases during the year, finally reaching its summit value right before Christmas. Seasonal features do not prevail in the dataset, but they do exist in several of them, e.g. time series 9, 88. As both are regular features with a yearly period, it makes sense to provide an additional input as shown in Figure 10(b) which has period 365.
Easter holiday bump addressing feature F2: The Easter holidays did not have as much impact on the data dynamics as the Christmas holidays did, but it shows an effect on the usage of ATMs in some areas (shown in some time series). Furthermore, as the 56-step prediction interval includes the Easter holidays of year 1998, the prediction over the holiday can be improved when the related data behavior is learnt. This additional input uses a Gaussian-distribution-shaped curve to emulate the Easter holiday bump as in Figure 10(c).
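The three auxiliary inputs described above can be sketched as follows. The Easter-bump position and width used here are illustrative assumptions, not the values used in the competition entry:

```python
import numpy as np

def calendar_inputs(t, easter_index=755, easter_width=7.0):
    """Auxiliary calendar inputs (our reconstruction of Figure 10):
    a weekly sinusoid (period 7), a yearly sinusoid (period 365)
    and a Gaussian-shaped Easter bump."""
    t = np.asarray(t, float)
    weekly = np.sin(2 * np.pi * t / 7)
    yearly = np.sin(2 * np.pi * t / 365)
    easter = np.exp(-0.5 * ((t - easter_index) / easter_width) ** 2)
    return np.stack([weekly, yearly, easter], axis=-1)

# One auxiliary-input vector per day of the 735-day training range
aux = calendar_inputs(np.arange(735))
```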
(a) Weekly-input
(b) Christmas-input
(c) Easter-input
Figure 10. Additional inputs of neural networks. The X-axis shows the time steps, and the Y-axis the additional input value
Figure 11. Online training of SpiralRNN in NN5 competi-
tion with additional inputs
Supplied with additional inputs, SpiralRNNs were trained online (see Figure 11) such that the data were fed to the network one by one and the network parameters were updated before the next data point was presented.
Although SpiralRNN is capable of learning time series
prediction, the learned weights usually correspond to
local minima of the error landscape as mentioned in [11].
As computational complexity is not an issue for this
competition, a committee of experts ansatz is applied.
The committee of experts consists of several Spi-
ralRNN models with identical structure but different
initialization of parameter values. Each SpiralRNN
model operates in parallel without any interference to the
others. They are trained online using teacher forcing. The
importance of an expert's output is determined by the average SMAPE value over the last 56 steps. Therefore, at time step t = 679¹, each model k produces a prediction $x_{k,t}$ for the next 56 steps using its output as the input for the next time step. After that, online learning continues until the end of the training data. Then the SMAPE error value for the last 56 time steps is calculated according to Equation (15), where $E_s$ refers to the SMAPE error:

$$E_k = \frac{1}{56}\sum_{t=680}^{735} E_s\!\left(x_{k,t}, \hat{x}_t\right) \quad (15)$$
1The value is calculated via: t = 735 − 56 = 679.
The final prediction for the 56 time steps following t = 735 consists of a weighted average of the committee members' autonomous predictions according to

$$\lambda = \sum_{k=1}^{n} E_k^{-2} \quad (16)$$

$$x_t = \frac{1}{\lambda}\sum_{k=1}^{n} E_k^{-2}\, x_{k,t} \quad (17)$$
The whole procedure is listed in Table 1, and a sche-
matic diagram Figure 12 depicts the organization of the
committee.
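The committee combination of Equations (16) and (17), i.e. weighting each expert's prediction by the inverse square of its SMAPE error, can be sketched as follows (illustrative values, variable names are ours):

```python
import numpy as np

def committee_prediction(errors, predictions):
    """Weighted committee average of Equations (16)-(17): each expert's
    prediction is weighted by the inverse square of its SMAPE error."""
    weights = 1.0 / np.asarray(errors, float) ** 2
    weights /= weights.sum()                     # normalize by lambda
    return np.tensordot(weights, np.asarray(predictions, float), axes=1)

# Three experts with 56-step predictions (illustrative values)
errors = [10.0, 20.0, 40.0]                      # SMAPE per expert
predictions = np.ones((3, 56)) * np.array([[1.0], [2.0], [4.0]])
final = committee_prediction(errors, predictions)
```

With these values the best expert (error 10) dominates the average, as intended by the inverse-square weighting.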
Table 1. Committee of experts

0. Initialize the n experts;
1. For a SpiralRNN model k, implement online training with the data and make a 56-step prediction Uk at time step t = 679. The prediction value Uk will be compared to the data in order to obtain the average SMAPE error value according to Equation (15).
2. After the prediction at time step t = 679, continue the online training till the end, and produce another 56-step autonomous prediction Vk.
3. Based on their $E_k$ values, combine the predictions Pk according to Equation (17).

Figure 12. Committee of SpiralRNN experts, where their weights in the committee are determined by their SMAPE values on the testing dataset

Figure 13. Comparison between result and data showing weekly behavior for time series 35. Dashed line with circles is the data and solid line with squares is the prediction. The x-axis shows the time in days

As a last step we have to adapt the calculation of the gradient matrix $\partial x_t / \partial w$ in Equation (5) to the error function in Equation (14) according to Equation (20), with $y_t = \ln(x_t)$ being the output and $\hat{y}_t = \ln(\hat{x}_t)$ the data in logarithmic scale:

$$S = \exp(y_t) + \exp(\hat{y}_t) \quad (18)$$

$$D = \exp(y_t) - \exp(\hat{y}_t) \quad (19)$$

$$\frac{\partial E_t}{\partial w} = \mathrm{sign}(D)\,\frac{\exp(y_t)}{S}\left(1 - \frac{D}{S}\right)\frac{\partial y_t}{\partial w} \quad (20)$$
For data sets with missing values we replace these
with predicted values and simultaneously skip the weight
update while still aggregating the gradient.
We have evaluated our approach using the last 56 data
of the available data while the remaining data were used
for online-training. In Figure 13, prediction and data are
compared for time series 35 which has a pronounced
weekly periodicity. Obviously, this periodicity can be
reproduced pretty well, even details like the small bump.
Seasonal behavior of data can also be learned as
shown in Figure 14. The curves in both sub-plots begin
with the values in Christmas holidays (with time indices
around 280 and 650). The rise and fall of the data during
the first about 100 days and a subsequent rise in Figure
14(a) can also be seen one year later in Figure 14(b).
Obviously, the model is able to capture this behaviour
pretty well.
Easter holidays can also be recognized by the trained model, as shown in Figure 15 for time series 110. Since the expert weight in the committee is determined by a period not containing Easter holidays, the prediction for them might be expected to be less accurate. Nevertheless, as shown in Figure 15, the corresponding bump in the data has been learned well. The Easter holidays in 1996 to 1998 occur at times around 20, 375 and 755. In Figure 15, the prediction for the Easter holidays 1998 shows a similar behaviour as the Easter holidays in 1996 and 1997 with a pronounced peak.

(a) seasonal data, previous year (b) seasonal data and prediction

Figure 14. Comparison between prediction and data for time series 9 showing seasonal behavior. Dashed curve is the data and solid curve is the prediction. The x-axis denotes time measured in days. The upper figure shows the same period one year before

Table 2. Statistical results: the committee-averaged SMAPE error value of the expert committee on all 111 time series with varying number of experts

Figure 15. Prediction result (solid curve) and data (dashed curve) for time series 110 showing a peak around Easter. The x-axis displays time in days
Table 2 shows the SMAPE errors on the test set (i.e.
the data from the last 56 time steps) for a varying num-
ber of experts. Obviously, the number of experts has no
significant influence on the average result. This indicates
that the learning process is quite reliable; in addition this
allows saving computational effort using a small com-
mittee. Figure 16 shows the histogram of committee-
averaged SMAPE values over the 111 time series in the
dataset. The majority of the results have a SMAPE value
around 20.
Our approach has won the time series forecasting competition at WCCI 2008¹ for both the full data set and a reduced data set consisting of 11 time series. Among the competitors was also commercial software currently implemented in the ATMs. Although a few sophisticated statistical and neural network approaches, which were not taken into account for the awards, performed better, the final results for the top 10 performing approaches are quite close. The most important result for our approach is that even with a remarkably simple extension our online learning approach is competitive with world class statistical and neural network solutions.
3.2 Mouse Tracking — A Toy Example
Besides using artificial data we also made experiments
1http://www.wcci2008.org/
with real-life data. For this purpose we recorded the position of a computer mouse moved along a periodic trajectory by a person and trained a SpiralRNN online with these data. The background for this experiment is, for example, applications in logistics and factory automation where the motion of transported goods or robot arms has to be tracked and analyzed.
For the mouse tracking we did some preprocessing of the data in order to become independent of the velocity of the mouse. For this purpose we took the path length of the trajectory as the "time" variable and interpolated the position at equidistant length intervals. These positions, after a suitable rescaling, are used as input for the SpiralRNN during online training, which tries to learn to predict the next position on this discretized trajectory. Figure 17 illustrates this discretization. Figure 18 shows one particular trial. The blue line is generated by a user moving the mouse along a lying figure eight. After about thirteen iterations of this pattern the online training is stopped and the SpiralRNN is fed with its previous output, thus generating the red trajectory. Obviously, the SpiralRNN has learned to reproduce the full trajectory. The mean square error (MSE) is normalized to the variance of the time series. The reader may notice that this pattern is nontrivial in the sense that a certain amount of short-term memory is necessary in order to select the correct direction at the crossing.
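The path-length preprocessing described above can be sketched as follows. The function name and the linear interpolation between recorded samples are illustrative choices, not the authors' code:

```python
import math

def resample_by_arclength(points, step):
    """Re-parameterize a 2-D trajectory by path length: return points
    interpolated at equidistant length intervals of size `step`, so the
    result is independent of the speed at which the mouse was moved."""
    # cumulative path length at each recorded point
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    out, target, i = [points[0]], step, 0
    while target <= cum[-1]:
        # advance to the segment containing the target path length
        while cum[i + 1] < target:
            i += 1
        seg = cum[i + 1] - cum[i]
        t = (target - cum[i]) / seg if seg > 0 else 0.0
        (x0, y0), (x1, y1) = points[i], points[i + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
        target += step
    return out
```

For the experiment in the text the interval would be the constant boundary-point spacing of 0.02 shown in Figure 17.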
Such a pattern produces 2-dimensional periodical time
series as targets for the neural networks. For this example we have also analyzed the frequency spectrum of the data and of the predicted outputs of different neural network models. The result is given in Figure 19, with blue dashed and green solid lines for the two time series, respectively. The SpiralRNN produces autonomous output trajectories with a frequency spectrum similar to that of the target data (compare Figures 19(b) and 19(a)), while the other models cannot.
Figure 16. Histogram (over 111 time series) of the committee-averaged SMAPE error
Figure 17. Segmentation by boundary points in mouse tracking. The hollow arrow indicates the movement direction, the dashed line is the movement path of the cursor, circles represent the positions measured by the computer every 1/100 second, and black dots mark the boundary points that segment the movement path of the cursor. The length between consecutive boundary points is a constant 0.02. The positions of these boundary points are taken as the training data at each "time" step
Figure 18. Mouse tracking example: the noisy line in the upper figure is the trajectory generated by the user; the smooth line is the test with the SpiralRNN trained online and iterated with its own output as next input. The lower figure displays the logarithm of the normalized one-step prediction mean square error, which decays during training
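The frequency-spectrum check described above can be sketched with a plain discrete Fourier transform. The helper below is illustrative: it reports the strongest nonzero frequency of one scalar component of a trajectory, which can be compared between target data and autonomous model output:

```python
import math, cmath

def dominant_frequency(signal):
    """Return the nonzero frequency (in cycles per step) with the
    largest DFT magnitude; the zero-frequency (mean) component is
    skipped, as in the diagrams of Figure 19."""
    n = len(signal)
    best_k, best_mag = 1, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k / n

# small demo with period 13; the data in the text has period 130,
# giving a dominant frequency of 1/130 = 0.008
demo = [math.sin(2 * math.pi * t / 13) for t in range(39)]
print(dominant_frequency(demo))  # → 3/39 = 1/13 ≈ 0.0769
```

A model that has captured the dynamics should reproduce the dominant frequency of the target; a model stuck in a fixed point shows no pronounced nonzero peak.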
If the training time was too short, the network usually runs into the closest of a set of fixed points; two of them are usually located within the two loops of the figure eight. This experience is confirmed with more complex patterns, where the required training time increases correspondingly. One reason is the usually large amount of noise due to manually moving the mouse. This noise prevents establishing a large enough short-term memory, which may be required for correct prediction, and favors the development of fixed-point attractors.
Sharp turns in a trajectory are also difficult to reproduce; typical examples are triangles or squares. Even if the network has learned to reproduce such a pattern, it always produces smooth turns, and it takes much more additional training to sharpen these turns.
Figure 19. Frequency analysis of the output trajectories after 5000 training steps of different neural network models with around 100 parameters each: (a) target, (b) SpiralRNN, (c) ESN, (d) BDRNN. The value at frequency zero is trivial and is therefore omitted from the diagrams; the period length of the data is 130, so the dominant frequency is 1/130 ≈ 0.008
3.3 Duty Cycle Adaptation in Wireless Sensor
Networks
In energy-autarkic wireless sensor networks, achieving a long lifetime requires the wireless sensors to switch off individual components as often as possible. Most energy is used by the transceiver, in transmit mode as well as in receive mode. Therefore, sensor nodes organize their activity in duty cycles consisting of an active period and a passive period with most components switched off. The longer the passive period, the smaller the duty cycle and the longer the lifetime of the sensor node.
One way of reducing the duty cycle effectively is to skip listening to neighboring sensor nodes for their data and to use predicted values instead. Likewise, a sensor node does not have to transmit its own sensor values to its neighbors in every interval. This implies that sensor nodes can increase their inactivity period, thus reducing the duty cycle. Of course, the prediction error will also increase with decreasing duty cycle. Therefore, there will be an optimal choice balancing prediction error and energy saving.
We analyzed this application example using a simula-
tion of a room equipped with 10 temperature sensor
nodes and 4 heat sources as shown in Figure 20. The
varying strength of the heat sources was modeled by a
4-dim. chaotic Lorenz time series with 6000 time steps; a
short interval is shown in Figure 21. The diffusion of
heat from the heat sources into the room was modeled by
a simple diffusion model, discretized in space and time,
with random additive noise. An example for the resulting
sensor values is shown in Figure 22.
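The chaotic driving signal can be sketched as follows. The paper's exact 4-dimensional construction is not specified, so this illustration integrates the classic 3-variable Lorenz system with explicit Euler steps; step size, initial condition and parameters are conventional choices, and the heat-source strengths would be obtained by rescaling components of this trajectory:

```python
def lorenz_series(n_steps, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Integrate the classic Lorenz system with explicit Euler steps
    and return the trajectory as a list of (x, y, z) tuples."""
    x, y, z = 1.0, 1.0, 1.0
    out = []
    for _ in range(n_steps):
        dx = sigma * (y - x)
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        out.append((x, y, z))
    return out

series = lorenz_series(6000)  # same length as the simulation in the text
```

Despite being deterministic, such a series never repeats exactly, which makes it a demanding test signal for online prediction.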
Each of the 10 sensor nodes is equipped with a Spi-
ralRNN supposed to predict the sensor node’s measured
temperature value and those of the neighboring nodes.
The SpiralRNN had 100 trainable weights for this
evaluation. The neighbor nodes are those nodes which
have a direct communication link to the sensor node (see
Figure 20). In regular intervals, the sensor nodes broad-
cast their current sensor value to their neighbors. One
complete exchange involving all 10 sensor nodes corre-
sponds to one time step in the simulation.
In order to reduce the duty cycle, the sensor nodes do not listen to their neighbors in every second time step. This saves energy for communication: with a normal duty cycle, a sensor node with k neighbors has to run its transceiver 2(k+1) times in two intervals; with the reduced duty cycle it runs 2+k times in two intervals, a reduction by a factor of (k+2)/(2k+2). With just one neighbor this is a factor of 0.75, with 4 neighbors it is 0.6, with 9 neighbors already 0.55, quite close to the large-k limit of 0.5.
Figure 20. Simulation environment with sensors (numbered circles) and heat sources (stars). Also shown are the communication links between sensor nodes
Figure 21. Example time series of the heat sources. Only one period is shown; similar patterns occur in other periods
Figure 22. Example of measured temperature values at the different sensors. Only one period is shown; similar patterns occur in other periods
Figure 23. Comparison of the relative prediction error for the normal duty cycle (continuous line, circles) and the reduced duty cycle (dotted line, squares; listening only every second interval). Shown are the mean, minimum and maximum of the rms error over 30 runs
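The reduction factors quoted above follow from a simple count of transceiver activations. The small helper below generalizes this count to listening only every m-th interval, which also covers the every-fourth-interval case discussed later:

```python
def reduction_factor(k, m):
    """Fraction of transceiver activations left when a node with k
    neighbors listens only every m-th interval: over m intervals it
    transmits m times and receives its k neighbors' values once,
    versus m*(k+1) activations with the normal duty cycle."""
    return (m + k) / (m * (k + 1))

# values quoted in the text
print(reduction_factor(1, 2))  # 0.75
print(reduction_factor(4, 2))  # 0.6
print(reduction_factor(9, 2))  # 0.55
print(reduction_factor(4, 4))  # 0.4
print(reduction_factor(9, 4))  # 0.325
```

The large-k limit of the factor is 1/m, i.e. 0.5 for listening every second interval and 0.25 for every fourth.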
However, the picture is not complete without looking
at the accuracy of the predicted instead of measured
sensor values. We performed 30 independent runs with
randomly chosen locations of sensors and heat sources,
random scaling of each of the heat sources with a factor
in the range [0,1] and additive noise of amplitude 0.01 at
each update of the heat diffusion on every grid point.
The result for the relative error, normalized for each sensor to the variance of its sensor value time series, is shown in Figure 23: for up to 5 prediction steps the performance with the reduced duty cycle is almost the same as with the normal duty cycle. The outliers with a normalized error larger than one are due to sensors with an almost constant sensor value whose variation is mainly due
to noise. This result indicates that there is even more room for duty cycle reduction in this example: listening to the neighbors only every fourth interval yields a duty cycle reduction factor of (k+4)/(4k+4); with 4 neighbors this gives a reduction factor of 0.4, with 9 neighbors it gives 0.325, and the large-k limit is 0.25.
4. Conditional Prediction
So far, the SpiralRNN has turned out to be excellent at online learning of the dynamics of stationary processes. However, in reality we often have to deal with non-stationary
processes. A typical example is cars approaching a crossing: without knowledge of their destination, one can usually predict their approach to the crossing and their departure from it, but not whether and in which direction they turn at the crossing. Even our daily activities consist of sequences of typical patterns such as walking, standing, talking, eating and drinking. Again, without additional knowledge about the motivation of a person it is usually impossible to predict the sequence of these patterns, even though each pattern itself can be predicted very well.
4.1 Application Example: Warehouse Logistics
We consider an application example from the logistics
domain: A warehouse stores goods of different type in
different areas (refer to Figure 24). Newly delivered
goods have to be transported to these areas by trucks; we
assume that these trucks always use the same trajectory
to a specific destination. Each truck is equipped with a
SpiralRNN learning the different trajectories. However,
different from the previous examples, we assume that additional information is available about the type of product; for example, this could be provided by an RFID reader at the entrance, registering newly delivered goods
carrying RFID tags with product information. After having learned the trajectories together with this additional information, the SpiralRNN receives only this additional information and has to recall the trajectory. This can be used for diagnosis purposes (e.g. is it the correct trajectory?) or for autonomous truck driving.
In addition to the current position coordinates, the input vector of the SpiralRNN now also contains an additional trigger signal [12]. The network output is trained to predict the changes in this vector. At the occurrence of a trigger signal, the hidden state of the network is reset to zero and the input vector consists of an arbitrary but otherwise fixed position vector and of the trigger signal; the latter is kept at this initial value until another trigger signal indicates completion of the trajectory or the start of a new one. At subsequent steps, the network, while in online learning mode, learns to predict the change in position and the change in the trigger signal; the latter is actually zero. In recall mode, the previous output added to the previous input, together representing the new position and trigger signal, is fed back as input for the next time step.
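The input/target encoding described above can be sketched as follows. The function and the default start position are illustrative; the paper only requires the start vector to be arbitrary but fixed, and the trigger value to stay constant along one trajectory:

```python
def make_training_sequence(trajectory, trigger_value, start_pos=(0.0, 0.0)):
    """Build (input, target) pairs for one trajectory: the first input
    holds the fixed arbitrary start position plus the trigger value,
    and each target is the change of the input vector; the trigger
    change is zero, since the trigger stays constant."""
    inputs, targets = [], []
    prev = (start_pos[0], start_pos[1], trigger_value)
    for (x, y) in trajectory:
        inputs.append(prev)
        targets.append((x - prev[0], y - prev[1], 0.0))
        prev = (x, y, trigger_value)
    return inputs, targets
```

In recall mode the same arithmetic runs in reverse: adding the network's predicted change to the previous input yields the next input, so the trajectory unrolls from the trigger signal alone.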
Figure 24. Application example "warehouse": depending on their type, newly delivered goods have to be stored in different areas labeled A-F. The trajectories used by the trucks are shown as well
Figure 25. Conditional prediction of the trajectory to area A (circles). The starting point of the trajectory is at [0.4, 0]
Figure 26. Conditional prediction of trajectory to area B (circles). The starting point of the trajectory is at [0.4,0].
Figure 27. Conditional prediction of trajectory to area C (circles). The starting point of the trajectory is at [0.4,0].
Figure 28. Conditional prediction of trajectory to area D (circles). The starting point of the trajectory is at [0.4,0].
Figures 25-28 show the performance of a trained SpiralRNN in reproducing the trajectories to the areas A, B, C and D given just the initial trigger signal. The network had 200 trainable weights, and each trajectory was presented 20 times, until the root mean square error (RMSE) for each trajectory was below a threshold of 0.05, i.e. half a grid spacing. Obviously, the trajectories have been learned quite well.
4.2 Storage Capacity
Encouraged by the good performance of this approach,
we investigated the storage capacity of SpiralRNN in
dependence on the number of hidden neurons and the
length of the pattern time series. For this purpose we created random two-dimensional trajectories, each with a different, randomly chosen trigger signal value. We considered trajectories of 6 and 11 steps. The performance of a trajectory prediction is evaluated, as above, as the MSE over the whole trajectory. We consider a number of trajectories trainable if, within at most 30 trials each consisting of 50 presentations of each trajectory, the RMSE over all trajectories is at least once below the threshold of 0.05. The whole process consists of the following 5 steps:
1) The experiment starts with small networks (the value of nh starts from 6, such that the network comprises 3 hidden units and each hidden unit has 2 hidden nodes); the training data is the combination of np = 2 trajectory patterns.
2) For each pair (nh, np), at most 30 simulations are performed, each of which contains 50 training rounds. In each training round, the learning model is trained online with a presentation of these np trajectory patterns in a shuffled sequence, each pattern only once. The hidden-state vector of the network is reset before the training with any trajectory pattern starts. After every 10 training rounds, a testing round tries to reproduce all np trajectories separately in an autonomous manner.
3) The autonomous outputs are evaluated. Let lp be the length of the trajectory patterns (the lp value is identical for all patterns), xi,k the ith-step-ahead prediction for the kth pattern, and x̂i,k the respective target; the evaluation error ε is calculated as:

ε = 1/(np·lp) · Σ_{i=1..lp} Σ_{k=1..np} (xi,k − x̂i,k)²   (21)

4) If ε ≤ 0.05² is satisfied, this particular SpiralRNN with nh hidden nodes is claimed to be capable of reproducing np trajectory patterns of length lp, and thus the associative memory capacity satisfies Ma ≥ np. The model (with the same structure but re-initialized values) is then assessed with training data containing more patterns, i.e. a higher np value. If ε ≤ 0.05² is not satisfied at least once within the 30 simulations, it is claimed that a SpiralRNN with nh hidden nodes is not able to reproduce np trajectory patterns of length lp. A new experiment then starts with the same np value but a larger SpiralRNN model, obtained by increasing nh by 3. Note that the threshold value is set to 0.05, half the grid distance of 0.1, so that the prediction can be rounded to the nearest grid point if the error is less than 0.05.
5) The experiments continue until the simulations with nh = 60 are finished.
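The five steps can be sketched as a search loop. Here `train_and_test` is a placeholder for the actual SpiralRNN training, treated as a black box: one call stands for one full simulation (50 training rounds of a freshly initialized network with nh hidden nodes on np shuffled patterns) and returns the smallest evaluation error ε of Equation (21) seen in its testing rounds:

```python
def capacity_curve(train_and_test, max_trials=30, threshold=0.05 ** 2):
    """Sketch of the capacity-search protocol (steps 1-5): for each
    network size nh, increase the number of patterns np as long as
    some trial reaches eps <= 0.05**2; otherwise grow the network."""
    results = {}        # nh -> largest trainable np (memory capacity Ma)
    nh, np_pat = 6, 2   # step 1: smallest network, two patterns
    while nh <= 60:     # step 5: stop after nh = 60
        # steps 2 and 4: up to max_trials independent simulations
        trainable = any(train_and_test(nh, np_pat) <= threshold
                        for _ in range(max_trials))
        if trainable:
            results[nh] = np_pat   # Ma >= np for this network size
            np_pat += 1            # retry with one more pattern
        else:
            nh += 3                # step 4: same np, larger network
    return results
```

Plugging a synthetic oracle with capacity 0.5·nh into this loop reproduces the expected linear nh-to-np relationship, which is how Figures 29 and 30 are read.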
Figures 29 and 30 show the dependence of the number of trainable patterns on the number of hidden neurons. The graphs show an almost linear dependence: for short patterns of length 6 we obtain a ratio np/nh = 0.64 ± 0.04, and for long patterns of length 11 we obtain np/nh = 0.31 ± 0.02.
5. TinyOS Implementation
Finally, we report on an implementation of SpiralRNN on a typical wireless sensor platform. One example of such a platform is the TelosB [13] sensor mote from the University of California at Berkeley. It provides sensors for temperature, light and humidity. A microprocessor (Texas Instruments MSP430, 16 bit, 8 MHz, 48 kB ROM, 10 kB RAM) runs software for communication, data processing and energy management. A wireless transceiver (Chipcon CC2420, IEEE 802.15.4, 2.4 GHz, 250 kbps) provides communication links to other sensor motes or to a gateway.
Figure 29. Pattern storage capacity of SpiralRNN with different numbers of hidden nodes, for patterns of length 6
Figure 30. Pattern storage capacity of SpiralRNN with different numbers of hidden nodes, for patterns of length 11
Figure 31. Structure of a typical smart sensor mote. TinyOS is an operating system for embedded systems such as smart sensor motes and provides interfaces between hardware and software; applications are written in the NesC language and compiled against TinyOS
We have deployed TinyOS, a small-footprint operating system developed at the University of California at Berkeley, on such motes. Programs developed for TinyOS are written in the programming language NesC [14]. Figure 31 shows the typical software architecture of a sensor mote. It allows implementing even advanced functionality like analysis and learning on such embedded devices.
In order to accelerate task execution, we replaced the floating-point operations by suitable fixed-point operations. Thus we achieved a runtime for one online learning iteration of the SpiralRNN with a single input/output node and 6 hidden nodes (24 parameters in total) of roughly 300 milliseconds, a reduction by about a factor of 8 compared to a floating-point implementation. Figure 32 shows a comparison of the floating-point model and the fixed-point model with respect to processing time and memory requirement (RAM) on a TelosB sensor mote (both models have a single input and output neuron). The limited memory capacity (10 kB) of such a sensor mote prevents deployment of floating-point model images with more than 36 parameters, or running them with more than 32 parameters. In contrast, the fixed-point model takes less than one second per online learning step for networks with up to 40 weight parameters (i.e. 10 hidden nodes). Similarly, the fixed-point model requires less RAM, allowing work with more complex neural networks.
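The idea behind the fixed-point replacement can be illustrated with a Q16.16 format; the number of fractional bits actually used on the mote is not stated in the text, and on the 16-bit MSP430 the values would live in fixed-width integers rather than Python's arbitrary-precision ones:

```python
Q = 16  # fractional bits; an illustrative choice, not the mote's format

def to_fix(x):
    """Encode a float as a Q16.16 fixed-point integer."""
    return int(round(x * (1 << Q)))

def fix_mul(a, b):
    """Multiply two fixed-point numbers; the raw product carries 2*Q
    fractional bits, so it is shifted back by Q. (Python's >> floors,
    which introduces a small bias for negative operands.)"""
    return (a * b) >> Q

def to_float(a):
    """Decode a Q16.16 fixed-point integer back to a float."""
    return a / (1 << Q)

# example: 1.5 * 2.25 = 3.375, exactly representable in Q16.16
print(to_float(fix_mul(to_fix(1.5), to_fix(2.25))))  # → 3.375
```

Replacing each floating-point multiply-add by an integer multiply, shift and add is what yields the roughly eightfold speedup on a microcontroller without a floating-point unit.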
It is also interesting to compare the processing time of
the pure feed-forward step with the complete online
learning step. Figure 33 compares two fixed-point mod-
els with feed-forwarding (FFD) only and with full online
learning (FULL) including feed-forwarding, gradient
calculation and extended Kalman filter update. Obvi-
ously, the full online learning step requires much more
time; while a pure feed-forward step requires only 11ms
for 48 weight parameters, the full online learning step
takes about 1500ms for the same network. As a conse-
quence, one can reduce the number of learning steps
when the neural network is well trained and thus save
quite some energy.
We close this section with a comparison of a theoretical estimate of the computational complexity with the measured times for one update step. Measured in multiples of the summed effort for an average elementary fixed-point or floating-point operation, consisting of a multiplication, an addition and potentially some copying, the cost of one feed-forward iteration is proportional to nh² + (2d+1)nh, with nh the number of hidden neurons and d the dimension of the input and output. This can be expressed in terms of nw = (2d+2)nh + d − 1, the number of weights, leading to nh = (nw − d + 1)/(2d+2). The effort for the derivative calculation is proportional to nh(nh + d)nw, and the effort for the extended Kalman filter updates is proportional to d·nw² + d²·nw + d³. Although the complexity of the matrix inversion in Equation (4) scales as d³, in practice this is not the dominant contribution, as d ≪ nw can be assumed. The dominant contribution to the full online learning step comes from the term nh²·nw ≈ nw³/(2d+2)² of the gradient calculation and the Kalman filter update. For the pure feed-forward step the leading term is nh² ≈ nw²/(2d+2)², i.e. a factor nw smaller than for the full online learning step.
These estimates are compared with the measured processing times in Figure 34. Here, the processing times are plotted against the estimated effort for d = 1 and varying nw. The linear relationships and the very similar proportionality coefficients confirm the theoretical estimates. As a by-product we obtain a typical processing time per elementary operation of about 0.063 ms.
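The leading-order estimates can be turned into a small calculator. The polynomial coefficients below follow the estimates in the text and are approximations, not exact operation counts:

```python
def weights(nh, d):
    """Number of trainable weights: nw = (2d+2)*nh + d - 1."""
    return (2 * d + 2) * nh + d - 1

def feedforward_ops(nh, d):
    """Leading-order elementary operations of one feed-forward step."""
    return nh * nh + (2 * d + 1) * nh

def full_step_ops(nh, d):
    """Leading-order operations of a full online learning step
    (gradient calculation plus EKF update); dominant term nh**2 * nw."""
    nw = weights(nh, d)
    return nh * nh * nw + d * nw * nw

MS_PER_OP = 0.063  # measured time per elementary operation (see above)

print(weights(6, 1))   # → 24, the parameter count quoted for 6 hidden nodes
print(feedforward_ops(12, 1) * MS_PER_OP)   # ≈ 11.3 ms; cf. the 11 ms
                                            # measured for nw = 48
print(full_step_ops(12, 1) / feedforward_ops(12, 1))  # ≈ 51, roughly the
                                                      # factor nw = 48
```

The ratio of full-step to feed-forward cost growing like nw is what motivates skipping learning steps once the network is well trained.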
(a) Processing time
(b) RAM requirement
Figure 32. Comparison of processing time and RAM requirement between floating-point and fixed-point models installed on TelosB sensor motes, for different numbers of model parameters
Figure 33. Comparison of processing time between models with feed-forward functionality only and with full functionality (including feed-forwarding, gradient calculation and extended Kalman filter learning)
6. Conclusions
SpiralRNNs have shown excellent prediction performance and stability during online learning, making them a good candidate for application in autonomous systems. A first example is wireless sensor networks, where SpiralRNNs can be used to reduce duty cycles and thus energy consumption, leading to an increased battery lifetime.
Another interesting and not so rare use case is conditional prediction. Here, different temporal patterns are accompanied by an additional constant signal which can be derived from an initial trigger signal. This approach allows training different patterns reliably, provided a trigger signal is available. Such trigger signals can also be used to suspend online training, e.g. in periods with just noise on the input.
We have analyzed the storage capacity of SpiralRNN in conditional prediction tasks. It turned out that the number of successfully trained temporal patterns is proportional to the number of hidden neurons. This result is qualitatively similar to the memory capacity of Hopfield networks [15]. As expected, the storage capacity for longer patterns is smaller than for shorter patterns. Further investigations might clarify whether there is a typical storage capacity per hidden node and pattern length.
Finally, we have reported on an embedded system implementation of SpiralRNN on a typical wireless sensor platform. A significant reduction of the runtime has been achieved by applying fixed-point arithmetics. This allows developing solutions based on online learning of dynamics with reasonably complex models. A typical example is duty cycle adaptation in wireless sensor networks.
(a) Feedforward
(b) Full online learning step
Figure 34. Processing time as a function of the theoretical estimate of computational complexity for the fixed-point implementation: feedforward (top) and full online learning step (bottom). The linear behavior and the almost identical proportionality factors of 0.063 ms/operation indicate the validity of the theoretical estimate
Appendix
The online learning rule can be derived from a Bayesian approach for the probability density f(w_t | X̂_t) of the weights w given all previous data X̂_t = {x̂_t, x̂_{t−1}, ..., x̂_0}.
We start with a generic dynamic system:

s_t = H(s_{t−1}, x̂_{t−1}, w_t)   (22)
x̂_t = G(s_t, w_t, x̂_{t−1}) + ν_t   (23)
w_t = w_{t−1} + η_t   (24)

Equation (22) describes the dynamic evolution of the system by a (hidden) state vector s_t at time t. The value of s_t depends in some, usually nonlinear, way on the previous hidden state vector s_{t−1}, on the previous observation vector x̂_{t−1} and on the model parameters w_t. Equation (23) describes the measurement process: x̂_t is the vector of observations (sensor data) at time t, and G(...) is the model estimate of these observations based on s_t, w_t and x̂_{t−1}. The variable ν_t is the measurement noise vector, assumed to be normally distributed with probability distribution function (p.d.f.) f(ν_t) = N_{ν_t}(0, R_t). Finally, Equation (24) describes the evolution of the model parameters: the dynamics of the environment within a reasonable time window are assumed to be stationary, i.e. the model parameters are static up to additive random fluctuations η_t; these are assumed to be normally distributed with zero mean and covariance matrix Q_t, and thus with corresponding p.d.f. f(η_t) = N_{η_t}(0, Q_t).
Using the Bayes rule, the chain rule and the Chapman-Kolmogorov Equation (16), the conditional p.d.f. f(w_t | X̂_t) for the model parameters at time t, given the current and all previous observations X̂_t = {x̂_t, x̂_{t−1}, ..., x̂_0}, is given by the following equations:

f(w_t | X̂_t) = f(x̂_t | w_t, X̂_{t−1}) f(w_t | X̂_{t−1}) / f(x̂_t | X̂_{t−1})   (25)
f(x̂_t | w_t, X̂_{t−1}) = ∫ f(x̂_t | s_t, w_t, X̂_{t−1}) f(s_t | w_t, X̂_{t−1}) ds_t   (26)
f(w_t | X̂_{t−1}) = ∫ f(w_t | w_{t−1}, X̂_{t−1}) f(w_{t−1} | X̂_{t−1}) dw_{t−1}   (27)

Equation (25) is the famous Bayes rule, Equation (26) introduces the hidden-state vector s_t into the "evidence" p.d.f., and Equation (27) is the "prior" p.d.f. for the model parameters w_t. The last equation introduces the conditional p.d.f. f(w_{t−1} | X̂_{t−1}), which allows interpreting Equations (25) to (27) as iterative update equations for f(w_t | X̂_t).
Before the online training rules are derived from these update equations, one has to introduce some assumptions and approximations for the specification of each p.d.f.:

(Sa) f(w_{t−1} | X̂_{t−1}) = N_{w_{t−1}}(w̄_{t−1}, P_{t−1}) is assumed to be normally distributed with mean w̄_{t−1} and covariance matrix P_{t−1}. In general this p.d.f. will be multi-modal; however, if the environment does not change its dynamics too fast and if the model is powerful enough, then the model parameters should become static and this assumption is justified.
(Sb) f(w_t | w_{t−1}, X̂_{t−1}) = N_{w_t}(w_{t−1}, Q_t), according to the assumption made in Equation (24).
(Sc) f(s_t | w_t, X̂_{t−1}) = δ(s_t − H(H(..., x̂_{t−2}, w_t), x̂_{t−1}, w_t)), with the dots indicating an iteration of the function H(.) over the previous hidden states s_{t−1}, s_{t−2}, ... according to Equation (22). The Dirac delta function δ(.) corresponds to a normal distribution with infinitesimally small covariance matrix and reflects the assumption that any uncertainty comes from the measurements and from the (random) changes of the model parameters.
(Sd) f(x̂_t | s_t, w_t, X̂_{t−1}) in Equation (26) is normally distributed with mean value G(s_t, w_t, x̂_{t−1}) and covariance matrix R_t according to Equation (23), i.e. f(x̂_t | s_t, w_t, X̂_{t−1}) = N_{x̂_t}(G(s_t, w_t, x̂_{t−1}), R_t).

With the specifications (Sa) and (Sb), Equation (27) is a convolution of normal distributions, and thus f(w_t | X̂_{t−1}) can be easily calculated:

f(w_t | X̂_{t−1}) = N_{w_t}(w̄_{t−1}, P̄_t),   P̄_t = P_{t−1} + Q_t   (28)

Given the delta-function-like p.d.f. f(s_t | w_t, X̂_{t−1}) in specification (Sc) and the p.d.f. f(x̂_t | s_t, w_t, X̂_{t−1}) in specification (Sd), one finds:

f(x̂_t | w_t, X̂_{t−1}) = N_{x̂_t}(G(s_t(w_t), w_t, x̂_{t−1}), R_t)   (29)
s_t(w_t) = H(H(..., x̂_{t−2}, w_t), x̂_{t−1}, w_t)   (30)

The same assumption as made in specification (Sa) for f(w_{t−1} | X̂_{t−1}), namely the slow change of the model parameters w_t once the model has been well trained, justifies another approximation: the function G(.) can be linearized with respect to its dependence on the model parameters w_t. In particular,

G(s_t, w_t, x̂_{t−1}) ≈ x̄_t + g_t (w_t − w̄_{t−1})   (31)

where the output vector x̄_t and the gradient matrix g_t are respectively defined as:

x̄_t = G(s̄_t, w̄_{t−1}, x̂_{t−1})   and   g_t = dG/dw_t |_{w_t = w̄_{t−1}}   (32)

Finally, the definition of s̄_t is given here. According to Equation (30) and Equation (31), s̄_t should be the hidden state vector calculated with the previous model parameters w̄_{t−1}. However, this would require a recalculation of the whole history whenever the model parameters change, and is therefore not a viable solution for online learning in a distributed sensor application. Instead, the approximation in Equation (33) is applied, which implies that the model takes s̄_t as the flawless hidden value:

s̄_t = H(H(..., x̂_{t−2}, w̄_{t−2}), x̂_{t−1}, w̄_{t−1})   (33)

With the approximations in Equations (31), (32) and (33), the calculation of f(x̂_t | w_t, X̂_{t−1}) in Equation (29) can be rewritten as:

f(x̂_t | w_t, X̂_{t−1}) ≈ N_{x̂_t}(x̄_t + g_t (w_t − w̄_{t−1}), R_t)   (34)

Combining Equation (28) and Equation (34), the p.d.f. f(w_t | X̂_t) in Equation (25) ends up a normal distribution, Equation (35). Note that P̄_t is computed in Equation (28) and g_t is the gradient matrix:

f(w_t | X̂_t) = N_{w_t}(w̄_t, P_t)   (35)
w̄_t = w̄_{t−1} + P_t g_t^T R_t^{−1} (x̂_t − x̄_t)   (36)
P_t = P̄_t − P̄_t g_t^T (g_t P̄_t g_t^T + R_t)^{−1} g_t P̄_t   (37)

Equation (35) is consistent with specification (Sa), whilst Equation (36) and Equation (37) constitute the extended Kalman filter (EKF) (for more details see [16,17,18]).
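Equations (28), (36) and (37) can be sketched for the simplest case of a scalar observation (d = 1), where the matrix inversion in Equation (37) reduces to a scalar division. The isotropic treatment of Q_t (the same variance added to every diagonal element of P) is a simplifying assumption of this sketch:

```python
def ekf_step(w, P, g, x_obs, x_pred, R, Q):
    """One extended Kalman filter weight update for a scalar
    observation: w is the weight vector (list), P its covariance
    (list of lists), g the gradient row vector dG/dw, and R, Q
    scalar noise variances."""
    n = len(w)
    # Equation (28): P_bar = P + Q (Q assumed isotropic)
    Pb = [[P[i][j] + (Q if i == j else 0.0) for j in range(n)]
          for i in range(n)]
    # innovation covariance s = g P_bar g^T + R (a scalar for d = 1)
    Pg = [sum(Pb[i][j] * g[j] for j in range(n)) for i in range(n)]
    s = sum(g[i] * Pg[i] for i in range(n)) + R
    # Equation (37): P = P_bar - P_bar g^T s^{-1} g P_bar
    P_new = [[Pb[i][j] - Pg[i] * Pg[j] / s for j in range(n)]
             for i in range(n)]
    # Equation (36): w = w + P g^T R^{-1} (x_obs - x_pred)
    PNg = [sum(P_new[i][j] * g[j] for j in range(n)) for i in range(n)]
    w_new = [w[i] + PNg[i] * (x_obs - x_pred) / R for i in range(n)]
    return w_new, P_new

# one update with a single weight: the Kalman gain of 0.5 halves the error
w1, P1 = ekf_step([0.0], [[1.0]], [1.0], x_obs=1.0, x_pred=0.0, R=1.0, Q=0.0)
print(w1, P1)  # → [0.5] [[0.5]]
```

Note that P g^T R^{−1} in Equation (36) equals the usual Kalman gain P̄ g^T s^{−1}, so this form is equivalent to the textbook update.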
REFERENCES
[1] T. Schott, "Global megatrends and their effects on production of the future," in Conference Proceedings, 5th International Conference on Industrial Informatics, Vienna, Austria, Vol. 1, pp. 18, July 23-27, 2007. http://www.indin2007.org/keynote_schott.php
[2] Wikipedia, Ambient intelligence. http://en.wikipedia.org/wiki/Ambient_intelligence
[3] X. Cai, N. Zhang, G. Venayagamoorthy, and D. C. Wunsch II, "Time series prediction with recurrent neural networks using a hybrid PSO-EA algorithm," in Proceedings of IEEE International Joint Conference on Neural Networks, N. Zhang, Ed., Vol. 2, pp. 1647-1652, 2004.
[4] H.-G. Zimmermann, R. Grothmann, A. Schäfer, and C. Tietz, "Dynamical consistent recurrent neural networks," in Proceedings of IEEE International Joint Conference on Neural Networks IJCNN '05, R. Grothmann, Ed., Vol. 3, pp. 1537-1541, 2005.
[5] J. L. Elman, "Finding structure in time," Cognitive Science, Vol. 14, No. 2, pp. 179-211, 1990.
[6] A. Waibel, T. Hanazawa, G. Hinton, and K. Shikano, "Phoneme recognition using time-delay neural networks," IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 37, No. 3, pp. 328-339, 1989.
[7] P. Mastorocostas and J. Theocharis, "On stable learning algorithm for block-diagonal recurrent neural networks, part 1: The RENNCOM algorithm," IEEE International Joint Conference on Neural Networks, Vol. 2, pp. 815-820, 2004.
[8] H. Jäger, "The 'echo state' approach to analyzing and training recurrent neural networks," German National Research Center for Information Technology, Technical Report GMD 148, 2001.
[9] H. Gao, R. Sollacher, and H.-P. Kriegel, "Spiral recurrent neural network for online learning," in Proceedings of the 15th European Symposium on Artificial Neural Networks, ESANN'2007, Bruges, Belgium, April 25-27, 2007, M. Verleysen, Ed., pp. 483-488, 2007.
[10] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, Vol. 1, pp. 270-280, 1989.
[11] R. Sollacher and H. Gao, "Efficient online learning with spiral recurrent neural networks," in Proceedings of the World Conference on Computational Intelligence WCCI '08, 2008.
[12] H. Gao and R. Sollacher, "Conditional prediction of time series using spiral recurrent neural network," in Proceedings of the European Symposium on Artificial Neural Networks - Advances in Computational Intelligence and Learning, Bruges, Belgium, April 23-25, 2008.
[13] J. Polastre, R. Szewczyk, and D. Culler, "Telos: Enabling ultra-low power wireless research," in The Fourth International Conference on Information Processing in Sensor Networks: Special track on Platform Tools and Design Methods for Network Embedded Sensors, pp. 364-369, 2005.
[14] D. Gay, P. Levis, R. von Behren, M. Welsh, E. Brewer, and D. Culler, "The nesC language: A holistic approach to networked embedded systems," ACM SIGPLAN Notices, Vol. 38, No. 5, pp. 1-11, 2003. http://www.tinyos.net/
[15] R. J. McEliece, E. C. Posner, E. R. Rodemich, and S. S. Venkatesh, "The capacity of the Hopfield associative memory," IEEE Transactions on Information Theory, Vol. 33, No. 4, pp. 461-482, 1987.
[16] F. Lewis, "Optimal estimation: With an introduction to stochastic control theory," A Wiley-Interscience Publication, ISBN 0-471-83741-5, 1986.
[17] R. Kalman, "A new approach to linear filtering and prediction problems," Transactions of the ASME - Journal of Basic Engineering, Vol. 82, pp. 35-45, 1960.
[18] G. Welch and G. Bishop, "An introduction to the Kalman filter," University of North Carolina at Chapel Hill, Department of Computer Science, Technical Report TR 95-041, 2002.