Journal of Software Engineering and Applications, 2012, 5, 128-133
doi:10.4236/jsea.2012.512b025 Published Online December 2012 (http://www.scirp.org/journal/jsea)
Multiple Action Sequence Learning and Automatic
Generation for a Humanoid Robot Using RNNPB
and Reinforcement Learning
Takashi Kuremoto1, Koichi Hashiguchi1, Keita Morisaki1, Shun Watanabe1, Kunikazu Kobayashi2,
Shingo Mabu1, Masanao Obayashi1
1Graduate School of Science and Engineering, Yamaguchi University, Ube, Yamaguchi, Japan; 2School of Information Science and
Technology, Aichi Prefectural University, Nagakute, Aichi, Japan.
Email: wu@yamaguchi-u.ac.jp
Received Month Day, Year (2012).
ABSTRACT
This paper proposes a method to learn and generate multiple action sequences of a humanoid robot. At first, all the basic action sequences, also called primitive behaviors, are learned by a recurrent neural network with parametric bias (RNNPB), and the values of the internal nodes called parametric bias (PB), which determine the output for the different primitive behaviors, are obtained. The training of the RNN uses the back-propagation through time (BPTT) method. After that, to generate the learned behaviors, or a more complex behavior composed of the primitive behaviors, a reinforcement learning algorithm, Q-learning (QL), is adopted to determine which PB value is appropriate for the generation. Finally, the effectiveness of the proposed method was confirmed by experimental results with a real humanoid robot.
Keywords: RNNPB; Humanoid Robot; BPTT; Reinforcement Learning; Multiple Action Sequences
1. Introduction
Recognizing, learning, and generating adaptive behaviors for an intelligent social robot is an appealing theme that has attracted researchers for more than a decade. From the view that dynamic complex behaviors of a robot are composed of spatiotemporally changing actions, so-called "primitive behaviors" or "element actions", gesture recognition has been approached by many methods such as 3D models [1], self-organizing maps (SOM) [2,3], hidden Markov models (HMM) [4-7], dynamic Bayesian networks (DBN) [8], recurrent neural networks (RNN) [9-11], and dynamic programming (DP) [12]. Tani and his colleagues proposed an RNN with parametric bias (RNNPB) which realizes not only recognition of multiple behaviors but also their learning and generation, based on the finding of the mirror neuron system in the brain [13,14]. The input of RNNPB consists of sensory (visual or auditory) information and the teacher's motor information during the learning period, and imitative behaviors are output (generated) by the network according to the robot's observation in the generation period.
In this paper, we propose to combine RNNPB and reinforcement learning (RL) [15] to realize the generation of multiple behaviors either (i) automatically or (ii) under the instruction of a human instructor. In other words, the adaptive PB values are determined as the result of RL in the generation process. Various patterns of primitive behaviors are learned by back-propagation through time (BPTT) [16,17], and PB vectors are obtained as the result. Considering the PB vectors as the finite states of a Markov decision process (MDP), a complex behavior can be learned as an optimal state transition process over these primitive behavior patterns using an RL algorithm such as Q-learning. Using a humanoid robot, PALRO (Fujisoft Inc., 2010), experimental results confirmed the effectiveness of the proposed method.
2. Proposed System
A multiple-behavior instruction learning and complex behavior learning system for a robot is proposed here. It works as the following process: (i) Time series data of the angles of the robot's joints for primitive behaviors are given by a user (instructor) of the robot and recorded as teacher signals; (ii) A recurrent neural network with parametric bias (RNNPB) [9-11] is trained with the error back-propagation method [16,17]; its outputs are time series joint angles when arbitrary initial angles are set as the
input of the network, so as to reproduce the different patterns of primitive behaviors; (iii) The temporal order of the different parametric bias (PB) vectors, each of which yields a different primitive behavior, is explored by the reinforcement learning (RL) algorithm [15]. The details of the proposed method are given in this section.
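As a reading aid, the whole procedure can be condensed into the short sketch below. Here `train_rnnpb` and `learn_pb_order` stand for steps (ii) and (iii); they are assumed interfaces for routines described later in this section, not functions defined by the original system.

```python
from typing import Callable, List, Sequence

def behavior_pipeline(teacher_trajectories: Sequence,   # (i) recorded joint-angle time series
                      train_rnnpb: Callable,             # (ii) RNNPB training via BPTT (assumed interface)
                      learn_pb_order: Callable) -> List: # (iii) Q-learning over PB vectors (assumed interface)
    """Sketch of the proposed system; the two callables are placeholders."""
    # Training yields the network and one PB vector per primitive behavior.
    rnnpb, pb_vectors = train_rnnpb(teacher_trajectories)
    # RL decides in which order the PB vectors should be used for the complex behavior.
    pb_sequence = learn_pb_order(pb_vectors)
    # Generation: the trained network replays each primitive behavior for the selected PB vector.
    return [rnnpb.generate(pb) for pb in pb_sequence]
```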
2.1. RNNPB
The recurrent neural network with parametric bias (RNNPB) [9-11] is a Jordan-type recurrent feedforward neural network [18] with three kinds of internal layers: a Hidden layer, a Context layer, and a parametric bias (PB) layer (Figure 1). Nodes in the Hidden layer and the Context layer have internal states with the sigmoid function:
$f(z) = \dfrac{1}{1 + e^{-\alpha z}}$.  (1)
where α, a positive constant, is the gradient of the function, and z is the input vector for the node.
Specifically, the input vector $z_h$ for the Hidden layer nodes is
$z_h = u_i v_i + u_{pb} v_{pb} + u_c v_c$,  (2)
where $u_i = x(t)$, $u_{pb}$, $u_c$ and $v_i$, $v_{pb}$, $v_c$ are the outputs and the connection weights of the Input layer, the PB layer, and the lower Context layer, respectively.
The input vectors $z_o$ and $z_c$ for the Output layer and the Context layer are given by
$z_o = u_h w_o = x(t+1)$,  (3)
$z_c = u_h w_c$.  (4)
Figure 1. The structure of RNNPB. Internal layers are shown in gray, and connections with synaptic weights between layers ((u_i, v_i), (u_pb, v_pb), (u_c, v_c), (u_h, w_o), (u_h, w_c)) are depicted with broken arrow lines. The Input layer receives x(t), the Output layer gives x(t+1), and the two Context layers shown are the same layer with the temporally varying values of its internal state, input x_c(t), and output x_c(t+1).
where $u_h = f(z_h)$ is the output of the Hidden layer given by Equation (1), and $w_o$, $w_c$ are the connection weights between the Hidden layer and the Output layer and the Context layer, respectively.
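As a concrete illustration, the forward computation of Equations (1)-(4) can be sketched in NumPy as below. The dictionary layout of the weights, the matrix shapes, the sigmoid applied to the PB internal state, and the linear Output layer implied by Equation (3) as reconstructed are our assumptions, not details fixed by the original text.

```python
import numpy as np

def sigmoid(z, alpha=2.0):
    """Equation (1): f(z) = 1 / (1 + exp(-alpha * z))."""
    return 1.0 / (1.0 + np.exp(-alpha * z))

def rnnpb_step(x_t, xc_t, pb_state, W, alpha=2.0, alpha_pb=5.0):
    """One forward step of RNNPB (a sketch; W is an assumed dict of weight matrices).

    x_t      : current joint-angle vector, u_i = x(t)
    xc_t     : current context vector, u_c = x_c(t)
    pb_state : internal state of the PB nodes
    """
    # PB output; we assume the PB internal state passes through the sigmoid with gradient alpha_pb.
    u_pb = sigmoid(pb_state, alpha_pb)
    # Equation (2): z_h = u_i v_i + u_pb v_pb + u_c v_c
    z_h = x_t @ W["v_i"] + u_pb @ W["v_pb"] + xc_t @ W["v_c"]
    u_h = sigmoid(z_h, alpha)                    # Hidden layer output, u_h = f(z_h)
    x_next = u_h @ W["w_o"]                      # Equation (3): z_o = u_h w_o = x(t+1)
    xc_next = sigmoid(u_h @ W["w_c"], alpha)     # Equation (4): z_c = u_h w_c, context output f(z_c)
    return x_next, xc_next
```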
The internal state $u_{pb}$ of the PB layer nodes changes with the back-propagated delta errors $\delta_t^{bp}$ over a period (a time series window) $l$ when the network is trained by the error back-propagation (BP) method [16,17]:
$\Delta u_{pb}^{t} = \eta_1 \sum_{t-l/2}^{t+l/2} \delta_t^{bp} + \eta_2 k_1 \left( u_{pb}^{t+1} - 2 u_{pb}^{t} + u_{pb}^{t-1} \right) + k_2 \Delta u_{pb}^{t-1}$.  (5)
where $\eta_1$, $\eta_2$ are the learning coefficient and the learning rate, and $k_1$, $k_2$ are the internal coefficients of the PB nodes.
The modification of the connection weights is executed by back-propagation through time (BPTT) [16,17]; that is, errors between the output of the network and the teacher data are used to adjust the connection weights. The detailed formulas are omitted here.
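A minimal sketch of the PB update of Equation (5) is given below, assuming the back-propagated delta errors for the PB nodes over the window of length l have already been computed by BPTT; the array shapes are our assumptions.

```python
import numpy as np

def update_pb(u_pb, delta_bp_window, prev_delta_u, u_pb_prev, u_pb_next,
              eta1=0.9, eta2=0.9, k1=0.9, k2=0.9):
    """Equation (5), sketched:
    du_pb(t) = eta1 * sum(delta^bp) + eta2 * k1 * (u_pb(t+1) - 2 u_pb(t) + u_pb(t-1)) + k2 * du_pb(t-1).

    delta_bp_window : array of back-propagated errors delta^bp over the window around t
    """
    smoothing = u_pb_next - 2.0 * u_pb + u_pb_prev    # low-pass term over neighboring time steps
    delta_u = (eta1 * delta_bp_window.sum(axis=0)
               + eta2 * k1 * smoothing
               + k2 * prev_delta_u)                   # momentum-like term from the previous update
    return u_pb + delta_u, delta_u
```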
2.2. Q-Learning Algorithm
Reinforcement learning (RL) is a kind of active learning method in which a learner finds its optimal action policy by an iterative process of exploration and exploitation [15]. For a finite state transition process, usually a Markov decision process (MDP), in which the transition is decided only by the transition probability of the last state, RL intends to find the optimal transition probabilities by adopting value functions of states and state-action pairs. The state-action value function, usually called the Q function, serves as an index variable in a stochastic action selection policy. In this study, we use a traditional RL method named Q-learning (QL) [15], and its learning algorithm is as follows.
QL algorithm:
Step 1 Initialize Q(s, a) = 0.0, where s and a range over the available finite state space and action space of the robot, respectively.
Step 2 Observe the state s of the environment around the learner.
Step 3 Select an action a to change the state according to a stochastic policy. For example, select the action with the highest value of Q(s, a) for the current state with a large probability, and select the other candidate actions with a small probability ε. (Note that if the number of actions is A, the selection probability of the action with the highest Q value is 1 - ε + ε/A, and the selection probability of any other action is ε/A, where 0 ≤ ε ≤ 1.)
Step 4 Receive a reward/punishment R from the environment/instructor.
Step 5 Renew Q(s, a) as follows:
$Q(s,a) \leftarrow Q(s,a) + \lambda \left[ R + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$.  (6)
where $\lambda$ and $\gamma$ ($0 \le \lambda, \gamma \le 1$) are the learning rate and the discount rate, respectively.
Step 6 Repeat Step 2 to Step 5 until the value of Q(s, a) converges.
The state space in our system is defined as the different PB vectors, and the action space of QL also consists of these PB vectors, fixed after the BP learning process. The optimal state transition process therefore provides the correct combination of primitive behaviors that forms a complex behavior of the robot, or the correct execution of a single primitive behavior.
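The Q-learning update of Equation (6) together with the ε-greedy selection of Step 3 can be sketched as follows, with the PB vectors indexed as discrete states and actions. The environment/instructor interface `step_fn` and the episode structure are assumptions made for the sketch.

```python
import numpy as np

def q_learning(step_fn, n_states=4, n_actions=4, episodes=100,
               lam=0.1, gamma=0.9, eps=0.1, rng=np.random.default_rng(0)):
    """Tabular Q-learning sketch; states and actions index the learned PB vectors.

    step_fn(s, a) -> (next_state, reward, done) is an assumed interface that performs
    the primitive behavior of PB vector a and returns the instructor's reward.
    """
    Q = np.zeros((n_states, n_actions))            # Step 1: initialize Q(s, a) = 0
    for _ in range(episodes):
        s, done = 0, False                         # Step 2: observe the initial state
        while not done:
            if rng.random() < eps:                 # Step 3: epsilon-greedy selection
                a = int(rng.integers(n_actions))   #   any action with probability eps/A each
            else:
                a = int(np.argmax(Q[s]))           #   best action with probability 1 - eps + eps/A
            s_next, r, done = step_fn(s, a)        # Step 4: reward from environment/instructor
            # Step 5: Q(s,a) <- Q(s,a) + lam * (R + gamma * max_a' Q(s',a') - Q(s,a))
            Q[s, a] += lam * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next                             # Step 6: repeat until Q converges
    return Q
```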
3. Experiments
The proposed method was applied to complex behavior learning and generation of a humanoid robot named PALRO (Fujisoft, Inc., 2010), as shown in Figure 2. There are 20 joints (actuators) in PALRO (arms, legs, neck, and body), and the control of these joint angles in time series composes various actions of the robot. Two kinds of experiments were designed:
Experiment I: a time series of joint angles yields a primitive behavior, such as raising a hand or turning to the left/right, and several primitive behaviors yield a complex behavior of the robot, such as a "dance" behavior;
Experiment II: 3 kinds of voice instructions corresponding to 3 kinds of behaviors of the robot were learned and recognized.
Details of the experiments and their results are described in this section.

Figure 2. The robot used in the experiments: PALRO, a product of Fujisoft, Inc., 2010.

Figure 3. Primitive behaviors of the robot in Experiment I: (a) Pattern 1; (b) Pattern 2; (c) Pattern 3; (d) Pattern 4.
3.1. Experiment I: Complex Behavior Learning
and Generation
3.1.1 Primitive Behavior Learning / Generation
We designed 4 patterns of primitive behaviors of the robot, as shown in Figure 3: (a) Turn to the left and shake the hands; (b) Turn to the right and shake the hands; (c) Turn to the right and shake the hands; (d) Raise the left hand and stop in a special pose. An angle vector with 20 dimensions served as the input of RNNPB; that is, the number of nodes in the Input layer and in the Output layer was 20, respectively. Teacher signals of the primitive behaviors were recorded by the storage function of the robot, that is, the time series values of the joint angles of the movements (primitive behaviors) obtained while an instructor drove the joints. The parameters of RNNPB and its learning process used in the experiment are listed in Table 1.
The training results of RNNPB for the 4 primitive behaviors are shown in Figure 4, where (a) shows the learning curve (iteration time vs. RMSE); (b) the PB values of the 4 patterns of behaviors; and (c)-(f) the time series values of the 20 joint angles of the behaviors (generation with 30% teacher signals). The time interval "step" was set to 0.1 second/step in (c)-(f).
Table 1. Parameters used in RNNPB in Experiment I.

Description | Symbol | Value
The number of nodes in the Input layer | N | 20
The number of nodes in the Output layer | N | 20
The number of nodes in the Hidden layer | H | 30
The number of nodes in the PB layer | P | 2
The number of nodes in the Context layer | M | 30
Learning rate of BP for connections | β | 0.02
Learning coefficient of PB nodes | η1 | 0.9
Learning rate of PB nodes | η2 | 0.9
Length of time series (width of time window) | l | 20
Internal coefficients of PB nodes | k1, k2 | 0.9, 0.9
Gradient of sigmoid function | α | 2.0
Gradient of sigmoid function for PB | α_pb | 5.0
In fact, the input signal during the generation process was given by the following equation:
$x(t) = (1 - r)\,x(t-1) + r\,x_d(t-1)$  (7)
where $x(t-1)$ is the output of the network at time $t-1$, $x_d(t-1)$ is the teacher data, and $r$ is the ratio of the teacher signal. When the teacher signals were not added, i.e., $r = 0.0$, the output of the network easily fell into a static state; this problem needs to be improved in a future study.
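The closed-loop generation with teacher-signal ratio r (Equation (7)) can be sketched as below; `rnnpb_step` refers to the forward-pass sketch in Section 2.1, the PB state of the selected pattern is held fixed, and the array arguments are assumptions.

```python
def generate(rnnpb_step, W, pb_state, x0, xc0, teacher, r=0.3, steps=20):
    """Generate a joint-angle trajectory, mixing the teacher signal with ratio r:
    x(t) = (1 - r) * x(t-1) + r * x_d(t-1) (Equation (7)); with r = 0 the loop is fully closed."""
    x, xc, trajectory = x0, xc0, []
    for t in range(steps):
        x_in = (1.0 - r) * x + r * teacher[t]     # Equation (7): blend prediction and teacher data
        x, xc = rnnpb_step(x_in, xc, pb_state, W) # forward step with the fixed PB state
        trajectory.append(x)
    return trajectory
```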
3.1.2. Complex Behavior Learning / Generation
Using the Q-learning algorithm (QL) described in the last section, we decided the required order of the primitive patterns that composes the complex behavior: a "dance" of the robot. The QL was defined with 4 states, that is, the 4 values of the PB nodes, and 4 actions identical to these PB values. The training results gave the order of PB values used in the generation process of the robot as follows: "Pattern 3 - Pattern 1 - Pattern 2 - Pattern 3 - Pattern 4".
Figure 4. Learning results of the primitive behaviors: (a) learning curve of RNNPB; (b) PB values of the 4 patterns; (c)-(f) joint angles generated for Patterns 1-4.
The reward to reinforce the adaptive transition was input by the voice of the instructor: "Good" said by the instructor indicated a reward value of R = 0.1, and "Bad" meant R = -0.1. The parameters used in the QL are listed in Table 2. The learning curve (iteration times vs. success rate, averaged over 10 experiments) and the final time series values of the joint angles of the complex behavior "dance" are shown in Figure 5. The success rate of the dance composed by the fixed order of primitive behaviors reached 100.0% after 12 trials of QL.
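The verbal feedback reduces to the numerical reward R of Table 2 with a trivial mapping such as the one below; treating any other utterance as zero reward is our assumption. The returned value is the R used in the Q-learning update of Equation (6), e.g. as the reward produced by the `step_fn` interface in the earlier sketch.

```python
def reward_from_voice(utterance):
    """Map the instructor's verbal feedback to the reward R of Table 2.
    "Good" -> +0.1, "Bad" -> -0.1; anything else is treated as no reward (an assumption)."""
    return {"Good": 0.1, "Bad": -0.1}.get(utterance, 0.0)
```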
A movie of the complex behavior generation is available at:
http://www.nn.csse.yamaguchi-u.ac.jp/images/research/Palro11Hashiguchi.wmv
Table 2. Parameters used in Q-learning.

Description | Symbol | Value
Learning rate of Q | λ | 0.1
Discount rate | γ | 0.9
Reward (positive / negative) | R | 0.1 / -0.1
Rate of random action | ε | 0.1
Figure 5. Learning results of the complex behavior "dance": (a) learning curve of QL; (b) time series values of the 20 joint angles realizing the "dance" as a composition of the 4 primitive behaviors.
3.2. Experiment II: Behavior Instruction Learning and Recognition
Voice instructions can be captured and recognized by the recorder and microphone of PALRO. However, the special behaviors need to be taught by the instructor and learned by the RNNPB learning system, and the relationship between the PB values and the voice instructions can be decided by the QL algorithm in the same way as the order decision of primitive behaviors for complex behavior learning and generation. In this experiment, we designed 3 kinds of behaviors for PALRO, whose static pictures are shown in Figure 6: (a) Shake a hand; (b) Raise 2 hands; (c) A handclap. Because the behaviors involved only several joints, 8 input/output nodes were used in RNNPB, and the other parameters used in Experiment II are listed in Table 3. The parameters used in QL were the same as in Experiment I (Table 2). Figure 7 shows the scene of the teaching process, where the angles of 8 joints were moved by the instructor and recorded as time series data serving as the teacher signal of a behavior. The voice instruction learning and recognition results also achieved a 100.0% success rate; the details are omitted here due to the limit of space.
Figure 6. The 3 behaviors designed in Experiment II: (a) Raise a hand; (b) Raise 2 hands; (c) A handclap.
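Once QL has converged, the learned association between voice instructions (states) and behaviors (actions/PB vectors) can be read out as a simple lookup; the dictionary-based readout and the instruction strings below are illustrative assumptions, not part of the original system.

```python
import numpy as np

def build_instruction_map(Q, instructions, pb_vectors):
    """After QL converges, map each recognized voice instruction (a state index) to the
    PB vector of the behavior with the highest Q value (an action index)."""
    return {cmd: pb_vectors[int(np.argmax(Q[i]))] for i, cmd in enumerate(instructions)}

# Usage sketch with the three behaviors of Experiment II (instruction names are hypothetical):
# mapping = build_instruction_map(Q, ["shake a hand", "raise two hands", "handclap"], pb_vectors)
```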
Table 3. Parameters used in RNNPB in Experiment II.

Description | Symbol | Value
The number of nodes in the Input layer | N | 8
The number of nodes in the Output layer | N | 8
The number of nodes in the Hidden layer | H | 20
The number of nodes in the PB layer | P | 2
The number of nodes in the Context layer | M | 30
Learning rate of BP for connections | β | 0.01
Learning coefficient of PB nodes | η1 | 0.1
Learning rate of PB nodes | η2 | 0.9
Length of time series (width of time window) | l | 20
Internal coefficients of PB nodes | k1, k2 | 0.8, 0.5
Gradient of sigmoid function | α | 2.0
Gradient of sigmoid function for PB | α_pb | 5.0
Figure 7. The scene of the teaching process in Experiment II.
4. Conclusion
The combination of a recurrent neural network with parametric bias (RNNPB) and a reinforcement learning algorithm was proposed to realize complex behavior learning and generation for a robot. All the joint angles of the robot were considered as the input and output of RNNPB, and their time series data first formed several patterns of primitive behaviors of the robot; then, complex behaviors of the robot were composed from time series of different primitive behaviors. The learning rule of RNNPB used the error back-propagation through time (BPTT) method, and to generate a series of primitive behaviors in the correct order, Q-learning (QL) was adopted in the training process. Using the humanoid robot PALRO, the effectiveness of the proposed method was confirmed by the results of two kinds of experiments. The generation of primitive behaviors showed a satisfactory reproduction of the required movements when a certain amount of teacher signal was added during the generation process, and a 100.0% success rate of the complex behavior "dance" was acquired after training with the QL algorithm. Voice instruction learning and recognition also reached a 100.0% success rate in the experiment.
5. Acknowledgements
A part of this work was supported by Grant-in-Aid for
Scientific Research (JSPS 23500181) and Foundation for
the Fusion of Science and Technology (FOST).
REFERENCES
[1] V. I. Pavlovic, R. Sharma and T. S. Huang, "Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, 1997, pp. 667-695.
[2] C. Nolker and H. Ritter, "Parametrized SOMs for Hand Posture Reconstruction", Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN'00), 2000, pp. 139-144.
[3] G. Heidemann, H. Bekel, I. Bax and A. Saalbach, "Hand Gesture Recognition: Self-Organising Maps as a Graphical User Interface for the Partitioning of Large Training Data Sets", Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), 2004, pp. 487-490.
[4] C.-L. Huang, M.-S. Wu and S.-H. Jeng, "Gesture Recognition Using the Multi-PDM Method and Hidden Markov Model", Image and Vision Computing, Vol. 18, No. 11, 2000, pp. 865-879.
[5] R. Amit and M. Mataric, "Learning Movement Sequences from Demonstration", Proceedings of the 2nd IEEE International Conference on Development and Learning (ICDL'02), Cambridge, MA, 2002, pp. 203-208.
[6] M. Hossain and M. Jenkin, "Recognizing Hand-Raising Gestures Using HMM", Proceedings of the 2nd Canadian Conference on Computer and Robot Vision (CRV'05), 2005, pp. 405-412.
[7] G. Caridakis, K. Karpouzis, A. Drosopoulos and S. Kollias, "SOMM: Self Organizing Markov Map for Gesture Recognition", Pattern Recognition Letters, Vol. 31, No. 1, 2010, pp. 52-59.
[8] H. Suk, B. Sin and S. Lee, "Recognize Hand Gesture Using Dynamic Bayesian Network", Proceedings of the 8th IEEE International Conference on Automatic Face & Gesture Recognition (FG'08), 2008, pp. 1-6.
[9] J. Tani, "Learning to Generate Articulated Behavior through the Bottom-Up and the Top-Down Interaction Process", Neural Networks, Vol. 16, 2003, pp. 11-23.
[10] J. Tani and M. Ito, "Self-Organization of Behavioral Primitives as Multiple Attractor Dynamics: A Robot Experiment", IEEE Transactions on Systems, Man, and Cybernetics, Vol. 33, No. 4, 2003, pp. 481-488.
[11] J. Tani, M. Ito and Y. Sugita, "Self-Organization of Distributedly Represented Multiple Behavior Schemata in a Mirror System: Reviews of Robot Experiments Using RNNPB", Neural Networks, Vol. 17, 2004, pp. 1273-1289.
[12] T. Kuremoto, Y. Kinoshita, L. B. Feng, S. Watanabe, K. Kobayashi and M. Obayashi, "A Gesture Recognition System with Retina-V1 Model and One-Pass Dynamic Programming", Neurocomputing, 2012, in press. doi: 10.1016/j.neucom.2012.03.27
[13] G. Rizzolatti and L. Craighero, "The Mirror-Neuron System", Annual Review of Neuroscience, Vol. 27, 2004, pp. 169-192.
[14] E. Oztop, M. Kawato and M. Arbib, "Mirror Neurons and Imitation: A Computationally Guided View", Neural Networks, Vol. 19, 2006, pp. 254-271.
[15] R. S. Sutton and A. G. Barto, "Reinforcement Learning: An Introduction", The MIT Press, Cambridge, 1998.
[16] J. L. Elman, "Finding Structure in Time", Cognitive Science, Vol. 14, 1990, pp. 179-211.
[17] D. Rumelhart, G. E. Hinton and R. J. Williams, "Learning Representations by Back-Propagating Errors", Nature, Vol. 323, 1986, pp. 533-536.
[18] M. I. Jordan, "Attractor Dynamics and Parallelism in a Connectionist Sequential Machine", IEEE Computer Society Neural Networks Technology Series, 1990, pp. 112-127.