A Journal of Software Engineering and Applications, 2012, 5, 128-133
doi:10.4236/jsea.2012.512b025 Published Online December 2012 (http://www.scirp.org/journal/jsea)
Copyright © 2012 SciRes. JSEA

Multiple Action Sequence Learning and Automatic Generation for a Humanoid Robot Using RNNPB and Reinforcement Learning

Takashi Kuremoto1, Koichi Hashiguchi1, Keita Morisaki1, Shun Watanabe1, Kunikazu Kobayashi2, Shingo Mabu1, Masanao Obayashi1

1Graduate School of Science and Engineering, Yamaguchi University, Ube, Yamaguchi, Japan; 2School of Information Science and Technology, Aichi Prefectural University, Nagakute, Aichi, Japan.
Email: wu@yamaguchi-u.ac.jp

Received Month Day, Year (2012).

ABSTRACT

This paper proposes a method for learning and generating multiple action sequences of a humanoid robot. First, all the basic action sequences, also called primitive behaviors, are learned by a recurrent neural network with parametric bias (RNNPB), and the values of the internal parametric bias (PB) nodes that determine which primitive behavior is output are obtained. The RNN is trained with the back-propagation through time (BPTT) method. Then, to generate the learned behaviors, or a more complex behavior composed of the primitive behaviors, a reinforcement learning algorithm, Q-learning (QL), is adopted to determine which PB value is appropriate for the generation. Finally, experiments with a real humanoid robot confirmed the effectiveness of the proposed method.

Keywords: RNNPB; Humanoid Robot; BPTT; Reinforcement Learning; Multiple Action Sequences

1. Introduction

Recognizing, learning, and generating adaptive behaviors for an intelligent social robot is an attractive theme that has drawn researchers' attention for more than a decade.
From the view that the dynamic complex behaviors of a robot are composed of spatiotemporally changing actions, so-called "primitive behaviors" or "element actions", gesture recognition has been approached by many methods, such as 3D models [1], self-organizing maps (SOM) [2,3], hidden Markov models (HMM) [4-7], dynamic Bayesian networks (DBN) [8], recurrent neural networks (RNN) [9-11], and dynamic programming (DP) [12]. Tani and his colleagues proposed an RNN with parametric bias (RNNPB) which realizes not only the recognition of multiple behaviors but also their learning and generation, based on findings about the mirror neuron system in the brain [13,14]. The input of the RNNPB includes sensory (visual or auditory) information and the teacher's motor information during the learning period, and imitative behaviors are output (generated) by the network according to the robot's observations during the generation period. In this paper, we propose to combine RNNPB and reinforcement learning (RL) [15] to realize multiple-behavior generation either (i) automatically or (ii) by the instruction of a human instructor. In other words, the adaptive PB values are determined as the result of RL in the generation process. Various patterns of primitive behaviors are learned by back-propagation through time (BPTT) [16,17], and PB vectors are obtained as the result. Considering the PB vectors as the finite states of a Markov decision process (MDP), a complex behavior can be learned as an optimal state-transition process over these primitive behavior patterns using an RL algorithm such as Q-learning. Experiments with a humanoid robot, "PALRO" (Fujisoft Inc., 2010), confirmed the effectiveness of the proposed method.

2. Proposed System

A multiple-behavior instruction learning and complex-behavior learning system for a robot is proposed here.
The system works as the following process: (i) time-series data of the angles of the robot's joints for primitive behaviors are given by a user (instructor) of the robot and recorded as teacher signals; (ii) a recurrent neural network with parametric bias (RNNPB) [9-11] is trained with the error back-propagation method [16,17] so that, when arbitrary initial angles are set as the input of the network, its outputs are the time-series joint angles of the different patterns of primitive behaviors; (iii) the temporal order of the different parametric bias (PB) vectors, each of which yields a different primitive behavior, is explored by the reinforcement learning (RL) algorithm [15]. The details of the proposed method are given in this section.

2.1. RNNPB

The recurrent neural network with parametric biases (RNNPB) [9-11] is a Jordan-type recurrent feed-forward neural network [18] with three kinds of internal layers: a hidden layer, a context layer, and a parametric bias (PB) layer (Figure 1). Nodes in the Hidden layer and the Context layer have internal states with a sigmoid activation function:

f(z) = 1 / (1 + e^{-αz}),  (1)

where α, a positive constant, is the gradient of the function and z is the input vector of the node. In particular, the input vector z_h of the Hidden layer nodes is

z_h = u_i v_i + u_pb v_pb + u_c v_c,  (2)

where u_i = x(t), u_pb, u_c and v_i, v_pb, v_c are the outputs and the connection weights of the Input layer, the PB layer, and the lower Context layer, respectively. The input vectors z_o and z_c of the Output layer and the Context layer are given by

x(t+1) = z_o = u_h w_o,  (3)

z_c = u_h w_c.  (4)

Figure 1. The structure of RNNPB.
Internal layers are shown in gray, and connections with synaptic weights between layers are depicted with broken arrow lines. The two Context layers are the same layer, shown at consecutive time steps of its internal state, input, and output.

Here u_h = f(z_h) is the output of the Hidden layer given by Equation (1), and w_o, w_c are the connection weights between the Hidden layer and the Output layer and the Context layer, respectively. For the nodes of the PB layer, the internal state u_pb changes with the back-propagated errors δ_bp^t over a period (a time-series window) l when the network is trained by the error back-propagation (BP) method [16,17]:

Δu_pb^t = η1 k1 Σ_{s=t−l/2}^{t+l/2} δ_bp^s + k2 (u_pb^{t+1} − 2 u_pb^t + u_pb^{t−1}) + η2 Δu_pb^{t−1},  (5)

where η1 is the learning coefficient, η2 the learning rate, and k1, k2 the internal coefficients of the PB nodes (Table 1). The modification of the connection weights is executed by back-propagation through time (BPTT) [16,17]; that is, the errors between the output of the network and the teacher data are used to adjust the connection weights. The detailed formulas are omitted here.

2.2. Q-Learning Algorithm

Reinforcement learning (RL) is a kind of active learning method that makes a learner find its optimal action policy through an iterative process of exploration and exploitation [15]. For a finite state-transition process, usually a Markov decision process (MDP), in which each transition is decided only by the current state, RL intends to find the optimal transition probabilities by adopting value functions of states and state-action pairs. The state-action value function, usually called the Q function, serves as an index variable in a stochastic action-selection policy. In this study, we use a traditional RL method named Q-learning (QL) [15], whose learning algorithm is as follows.
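Before turning to QL, Equations (1)-(5) can be summarized in a short NumPy sketch. This is a minimal reading of the formulas above, not the authors' implementation: the class layout, the random weight initialization, and the use of a separate PB sigmoid (gradient α_pb = 5.0 from Table 1) to obtain the PB output are our assumptions; the layer sizes follow Table 1.

```python
import numpy as np

def sigmoid(z, alpha=2.0):
    # Equation (1): f(z) = 1 / (1 + exp(-alpha * z))
    return 1.0 / (1.0 + np.exp(-alpha * z))

class RNNPBStep:
    """One forward step of RNNPB, Equations (1)-(4).

    Layer sizes default to Table 1 (20 input/output, 30 hidden/context,
    2 PB nodes); weight names v_i, v_pb, v_c, w_o, w_c follow Figure 1.
    """
    def __init__(self, n_io=20, n_hid=30, n_ctx=30, n_pb=2, seed=0):
        rng = np.random.default_rng(seed)
        self.v_i  = rng.normal(scale=0.1, size=(n_io,  n_hid))  # Input -> Hidden
        self.v_pb = rng.normal(scale=0.1, size=(n_pb,  n_hid))  # PB -> Hidden
        self.v_c  = rng.normal(scale=0.1, size=(n_ctx, n_hid))  # Context -> Hidden
        self.w_o  = rng.normal(scale=0.1, size=(n_hid, n_io))   # Hidden -> Output
        self.w_c  = rng.normal(scale=0.1, size=(n_hid, n_ctx))  # Hidden -> Context

    def forward(self, x_t, u_pb, ctx):
        u_i = x_t                            # Input-layer output
        u_pb_out = sigmoid(u_pb, alpha=5.0)  # PB output (alpha_pb = 5.0, assumed)
        z_h = u_i @ self.v_i + u_pb_out @ self.v_pb + ctx @ self.v_c  # Eq. (2)
        u_h = sigmoid(z_h)                   # Eq. (1)
        x_next = u_h @ self.w_o              # Eq. (3): x(t+1) = z_o
        ctx_next = sigmoid(u_h @ self.w_c)   # Eq. (4), with the sigmoid of Eq. (1)
        return x_next, ctx_next

def pb_update(delta_bp_window, u_pb_prev, u_pb, u_pb_next, d_prev,
              eta1=0.9, eta2=0.9, k1=0.9, k2=0.9):
    # Equation (5): window-summed BP error + second-difference smoothing
    # of neighboring PB states + momentum on the previous change.
    return (eta1 * k1 * np.sum(delta_bp_window, axis=0)
            + k2 * (u_pb_next - 2.0 * u_pb + u_pb_prev)
            + eta2 * d_prev)
```

During training the delta errors `delta_bp_window` would come from BPTT over the window of length l; the sketch only fixes the shapes and the update arithmetic.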
QL algorithm:

Step 1 Initialize Q(s, a) = 0.0, where s and a range over the available finite state space and action space of the robot, respectively.
Step 2 Observe the state s of the environment around the learner.
Step 3 Select an action a to change the state according to a stochastic policy. For example, select the action with the highest value of Q(s, a) for the current state with a large probability, and select the other candidate actions with a small probability governed by ε. (Note that if the number of actions is A, the selection probability of the action with the highest Q value is 1 − ε + ε/A, and the selection probability of any other action is ε/A, where 0 ≤ ε ≤ 1.)
Step 4 Receive a reward/punishment R from the environment/instructor.
Step 5 Update Q(s, a) as follows:

Q(s, a) ← Q(s, a) + λ [R + γ max_{a'} Q(s', a') − Q(s, a)],  (6)

where λ and γ (0 ≤ λ, γ ≤ 1) are the learning rate and the discount rate, respectively.
Step 6 Repeat Step 2 to Step 5 until the values of Q(s, a) converge.

The state space in our system is defined by the different PB vectors, and the action space of QL consists of the same PB vectors, fixed after the BP learning process. The optimal state-transition process therefore yields the correct combination of primitive behaviors into a complex behavior of the robot, or the correct execution of a primitive behavior.

3. Experiments

The proposed method was applied to complex-behavior learning and generation on a humanoid robot named "PALRO" (Fujisoft, Inc., 2010), as shown in Figure 2. PALRO has 20 joints (actuators) (arms, legs, neck, and body), and the control of these joint angles in time series composes the various actions of the robot.
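The tabular QL of Section 2.2 — the ε-greedy selection of Step 3 and the update of Equation (6) — admits a compact sketch. The dictionary representation of Q, the function names, and the four-state setup (one state/action per PB vector) are our own illustration, not the code used in the experiments:

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    # Step 3: with probability eps pick uniformly among all actions,
    # so the greedy action is chosen with probability 1 - eps + eps/A
    # and every other action with probability eps/A.
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_update(Q, s, a, r, s_next, actions, lam=0.1, gamma=0.9):
    # Equation (6): Q(s,a) <- Q(s,a) + lam * [R + gamma * max_a' Q(s',a') - Q(s,a)]
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += lam * (r + gamma * best_next - Q[(s, a)])

# Step 1: four states/actions, one per learned PB vector, all Q values 0.0.
states = actions = [0, 1, 2, 3]
Q = {(s, a): 0.0 for s in states for a in actions}
```

The rewards R = ±0.1 and the rates λ = 0.1, γ = 0.9, ε = 0.1 used in the experiments (Table 2) plug directly into these functions.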
Two kinds of experiments were designed:

Experiment I: a time series of joint angles yields a primitive behavior, such as raising a hand or turning to the left/right, and several primitive behaviors yield a complex behavior of the robot, such as a "dance";
Experiment II: 3 kinds of voice instructions corresponding to 3 kinds of robot behaviors were learned and recognized.

Figure 2. The robot used in the experiments: PALRO, a product of Fujisoft, Inc., 2010.

Figure 3. Primitive behaviors of the robot in Experiment I: (a) Pattern 1; (b) Pattern 2; (c) Pattern 3; (d) Pattern 4.

The details of the experiments and their results are described in this section.

3.1. Experiment I: Complex Behavior Learning and Generation

3.1.1. Primitive Behavior Learning / Generation

We designed 4 patterns of primitive behaviors of the robot, as shown in Figure 3: (a) turn to the left and shake the hands; (b) turn to the right and shake the hands; (c) turn to the right and shake the hands; (d) raise the left hand and stop in a special pose. An angle vector with 20 dimensions served as the input of the RNNPB; that is, the numbers of nodes in the Input layer and the Output layer were both 20. Teacher signals of the primitive behaviors were recorded by the storage function of the robot; that is, the time-series values of the joint angles of the movements (primitive behaviors) were obtained while an instructor drove the joints. The parameters of the RNNPB and its learning process used in the experiment are listed in Table 1.

Training results of the RNNPB for the 4 primitive behaviors are shown in Figure 4, where (a) shows the learning curve (iterations vs. RMSE); (b) the PB values of the 4 behavior patterns; and (c)-(f) the time-series values of the 20 joint angles of the behaviors (generation with a 30% teacher signal). The time interval ("step") was set to 0.1 second/step in (c)-(f).

Table 1. Parameters used in RNNPB in Experiment I.
Description                                    Symbol    Value
Number of nodes in the Input layer             N         20
Number of nodes in the Output layer            N         20
Number of nodes in the Hidden layer            H         30
Number of nodes in the PB layer                P         2
Number of nodes in the Context layer           M         30
Learning rate of BP for connections            β         0.02
Learning coefficient of PB nodes               η1        0.9
Learning rate of PB nodes                      η2        0.9
Length of time series (width of time window)   l         20
Internal coefficients of PB nodes              k1, k2    0.9, 0.9
Gradient of sigmoid function                   α         2.0
Gradient of sigmoid function for PB            α_pb      5.0

During the generation process, the input signal was given by the following equation:

x(t) = (1 − r) x(t−1) + r x_d(t−1),  (7)

where x(t−1) is the output of the network at time t−1, x_d(t−1) is the teacher data, and r is the ratio of the teacher signal. When r = 0.0, that is, when no teacher signal was added, the output of the network easily fell into a static state; this problem needs to be improved in a future study.

3.1.2. Complex Behavior Learning / Generation

Using the Q-learning algorithm (QL) described in the last section, we decided the required order of the primitive patterns to compose the complex behavior: a "dance" of the robot. The QL was defined with 4 states, i.e., the 4 values of the PB nodes, and with 4 actions given by the same PB values. The training results gave the order of PB values used in the generation process of the robot as follows: "Pattern 3 - Pattern 1 - Pattern 2 - Pattern 3 - Pattern 4".

Figure 4. Learning results of the primitive behaviors: (a) learning curve of RNNPB (iterations vs. RMSE); (b) PB values of the 4 patterns; (c)-(f) angles for Patterns 1-4.

The reward to reinforce the adaptive transition was
input by the voice of the instructor: "Good" said by the instructor indicated a reward value of R = 0.1, and "Bad" meant R = −0.1. The parameters used in the QL are listed in Table 2. The learning curve (iterations vs. success rate, averaged over 10 experiments) and the final time-series values of the joint angles of the complex behavior "dance" are shown in Figure 5. The success rate of the dance composed of the fixed order of primitive behaviors reached 100.0% after 12 trials of QL. A movie of the complex-behavior generation is shown on the WWW site: http://www.nn.csse.yamaguchi-u.ac.jp/images/research/Palro11Hashiguchi.wmv

Table 2. Parameters used in Q-Learning.

Description                     Symbol    Value
Learning rate of Q              λ         0.1
Discount rate                   γ         0.9
Reward (positive / negative)    R         0.1 / −0.1
Rate of random action           ε         0.1

Figure 5. Learning results of the complex behavior "dance": (a) learning curve of QL; (b) time-series values of the 20 joint angles realizing the "dance" as a composition of 4 primitive behaviors.

3.2. Experiment II: Behavior Instruction Learning and Recognition

Voice instructions can be captured and recognized by the recorder and microphone of PALRO. However, special behaviors need to be learned by the instructor and the learning system RNNPB, and the relationship between the PB values and the voice instructions can be decided by the QL algorithm in the same way as the order decision of primitive behaviors for complex-behavior learning and generation. In this experiment, we designed 3 kinds of behaviors for PALRO, whose static pictures are shown in Figure 6: (a) shake a hand; (b) raise 2 hands; (c) a handclap. Because the behaviors were limited to several joints, 8 input/output nodes were designed in the RNNPB; the other parameters used in Experiment II are listed in Table 3.
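To make the generation stage of Experiment I concrete, the following sketch runs a trained RNNPB once per PB vector in the QL-learned order (Pattern 3 - 1 - 2 - 3 - 4), feeding each output back as the next input and optionally mixing in teacher data per Equation (7). The `step` callable stands in for the trained network; the function signature, the loop structure, and the step count per behavior are our own assumptions:

```python
import numpy as np

def generate(step, pb_sequence, x0, ctx0, steps_per_behavior=20,
             x_teacher=None, r=0.3):
    # Closed-loop generation: step is any callable
    # (x, pb, ctx) -> (x_next, ctx_next) wrapping the trained RNNPB.
    # With teacher data present, the input is mixed as in Equation (7):
    #   x(t) = (1 - r) * x(t-1) + r * x_d(t-1)
    # (r = 0.3 mirrors the 30% teacher signal mentioned for Figure 4).
    x, ctx, trajectory = x0, ctx0, []
    t = 0
    for pb in pb_sequence:
        for _ in range(steps_per_behavior):
            if x_teacher is not None:
                x = (1.0 - r) * x + r * x_teacher[t]  # Eq. (7)
            x, ctx = step(x, pb, ctx)
            trajectory.append(x)
            t += 1
    return np.array(trajectory)
```

With the PB vectors learned in Section 3.1.1, `pb_sequence` would hold the five vectors for Patterns 3, 1, 2, 3, 4, and the returned trajectory would be the 20-dimensional joint-angle series sent to the robot.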
The parameters used in the QL were the same as in Experiment I (Table 2). Figure 7 shows a scene of the teaching process, in which the angles of the 8 joints were changed by the instructor and recorded as time-series data forming the teacher signal of a behavior. The voice-instruction learning and recognition also achieved a 100.0% success rate; the details are omitted here for reasons of space.

Figure 6. Behaviors of the robot in Experiment II: (a) raise a hand; (b) raise 2 hands; (c) a handclap.

Table 3. Parameters used in RNNPB in Experiment II.

Description                                    Symbol    Value
Number of nodes in the Input layer             N         8
Number of nodes in the Output layer            N         8
Number of nodes in the Hidden layer            H         20
Number of nodes in the PB layer                P         2
Number of nodes in the Context layer           M         30
Learning rate of BP for connections            β         0.01
Learning coefficient of PB nodes               η1        0.1
Learning rate of PB nodes                      η2        0.9
Length of time series (width of time window)   l         20
Internal coefficients of PB nodes              k1, k2    0.8, 0.5
Gradient of sigmoid function                   α         2.0
Gradient of sigmoid function for PB            α_pb      5.0

Figure 7. A scene of the teaching process in Experiment II.

4. Conclusion

The combination of a recurrent neural network with parametric bias (RNNPB) and a reinforcement learning algorithm was proposed to realize complex-behavior learning and generation for a robot. All the joint angles of the robot were taken as the input and output of the RNNPB, and their time-series data formed the patterns of the primitive behaviors of the robot; complex behaviors of the robot were then composed as time series of different primitive behaviors.
The learning rule of the RNNPB used the error back-propagation through time (BPTT) method, and, to generate a series of primitive behaviors in the correct order, Q-learning (QL) was adopted in the training process. Using a humanoid robot, "PALRO", the effectiveness of the proposed method was confirmed by the results of two kinds of experiments. The generation of primitive behaviors showed a satisfactory reproduction of the required movements when a certain amount of teacher signal was added during the generation process, and a 100.0% success rate for the complex behavior "dance" was acquired after training with the QL algorithm. Voice-instruction learning and recognition also reached a 100.0% success rate in the experiments.

5. Acknowledgements

A part of this work was supported by a Grant-in-Aid for Scientific Research (JSPS 23500181) and the Foundation for the Fusion of Science and Technology (FOST).

REFERENCES

[1] V. I. Pavlovic, R. Sharma and T. S. Huang, "Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, 1997, pp. 667-695.

[2] C. Nolker and H. Ritter, "Parametrized SOMs for Hand Posture Reconstruction", Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN'00), 2000, pp. 139-144.

[3] G. Heidemann, H. Bekel, I. Bax and A. Saalbach, "Hand Gesture Recognition: Self-Organising Maps as a Graphical User Interface for the Partitioning of Large Training Data Sets", Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), 2004, pp. 487-490.

[4] C.-L. Huang, M.-S. Wu and S.-H. Jeng, "Gesture Recognition Using the Multi-PDM Method and Hidden Markov Model", Image and Vision Computing, Vol. 18, No. 11, 2000, pp. 865-879.

[5] R. Amit and M. Mataric, "Learning Movement Sequences from Demonstration", Proceedings of the 2nd IEEE International Conference on Development and Learning (ICDL'02), Cambridge, MA, 2002, pp. 203-208.
[6] M. Hossain and M. Jenkin, "Recognizing Hand-Raising Gestures Using HMM", Proceedings of the 2nd Canadian Conference on Computer and Robot Vision (CRV'05), 2005, pp. 405-412.

[7] G. Caridakis, K. Karpouzis, A. Drosopoulos and S. Kollias, "SOMM: Self Organizing Markov Map for Gesture Recognition", Pattern Recognition Letters, Vol. 31, No. 1, 2010, pp. 52-59.

[8] H. Suk, B. Sin and S. Lee, "Recognizing Hand Gestures Using Dynamic Bayesian Network", Proceedings of the 8th IEEE International Conference on Automatic Face & Gesture Recognition (FG'08), 2008, pp. 1-6.

[9] J. Tani, "Learning to Generate Articulated Behavior through the Bottom-Up and the Top-Down Interaction Process", Neural Networks, Vol. 16, 2003, pp. 11-23.

[10] J. Tani and M. Ito, "Self-Organization of Behavioral Primitives as Multiple Attractor Dynamics: A Robot Experiment", IEEE Transactions on Systems, Man, and Cybernetics, Vol. 33, No. 4, 2003, pp. 481-488.

[11] J. Tani, M. Ito and Y. Sugita, "Self-Organization of Distributedly Represented Multiple Behavior Schemata in a Mirror System: Reviews of Robot Experiments Using RNNPB", Neural Networks, Vol. 17, 2004, pp. 1273-1289.

[12] T. Kuremoto, Y. Kinoshita, L. B. Feng, S. Watanabe, K. Kobayashi and M. Obayashi, "A Gesture Recognition System with Retina-V1 Model and One-Pass Dynamic Programming", Neurocomputing, 2012, in press. doi: 10.1016/j.neucom.2012.03.27

[13] G. Rizzolatti and L. Craighero, "The Mirror-Neuron System", Annual Review of Neuroscience, Vol. 27, 2004, pp. 169-192.

[14] E. Oztop, M. Kawato and M. Arbib, "Mirror Neurons and Imitation: A Computationally Guided Review", Neural Networks, Vol. 19, 2006, pp. 254-271.

[15] R. S. Sutton and A. G. Barto, "Reinforcement Learning: An Introduction", The MIT Press, Cambridge, 1998.

[16] J. L. Elman, "Finding Structure in Time", Cognitive Science, Vol. 14, 1990, pp. 179-211.

[17] D. E. Rumelhart, G. E. Hinton and R. J. Williams, "Learning Representations by Back-Propagating Errors", Nature, Vol. 323, 1986, pp. 533-536.

[18] M. I. Jordan, "Attractor Dynamics and Parallelism in a Connectionist Sequential Machine", IEEE Computer Society Neural Networks Technology Series, 1990, pp. 112-127.