J. Intelligent Learning Systems & Applications, 2010, 2: 80-85
doi:10.4236/jilsa.2010.22011 Published Online May 2010 (http://www.SciRP.org/journal/jilsa)
Copyright © 2010 SciRes. JILSA

An Experience Based Learning Controller

Debadutt Goswami, Ping Jiang*
Department of Computing, University of Bradford, Bradford, UK.
Email: p.jiang@bradford.ac.uk

Received November 10th, 2009; revised April 10th, 2010; accepted April 20th, 2010.

ABSTRACT

Autonomous mobile robots must be flexible enough to learn new, complex control behaviours in order to adapt effectively to a dynamic and varying environment. The approach proposed in this paper creates a controller that learns complex behaviours by incorporating learning from demonstration, which reduces the search space, and by improving the demonstrated task geometry through trial and correction. The task faced by the robot contains uncertainty that must be learned. Simulation results indicate that after a handful of trials the robot learns the right policies, avoids the obstacles and reaches the goal.

Keywords: Mobile Robots, Learning from Demonstration, Neural Network Control, Reinforcement Learning

1. Introduction

Most autonomous or human-controlled mobile robots working in industrial, domestic and entertainment environments lack the flexibility to learn and perform new complex tasks. In this paper we consider the problem of motion control of an autonomous mobile robot using a control field generated by a neural controller. The proposed system enables the robot to learn, modify and adapt its skills efficiently so that it can react adequately in complex and unstructured environments. To give robots this precision and flexibility, we incorporate three very common natural approaches that humans use to learn and execute a task: 1) learning from demonstration, 2) generalization of the learned task, and 3) reinforcement of the task by trial and error [1,2]. The design of a motion controller depends on the kinematic constraints of the mobile robot and on the complexity and structure of the task, and is therefore difficult, whereas the proposed controller can easily and quickly learn complex controls from a few simple demonstrations and trials. The major advantage of our technique is that any mobile robot can be taught to move in an unknown environment using the generated control field as the general control law. The control law is independent of both physical and virtual sensors. The use of domain knowledge greatly reduces the learning time, and the trial-and-error learning method improves the learning continuously [3]. Learning from demonstration has reduced robot programming effort and made it very simple to use [4,5]. Reinforcement learning has already been used in many navigational and reactive problems [6,7,8]. The ability of reinforcement learning to interact with the environment, combined with the domain knowledge obtained from demonstrations, makes it easier to tackle uncertainty. Owing to its simplicity of operation and its flexibility in learning uncertainty, the proposed controller can be operated by non-professionals who are not skilled enough to control and program sophisticated mobile robots in complex industrial environments [3,9]. The paper is organized as follows. Section 2 reviews related work. Section 3 presents the controller architecture and the proposed algorithm.
Section 4 presents the simulation results. Section 5 concludes the paper and discusses future work.

2. Related Works

Our approach combines two different research areas, learning from demonstration and reinforcement learning, and therefore bears some relation to a number of previous works. The importance of human-robot interaction and the need for modern artificial intelligence are described in [10], which focuses on creating new possibilities for flexible, interactive robots from both an engineering and a humanistic point of view. This paper builds on these ideas and focuses on a flexible learning controller that lets a robot learn and adapt to changes in the environment very easily. The proposed methodology differs from [3,11,12] and [5] in two fundamental aspects. Firstly, our approach learns from demonstration, which acts as the initial policy for performing the desired task and builds a model of a partially observable environment. Secondly, the methodology is offline, and online computations are carried out on the basis of the generated control field and previous experience. The proposed controller initializes its weights from the demonstrated points of the task, whereas in [11] the learner is initially empty. The proposed controller uses Q-values to avoid uncertainty, whereas [3] and [11] used fuzzy behaviours as reflexes for acting in a new situation. The aim is to generate an abstract description of the task that reflects the user's intention and models the problem solution offline. In this context we also discuss the issues of evaluation, tuning, suitable generalization and execution of the elementary skills [9,13-15]. Instead of using radial basis functions as function approximators [3,12], we use inverse distance interpolation, which works like a local approximator. We also use learning through feedback, which improves the learning and generalization continuously.

3. Controller Architecture

The proposed learning controller can be represented as a recurrent neural network architecture (Figure 1). The controller has three layers, an input layer, a hidden layer and an output layer, plus a reinforcement controller connected to the output layer and the hidden layer. $S \in R^{N_S}$ and $A \in R^{N_A}$ are the sensor input and the control output. The weights are updated as the controller tries to learn an unknown model from sensors to actuators, i.e.

$$A = f(S) \qquad (1)$$

Input layer: The input layer consists of several neural input nodes that represent a state in the state space S. The sensor input is propagated from the input layer through the hidden layer to the output layer. Each neural node in the input layer is connected to all the neural nodes in the hidden layer.

Hidden layer: The hidden layer is composed of several computational neurons. The output of a hidden neuron is the distance between the input and its weight $W_i$, similar to a prototype fuzzy rule. The weight of a hidden layer neuron represents a centre, i.e. a demonstrated data point. The output of the hidden layer therefore codes the distances of the input from the hidden layer centres.

Output layer: The output layer neuron computes the inverse weighted sum of the outputs $v_i$ associated with the hidden layer neurons, i.e.

$$A = \sum_{i=1}^{N} W_{oi} v_i \qquad (2)$$

The output layer neuron uses inverse distance interpolation to determine the weights $W_{oi}$ of the output layer.

Figure 1. Neural network control architecture
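To make the layer computations concrete, the following Python sketch shows one possible reading of the forward pass: the hidden layer returns the distances of the current sensor state from the demonstrated centres, and the output layer combines the prototyped actions with normalized inverse-distance weights $W_{oi}$ as in (2). The function and variable names, the sample data and the small eps guard against division by zero are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def control_action(s, centres, actions, p=2, eps=1e-9):
    """Hedged sketch of the forward pass.
    Hidden layer: distances d_i from sensor state s to the demonstrated centres W_i.
    Output layer: A = sum_i W_oi * v_i with W_oi proportional to 1 / d_i^p."""
    d = np.linalg.norm(centres - s, axis=1)     # hidden layer outputs (distances)
    w = 1.0 / (d ** p + eps)                    # unnormalized inverse-distance weights
    W_o = w / w.sum()                           # output layer weights W_oi
    return W_o @ actions                        # estimated control action A

# Toy usage: two demonstrated (state, action) pairs in a 2-D workspace.
centres = np.array([[1.0, 1.0], [3.0, 2.0]])    # demonstrated sensor states
actions = np.array([[0.0, 1.0], [1.0, 0.0]])    # corresponding demonstrated actions
print(control_action(np.array([2.0, 1.5]), centres, actions))
```

The query state here is equidistant from both centres, so the estimated action is simply their average; moving the query closer to one centre shifts the output towards that centre's demonstrated action.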
The advantage of using this technique is that the estimated data point is influenced by all the neighbouring data points, depending on their distance from each prototype rule. This is the simplest generalization technique and is mathematically expressed as

$$v' = \frac{\sum_{i=1}^{N(v')} \frac{v_i}{d^p}}{\sum_{i=1}^{N(v')} \frac{1}{d^p}} \qquad (3)$$

where $v'$ is the estimated control action for a sensor state $S'$; $v_i$ is the prototyped action of the $i$-th hidden layer neuron at state $S$; $d$ is the distance between $S$ and $S'$; $p$ is the power; and $N(v')$ is the number of data points in the effective window of $v'$. Generally the distance is calculated with the simple Euclidean distance formula.

The above generalization technique acts only as a function approximation that reproduces the generalized control field based on the distance from the existing prototype rules. To tackle uncertainty and produce an improved control field, the robot also needs a weight that represents the significance of its previous learning history and incorporates continuous learning experience. The idea is to improve the learning by trial and error, so Q-learning [16] is incorporated, which produces a Q-matrix based on the scalar reward R. The Q-value determines the new policy for achieving the goal. In the neural network architecture the output layer is therefore connected to the reinforcement controller, which in turn is connected to the hidden layer. Depending on the achievement of the desired goal, the reinforcement reward R is given to the robot by the reinforcement controller. The reward R updates the Q-matrix, and the new Q-value is used to generate a new control field. The reinforcement controller follows an online exploration and offline learning policy [17]. The Q-value is thus updated in two cases:

Case 1: Online exploration. During online exploration the robot uses the Q-matrix to avoid uncertainty such as unknown obstacles. Whenever the robot encounters uncertainty in its next course of action, control switches from the generalized control field to the Q-matrix and selects the action with the next maximum Q-value within its effective window, which corresponds to the policy with the next highest probability of reaching the goal while avoiding the uncertainty. This earns an online reward R and updates the Q-value online in that time step. The changed action in that time step is stored as a new prototype rule, i.e. a centre in the hidden layer.

Case 2: Offline learning policy. At the end of a trial, the reinforcement reward R is given to the robot offline. The reward R updates the corresponding Q-values of the trajectory based on the final payoff. After the Q-values are updated, the change is applied to the neurons in the hidden layer of the network architecture.

Another factor influencing the output of the hidden layer is therefore the Q-value. The Q-value is used as a measure of the relevance of the data points, i.e. how useful that particular task geometry is for achieving the goal. A change in the Q-matrix influences all the neighbouring data points in the control field; this is called generalization. The control field is then transferred back to the robot. The Q-matrix is updated in every trial and reflects the continuous learning. The steps are repeated until the desired control field is obtained. This technique of learning is known as online exploration and offline policy learning.
The reward scheme for online exploration and offline policy learning is as follows:

$$R = \begin{cases} +1 & \text{if the goal is achieved or an obstacle is avoided} \\ 0 & \text{if there is no obstacle and the goal is not yet reached} \\ -1 & \text{if the robot fails to achieve the goal or to avoid an obstacle} \end{cases} \qquad (4)$$

The reward function can be represented as

$$R = R_{goal} + R_{uncertainty} \qquad (5)$$

In online exploration, $R_{goal}$ is set to zero and $R_{uncertainty}$ is awarded online for avoiding obstacles on the path, whereas in the offline learning policy $R_{goal}$ is awarded offline and $R_{uncertainty}$ is set to zero.

The Q update rule can be described in two steps.

Step 1: Give the reward R using (4) and (5) and update the weight $Q_w$ using the weight update equation

$$Q_w(s,a) \leftarrow Q_w(s,a) + \alpha \left[ R(s,a) + \gamma \max_{a'} Q_w(s',a') - Q_w(s,a) \right] \qquad (6)$$

Here $Q_w(s,a)$ is the weight matrix to be updated; $(s,a)$ is the current state and action; $R(s,a)$ is the reward matrix; $\max_{a'} Q_w(s',a')$ is the highest Q-value of the future state $s'$; $\alpha$ is the learning rate; and $\gamma$ is the discount factor in the range [0, 1]. The weight update Equation (6) is inherited from the traditional Q-learning algorithm [16].

Step 2: Estimate the data points based on the $Q_w$ calculated in step 1. The new estimate updates the previously generated control field and produces a new generalized control field. To implement this idea we modify (3) by introducing $Q_w$:

$$v' = \frac{\sum_{i=1}^{N(v')} \frac{Q_w}{d^p} v_i}{\sum_{i=1}^{N(v')} \frac{Q_w}{d^p}} \qquad (7)$$

Here $\frac{Q_w}{d^p}$ is the weight. A small $Q_w$ value represents a small contribution of the point towards accomplishing the goal, whereas a large $Q_w$ value signifies a large contribution. The weight is inversely related to the distance, so it is higher when the estimated data point is closer to the good trajectories. Trajectories with lower $Q_w$ values contribute less to the overall generalization.

In the proposed context, behaviour is improved based on scalar rewards from a critic, so no person has to program the desired actions for different situations. However, several difficulties stand in the way. Firstly, a robot using the basic reinforcement learning framework might require an extremely long time to converge to the right track and acquire the right action. In our framework, learning from demonstration reduces the search space needed to achieve adequate control and removes the need to program the desired behaviour: the demonstration provides domain knowledge that hints at how to perform the desired task and reduces the knowledge base significantly, and through trial and error the knowledge base can be improved further [18,19]. Secondly, reinforcement learning is usually applied to tasks where the state of the environment is completely observable, but in real-world tasks most of the knowledge is incomplete and inaccurate. The online exploration and offline learning policy explores the unknown environment with the help of the generalized control field and tackles uncertainty by selecting the control action with the maximum experience, i.e. the maximum Q-value in the effective window, for achieving the desired goal. The trajectory generated in this way changes the control field and tackles the uncertainty in the environment.
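As a concrete reading of Step 1 and Step 2, the following Python sketch applies the Q-learning update of (6) to a single weight and then regenerates a control action with the Q-weighted inverse distance interpolation of (7). The function names, the sample values and the eps guard are illustrative assumptions of ours, not the authors' code.

```python
import numpy as np

def q_update(q_sa, reward, q_next_max, alpha=0.5, gamma=0.9):
    """Step 1, Equation (6): traditional Q-learning update of the weight Q_w(s, a)."""
    return q_sa + alpha * (reward + gamma * q_next_max - q_sa)

def q_weighted_estimate(s_query, centres, actions, q_w, p=2, eps=1e-9):
    """Step 2, Equation (7): Q-weighted inverse distance interpolation.
    Points that are close to the query state and carry a high Q_w (useful past
    experience) dominate the estimated control action v'."""
    d = np.linalg.norm(centres - s_query, axis=1)
    w = q_w / (d ** p + eps)                 # weights Q_w / d^p
    return (w[:, None] * actions).sum(axis=0) / w.sum()

# Toy usage: the second prototype earns a higher Q_w after a successful trial,
# so it pulls the estimated action towards its own heading.
centres = np.array([[1.0, 1.0], [3.0, 2.0]])
actions = np.array([[0.0, 1.0], [1.0, 0.0]])
q_w     = np.array([0.2, 0.9])
q_w[1]  = q_update(q_w[1], reward=1.0, q_next_max=0.9)   # offline reward along a good path
print(q_weighted_estimate(np.array([2.0, 1.5]), centres, actions, q_w))
```

With a negative reward from (4), the same update shrinks $Q_w$ and the corresponding prototype loses influence in (7).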
Ideally, for a real robot, any learning algorithm should be feasible and efficient enough to perform the required computations, keeping the robot's internal configuration in mind. A robot has limited memory and limited computational ability. Almost all cases of practical interest involve far more state-observation pairs than could possibly be stored in memory, and the robot is typically unable to perform enough computation per time step to use them fully. This is especially true of real-world experience, which consists of a large number of action-observation pairs that the robot would have to memorize and learn at every step. A good approach is therefore to compute and store online only a limited amount of relevant data and to learn the policy offline. In the proposed algorithm, old data can be reused and new experiences, which are characteristic of the task, can be collected. The key benefits are experience replay and the assessment of the performance of previous experiences. Another important step is the generalization of the learned policy, which keeps the learning continuous. This kind of learning is generally called learning from experience.

For a robot controller it is also important to reduce the real-time computation. The robot acts on the generated control field only, so learning is continuous and the computational overhead is reduced. Exploration, on the other hand, is done online. The robot decides between exploration and exploitation based on the number of times an action has been chosen. The robot generally chooses the action with the highest state-action value with high probability, which is known as exploitation. If the robot oscillates in a particular region, exploitation is stopped and control switches to exploration to escape the oscillation and find a new path. During exploration the selection of the next state is made at random. This has the major advantage of not being driven by limited knowledge alone, i.e. it avoids local optima; during exploration the robot can take control actions it has never taken before and thus explore more of the unobserved environment. The balance between exploration and exploitation is necessary because exploitation helps the robot to avoid uncertainty and reach the goal, whereas exploration avoids oscillation and unnecessary iterations. This methodology therefore offers a very practical way to learn a new skill quickly: the robot can easily be retrained to adapt to environmental changes, improving its performance all the time.

The proposed algorithm can be summarized as follows (a toy end-to-end sketch is given after the list):

1) Build the control field by demonstrating the task, i.e. initialize the weights in the hidden layer with the demonstrated data points, imitating the relevant knowledge required to achieve the task.

2) Based on the demonstrated control field, generalize the field using the inverse distance weighted interpolation (7).

3) Transfer the generalized control field to the robot and perform some trials to achieve the goal, including some exploration. If all trials succeed in reaching the goal, the demonstrated control field is adequate and needs no changes. If during a trial the robot hits or senses an obstacle at some point, give the reward $R_{uncertainty}$ to that point using (4) and (5) and move towards the obstacle-free point with the next highest Q-value; the next highest Q-value represents the next most experienced and favourable point heading towards the goal. Then update the Q-values using (6). This generates a new path to reach the goal. After the goal has been reached, generalize the control field offline using (7); the field is influenced by the updated points, which reproduces an updated control field.

4) After reaching the goal, assign the final reward $R_{goal}$ to the robot using (4) and (5). This reward represents the successful completion of the desired task with all obstacles avoided, and is given offline. Update all Q-values of the updated control trajectory generated in step 3 using (6).

5) Generalize the control field with the new updates using (7); the updated field becomes the new control field. Repeat steps 3 to 5 until the robot learns the desired task.
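The toy below walks through steps 1-5 on a 5 × 5 grid of our own invention (not the paper's 20 × 20 workspace), with one obstacle that was not present in the demonstration. The grid, the move set, the reward values and all names are illustrative assumptions; the sketch only mirrors the structure of the algorithm, and step 5's full regeneration of the field with (7) is abbreviated to keeping the corrected moves.

```python
# Toy walk-through of steps 1-5.  The goal is at (4, 4); an obstacle at (2, 2)
# did not appear in the demonstration.  The control field maps a cell to a move;
# the Q-matrix stores the usefulness of (cell, move) pairs.
MOVES = [(0, 1), (1, 0), (0, -1), (-1, 0)]
GOAL, OBSTACLE, ALPHA, GAMMA = (4, 4), (2, 2), 0.5, 0.9

# Step 1: demonstrated control field -- go right to the last column, then down.
field = {(r, c): (0, 1) if c < 4 else (1, 0) for r in range(5) for c in range(5)}
# Step 2 would generalize this field with (7); the demonstration already covers
# every cell of this toy grid, so the field is used as-is.
Q = {((r, c), m): 0.0 for r in range(5) for c in range(5) for m in MOVES}

def admissible(cell, m):
    nr, nc = cell[0] + m[0], cell[1] + m[1]
    return 0 <= nr < 5 and 0 <= nc < 5 and (nr, nc) != OBSTACLE

for trial in range(3):                                    # steps 3-5: trial and error
    cell, path = (2, 0), []
    while cell != GOAL:
        m = field[cell]
        if not admissible(cell, m):                       # uncertainty met online (step 3)
            m = max((x for x in MOVES if admissible(cell, x)),
                    key=lambda x: Q[(cell, x)])           # next highest Q-value
            nxt = (cell[0] + m[0], cell[1] + m[1])
            Q[(cell, m)] += ALPHA * (1.0 + GAMMA * max(Q[(nxt, x)] for x in MOVES)
                                     - Q[(cell, m)])      # online R_uncertainty = +1, update (6)
            field[cell] = m                               # corrected move becomes a new centre
        path.append((cell, m))
        cell = (cell[0] + m[0], cell[1] + m[1])
    for s, a in path:                                     # step 4: offline R_goal = +1 along the path
        nxt = (s[0] + a[0], s[1] + a[1])
        Q[(s, a)] += ALPHA * (1.0 + GAMMA * max(Q[(nxt, x)] for x in MOVES) - Q[(s, a)])
    # Step 5: regenerate the control field; here the corrected moves already suffice.

print(field[(2, 1)], round(Q[((2, 1), (1, 0))], 3))       # the detour learned around the obstacle
```

In the first trial the demonstrated move at (2, 1) runs into the new obstacle, so the online branch substitutes the next-best action and stores it as a corrected prototype; later trials follow the corrected field without needing any further reinforcement, mirroring the behaviour reported in the simulations below.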
4. Simulation Results

In simulation we built a discrete workspace with a dimension of 20 × 20. The workspace contains a goal and obstacles, represented by 'G' and by rectangles, and the arrows represent the direction of the field at each point. The robot has a 3 × 3 sensing neighbourhood, which means it can sense obstacles in its 8 neighbouring cells. The simulation follows the algorithm proposed in Section 3. The robot was initially demonstrated to reach the goal, and the initial control field was built from this demonstration (Figure 2). The simulation considered two different tasks with different uncertain complexities but the same initial control field.

Task 1: The robot had to avoid an obstacle and reach the goal.

Task 2: The robot had to avoid obstacles, pass through a door and reach the goal.

In both cases the robot completed the task within a few trials, avoiding all the uncertainties. The final control fields for task 1 and task 2 are shown in Figures 3 and 4.

Figure 2. Demonstrated control field

Figure 3. Final control field of task 1

Figure 4. Final control field of task 2

The learning history of the controller for both cases is described in Figures 5-7, where the broken line and the solid line represent the first and second task respectively. Figure 6 shows that the number of hidden neurons increases with the number of trials. An increase in hidden layer neurons indicates an increase in the number of centres, or prototype rules, in the network, and clearly reflects the learning in each trial. Figure 7 plots the total number of reinforcements used in each trial, where the total reinforcement is defined as the sum of the immediate reinforcement the learner receives until the robot reaches the goal and the offline reinforcement it receives after reaching the goal. It is clear from Figure 7 that after trial 3 the controller did not use any reinforcements to complete the tasks; the controller needed only trials 1-3 to learn a new control law for avoiding uncertainty in both tasks. To test the reliability of the control field we chose different starting points, and the robot was successful in completing the tasks from all of them.

Figure 5. Number of steps taken to reach the goal in each trial

Figure 6. Number of hidden layer neurons used in each trial

Figure 7. Number of reinforcements given in each trial

Another observation is that the number of reinforcements received (Figure 7) and the number of steps taken (Figure 5) in trial 1 of task 2 (solid line) are very high. This indicates that in task 2 the maximum exploration was done in trial 1, whereas in task 1 exploration increased with the trials; with increased exploration, the reinforcement also increases.
Furthermore, it was during the first three trials that the robot received reinforcements (Figure 7) and showed a sharp growth in the network (Figure 6).

5. Conclusions

This paper presented a learning strategy to generate a control field for a mobile robot in an unknown and uncertain environment, which integrates learning, generalization, exploration and offline computation into a unified architecture. After the learning, a robot can approach the goal by following the control field. The important lessons learned from the implementation are: 1) imitation of a very accurate and exact action sequence is not necessary [15]; 2) prior knowledge is required to plan a model of the task and support rapid learning; 3) generalization is improved by improving the learning policies; 4) a simple method like inverse distance interpolation is adequate to generalize the task; 5) offline learning is an important method for real-time applications because it avoids large online computations; 6) online exploration is required to explore other ways of performing a task and to improve the quality of learning; 7) a balance between exploration and exploitation improves the learning policies, which reduces the learning time significantly; 8) the robot can learn from few demonstrations, but this affects the learning speed. The major disadvantage of this method is its reliance on position information, which is not always accurate on real robots because of rotational and translational errors caused by slippage between the robot's wheels and the floor.

REFERENCES

[1] M. N. Nicolescus and M. J. Mataric, "Natural Methods for Robot Task Learning: Instructive Demonstrations, Generalization and Practice," Proceedings of the Second International Joint Conference on Autonomous Agents and Multi-Agent Systems, Melbourne, Australia, July 14-18, 2003, pp. 241-248.

[2] M. N. Nicolescus and M. J. Mataric, "Learning and Interacting in Human-Robot Domains," IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, Vol. 31, No. 5, 2001, pp. 419-430.

[3] G. Hailu, "Symbolic Structures in Numeric Reinforcement for Learning Optimum Robot Trajectory," Robotics and Automation Systems, Vol. 37, No. 1, 2001, pp. 53-68.

[4] D. C. Bentivegna, C. G. Atkeson and G. Cheng, "Learning Tasks from Observation and Practice," Robotics and Automation Systems, Vol. 47, No. 2-3, 2004, pp. 163-169.

[5] G. Hailu and G. Sommer, "Learning by Biasing," IEEE International Conference on Robotics and Automation, Leuven, Belgium, 1998, pp. 16-21.

[6] J. R. Millan, "Reinforcement Learning of Goal Directed Obstacle Avoiding Reaction Strategies in an Autonomous Mobile Robot," Robotics and Autonomous Systems, Vol. 15, No. 4, 1995, pp. 275-299.

[7] A. Johannet and I. Sarda, "Goal-Directed Behaviours by Reinforcement Learning," Neurocomputing, Vol. 28, No. 1-3, 1990, pp. 107-125.

[8] P. M. Bartier and C. P. Keller, "Multivariate Interpolation to Incorporate Thematic Surface Data Using Inverse Distance Weighting (IDW)," Computers & Geosciences, Vol. 22, No. 7, 1996, pp. 795-799.

[9] H. Friedrich, M. Kaiser and R. Dillmann, "What Can Robots Learn from Humans," Annual Reviews in Control, Vol. 20, 1996, pp. 167-172.

[10] S. Thrun, "An Approach to Learning Mobile Robot Navigation," Robotics and Automation Systems, Vol. 15, 1995, pp. 301-319.

[11] M. Kasper, G. Fricke, K. Steuernagel and E. von Puttkamer, "A Behaviour Based Learning Mobile Robot Architecture for Learning from Demonstration," Robotics and Automation Systems, Vol. 34, No. 2-3, 2001, pp. 153-164.
[12] W. Ilg and K. Berns, "A Learning Architecture Based on Reinforcement Learning for Adaptive Control of the Walking Machine LAURON," Robotics and Automation Systems, Vol. 15, No. 4, 1995, pp. 321-334.

[13] H. Friedrich, M. Kaiser and R. Dillmann, "PBD - The Key to Service Robot Programming," AAAI Technical Report SS-96-02, American Association for Artificial Intelligence, 1996.

[14] H. Friedrich, M. Kaiser and R. Dillmann, "Obtaining Good Performance from a Bad Teacher," Programming by Demonstration vs. Learning from Examples Workshop at ML'95, Tahoe, 1995.

[15] R. Dillmann, M. Kaiser and A. Ude, "Acquisition of Elementary Robot Skills from Human Demonstration," International Symposium on Intelligent Robotics Systems, Pisa, Italy, 1995, pp. 185-192.

[16] R. S. Sutton and A. G. Barto, "Reinforcement Learning: An Introduction," The MIT Press, Cambridge, Massachusetts, 1998.

[17] B. Bakker, V. Zhumatiy, G. Gruener and J. Schmidhuber, "A Robot that Reinforcement-Learns to Identify and Memorize Important Previous Observations," Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, USA, October 27-31, 2003, pp. 430-435.

[18] A. Billard, Y. Epars, S. Calinon, S. Schaal and G. Cheng, "Discovering Optimal Imitation Strategies," Robotics and Automation Systems, Vol. 47, No. 2-3, 2004, pp. 69-77.

[19] S. Russell and P. Norvig, "Artificial Intelligence: A Modern Approach," 2nd Edition, Prentice Hall, New Jersey, USA, 2002.