J. Intelligent Learning Systems & Applications, 2010, 2: 80-85
doi:10.4236/jilsa.2010.22011 Published Online May 2010 (http://www.SciRP.org/journal/jilsa)
Copyright © 2010 SciRes. JILSA
An Experience Based Learning Controller
Debadutt Goswami, Ping Jiang*
Department of Computing, University of Bradford, Bradford, UK.
Email: p.jiang@bradford.ac.uk
Received November 10th, 2009; revised April 10th, 2010; accepted April 20th, 2010.
ABSTRACT
Autonomous mobile robots must be flexible enough to learn new, complex control behaviours in order to adapt effectively to a dynamic and varying environment. The approach proposed in this paper is to create a controller that learns complex behaviours by incorporating learning from demonstration, which reduces the search space, and then improves the demonstrated task geometry by trial and correction. The task faced by the robot contains uncertainty that must be learned. Simulation results indicate that after a handful of trials the robot has learned the right policies, avoided the obstacles and reached the goal.
Keywords: Mobile Robots, Learning from Demonstration, Neural Network Control, Reinforcement Learning
1. Introduction
Most autonomous or human-controlled mobile robots working in industrial, domestic and entertainment environments lack the flexibility to learn and perform new, complex tasks. In this paper we consider the problem of motion control of an autonomous mobile robot using a control field generated by a neural controller. The proposed system enables the autonomous mobile robot to learn, modify and adapt its skills efficiently in order to react adequately in complex and unstructured environments. To give robots precision and flexibility, we incorporated three natural approaches that humans commonly use to learn and execute a task: 1) learning from demonstration; 2) generalization of the learned task; 3) reinforcement of the task by trial and error [1,2]. The design of a motion controller depends on the kinematic constraints of the mobile robot and on the complexity and structure of the task, and such a controller is very difficult to design by hand, whereas the proposed controller can easily and quickly learn complex controls from a few simple demonstrations and trials. The major advantage of our technique is that any mobile robot can be taught to move in an unknown environment using the generated control field as a general control law. The control law is independent of both physical and virtual sensors. The use of domain knowledge is of great significance in reducing the learning time, and the use of trial-and-error learning improves the learning continuously [3]. Learning from demonstration reduces the amount of robot programming and makes the system very simple to use [4,5]. Reinforcement learning has already been used in many navigational and reactive problems [6-8]. The ability of reinforcement learning to interact with the environment, combined with the domain knowledge obtained from demonstrations, makes it easier to tackle uncertainty. Because it is simple to operate and flexible enough to learn under uncertainty, the proposed controller can be operated by non-professionals who are not skilled enough to control and program sophisticated mobile robots in complex industrial environments [3,9].
The paper is organized as follows. Section 2 reviews previous work related to this field. Section 3 presents the controller architecture and the proposed algorithm. Section 4 presents the simulation results. Section 5 draws conclusions and discusses further work.
2. Related Works
We have incorporated two different research areas, learning from demonstration and reinforcement learning, which bear some relation to a number of previous works. The importance of human-robot interaction and the need for modern artificial intelligence are described in [10], which focuses on creating new possibilities for flexible, interacting robots from both an engineering and a humanistic point of view. This paper incorporates these ideas and focuses on a flexible learning controller that allows a robot to learn and adapt to changes in the environment very easily. The proposed methodology differs from [3,11,12] and [5] in two fundamental aspects. Firstly, our approach learns from demonstration, which acts as an initial policy to perform the desired task and builds a model of the partially observable environment. Secondly, the methodology learns offline, and online computations are carried out on the basis of the generated control field and previous experience. The proposed controller initializes its weights from the demonstrated points of the task, whereas in [11] the learner is initially empty. The proposed controller utilizes Q-values to avoid uncertainty, whereas [3] and [11] used fuzzy behaviour as a reflex to act in a new situation. The aim is to generate an abstract description of the task that reflects the user's intention and models the problem solution offline. In this context we also discuss the issues of evaluation, tuning, suitable generalization and execution of the elementary skills [9,13-15]. Instead of using radial basis functions for function approximation [3,12], we used inverse distance interpolation, which works like a local approximator. We also used learning through feedback, which improves the learning and generalization continuously.
3. Controller Architecture
The proposed learning controller can be represented as a recurrent neural network architecture. The controller basically has three layers, an input layer, a hidden layer and an output layer, plus a reinforcement controller which is connected to the output layer and the hidden layer. $S \in \mathbb{R}^{N_S}$ and $A \in \mathbb{R}^{N_A}$ are the sensor input and the actuator output. The weights are updated as the controller tries to learn an unknown model from sensors to actuators, i.e.

$$A = f(S) \qquad (1)$$
Input layer: The input layer consists of several neural input nodes that represent states in the state space $S$. The sensor input is propagated from the input layer through the hidden layer to the output layer. Each neural node in the input layer is connected to all the neural nodes in the hidden layer.
Hidden layer: The hidden layer is composed of several computational neurons. The output of a hidden neuron is the distance between the input and the neuron's weight, similar to a prototype fuzzy rule. The weight $W_i$ of a hidden layer neuron represents a centre, i.e. a demonstrated data point. Each neural node of the input layer is connected to all the neural nodes of the hidden layer. The output of the hidden layer codes the distance of the input neurons from the hidden layer centres.
Output layer: The output layer neuron computes the inverse weighted sum of the outputs $v_i$ of the hidden layer neurons, i.e.

$$A = \sum_{i=1}^{N} W_{oi} v_i \qquad (2)$$

The output layer neuron uses inverse distance interpolation to update the weights $W_{oi}$ of the output layer.
Figure 1. Neural network control architecture
The advantage of using this technique is that the estimated data point is influenced by all the neighbouring data points, depending on the distance from each prototype rule. This is the simplest generalization technique and is mathematically expressed as

$$v' = \frac{\sum_{i=1}^{N(v')} \dfrac{v_i}{d_i^{\,p}}}{\sum_{i=1}^{N(v')} \dfrac{1}{d_i^{\,p}}} \qquad (3)$$
where $v'$ is the estimated control action at a sensor state $S'$; $v_i$ is the prototyped action of hidden layer neuron $i$ at state $S_i$; $d_i$ is the distance between $S_i$ and $S'$; $p$ is the power; and $N(v')$ is the number of data points in the effective window of $v'$. Generally the distance can be calculated by the simple Euclidean distance formula. The above generalization technique only acts as a function approximation that reproduces the generalized control field based on the distance from the existing prototype rules. To tackle uncertainty and to produce an improved control field, however, the robot needs to incorporate a weight that represents the significance of the previous learning history and incorporates continuous learning experience. The idea is to improve the learning by trial and error, so Q-learning [16] is incorporated, which produces a Q-matrix based on the scalar reward $R$. The Q-value determines the new policy to achieve the goal.
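To make the generalization step concrete, the following is a minimal Python sketch of the inverse distance interpolation of (3) over demonstrated prototypes; the 2-D grid states, the power p = 2 and the fixed effective-window radius are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def generalize_action(s_query, proto_states, proto_actions, p=2.0, window=5.0):
    """Inverse distance weighted estimate of the control action at s_query,
    following (3): v' = sum_i(v_i / d_i^p) / sum_i(1 / d_i^p).

    proto_states  : (N, 2) array of demonstrated sensor states (hidden layer centres)
    proto_actions : (N, 2) array of the corresponding control actions
    window        : radius of the effective window N(v') (illustrative choice)
    """
    d = np.linalg.norm(proto_states - s_query, axis=1)

    # If the query coincides with a prototype, return that prototype's action.
    nearest = int(np.argmin(d))
    if d[nearest] < 1e-9:
        return proto_actions[nearest]

    # Effective window: keep only nearby prototypes (fall back to all if none are close).
    mask = d <= window
    if not np.any(mask):
        mask = np.ones_like(d, dtype=bool)

    w = 1.0 / d[mask] ** p                      # inverse distance weights
    return (w[:, None] * proto_actions[mask]).sum(axis=0) / w.sum()

# Example: three demonstrated grid points and their motion directions.
states = np.array([[2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
actions = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 0.0]])
print(generalize_action(np.array([3.0, 2.5]), states, actions))
```

In this reading the demonstrated points play the role of the hidden layer centres and the inverse distances act as the output layer weights of (2).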
Therefore in the neural network architecture, the output
layer is connected to the reinforcement controller which
in turn is connected to the hidden layer. Depending on the
achievement of the desired goal, the reinforcement re-
ward R is given to the robot by the reinforcement con-
troller. The reward R updates the Q-matrix and the new
Q-value is used to generate a new control field. The rein-
forcement controller follows the online exploration and
offline learning policy [17]. So the Q-value is updated in
the following two cases:
Case 1: Online exploration. During online exploration the robot uses the Q-matrix to avoid uncertainty such as unknown obstacles. Whenever the robot finds uncertainty in its next course of action, the control switches from the generalized control field to the Q-matrix and selects the action with the next highest Q-value from its effective window, which corresponds to a policy with the next highest probability of reaching the goal while avoiding the uncertainty. This earns an online reward R and updates the Q-value online in that time step. The changed action in that time step is used as a prototype rule, i.e. a centre in the hidden layer.
Case 2: Offline learning policy. At the end of a trial, during the offline learning policy, the reinforcement reward R is given to the robot offline. The reward R updates the corresponding Q-values of the trajectory based on the final payoff. After the Q-values are updated, the change is applied to the neurons in the hidden layer of the network architecture. So another factor that influences the output of the hidden layer is the Q-value.
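The switching behaviour of Case 1 can be sketched as follows; the 8-connected action set, the dictionary-style Q-matrix indexing and the sense_obstacle predicate are hypothetical details introduced only for illustration.

```python
# 8-connected action set on the grid (an illustrative assumption).
ACTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def choose_online_action(state, field_action, Q, sense_obstacle):
    """Case 1 (online exploration): follow the generalized control field unless the
    proposed move is uncertain (e.g. an unknown obstacle); then fall back to the
    Q-matrix and pick the admissible action with the next highest Q-value."""
    proposed = (state[0] + field_action[0], state[1] + field_action[1])
    if not sense_obstacle(proposed):
        return field_action                      # the control field is safe, keep following it

    # Rank the alternative actions of this state by Q-value, best first.
    ranked = sorted(range(len(ACTIONS)), key=lambda a: Q[state][a], reverse=True)
    for a in ranked:
        nxt = (state[0] + ACTIONS[a][0], state[1] + ACTIONS[a][1])
        if not sense_obstacle(nxt):
            return ACTIONS[a]                    # next most experienced admissible move
    return (0, 0)                                # no admissible move: stay put
```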
The Q-value is then used as a measure to calculate the
relevance of the data points in terms of usefulness of that
particular task geometry in achieving the goal. The change
in the Q-matrix influences all the neighbouring data
points in the control field. This is called the generaliza-
tion. After that the control field is transferred back to the
robot. In every trial the Q-matrix is updated, reflecting continuous learning. The steps are repeated until we get
the desired control field. This technique of learning is
known as online exploration and offline policy learning.
The reward scheme for online exploration and offline
policy learning is as follows:
$$R = \begin{cases} 1 & \text{if the goal is achieved or an obstacle is avoided} \\ 0 & \text{if there is no obstacle or no goal} \\ -1 & \text{if the robot fails to achieve the goal or is unable to avoid an obstacle} \end{cases} \qquad (4)$$
The reward function can be represented as

$$R = R_{uncertainty} + R_{goal} \qquad (5)$$

In online exploration $R_{goal}$ is set to zero and $R_{uncertainty}$ is awarded online to avoid obstacles on the path, whereas in the offline learning policy $R_{goal}$ is awarded offline and $R_{uncertainty}$ is set to zero.
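A compact sketch of this reward scheme, assuming the simulation supplies simple boolean outcome flags, could look as follows.

```python
def reward(goal_reached=False, goal_failed=False,
           obstacle_avoided=False, obstacle_hit=False, online=True):
    """R = R_uncertainty + R_goal as in (5), with each term scored as in (4):
    +1 for reaching the goal / avoiding an obstacle, -1 for failing to do so, 0 otherwise.
    Online exploration awards only R_uncertainty; offline learning awards only R_goal."""
    r_uncertainty = 1 if obstacle_avoided else (-1 if obstacle_hit else 0)
    r_goal = 1 if goal_reached else (-1 if goal_failed else 0)
    return r_uncertainty if online else r_goal
```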
The Q update rule can be described in two steps.
Step 1: Give the reward $R$ using (4) and (5) and update the weight $Q_w$ using the weight update equation

$$Q_w(s,a) \leftarrow Q_w(s,a) + \alpha \left[ R(s,a) + \gamma \max_{a'} Q_w(s',a') - Q_w(s,a) \right] \qquad (6)$$

Here $Q_w(s,a)$ is the weight matrix to be updated; $(s,a)$ is the current state and action; $R(s,a)$ is the reward matrix; $\max_{a'} Q_w(s',a')$ is the highest $Q_w$ value of the future state; $\alpha$ is the learning rate; $\gamma$ is the discount factor in the range [0, 1]. The weight update Equation (6) is inherited from the traditional Q-learning algorithm [16].
Step 2: Estimate the data points $v'$ based on the $Q_w$ calculated in Step 1. The new estimate updates the previously generated control field and generates a new generalized control field. To implement our idea we modify (3) by introducing $Q_w$:

$$v' = \frac{\sum_{i=1}^{N(v')} \dfrac{Q_{w_i} v_i}{d_i^{\,p}}}{\sum_{i=1}^{N(v')} \dfrac{Q_{w_i}}{d_i^{\,p}}} \qquad (7)$$
Here $Q_{w_i}/d_i^{\,p}$ is the weight. A small $Q_w$ value represents a small contribution of the point in accomplishing the goal, whereas a large $Q_w$ value signifies a large contribution of the point in accomplishing the goal. The weight is inversely related to the distance, so it has a higher value when the estimated data point is closer to the good trajectories. The trajectories with lower $Q_w$ values contribute less to the overall generalization.
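To illustrate the two-step update, the sketch below combines the tabular update of (6) with the Q-weighted re-generalization of (7); the dictionary-based Q-table, the per-prototype Q-weights and the default values of the learning rate, discount factor and power are assumptions made for illustration.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Step 1, Eq. (6): Q_w(s,a) <- Q_w(s,a) + alpha*[R + gamma*max_a' Q_w(s',a') - Q_w(s,a)]."""
    best_next = max(Q.get(s_next, {None: 0.0}).values())
    q_sa = Q.setdefault(s, {}).get(a, 0.0)
    Q[s][a] = q_sa + alpha * (r + gamma * best_next - q_sa)

def q_weighted_estimate(s_query, proto_states, proto_actions, proto_q, p=2.0):
    """Step 2, Eq. (7): inverse distance interpolation weighted by each prototype's Q-value,
    so trajectories with a low Q_w contribute little to the regenerated control field."""
    d = np.maximum(np.linalg.norm(proto_states - s_query, axis=1), 1e-9)
    w = proto_q / d ** p                        # Q_w / d^p acts as the interpolation weight
    return (w[:, None] * proto_actions).sum(axis=0) / w.sum()
```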
In the proposed context, behaviour is improved based on scalar rewards from a critic; it does not require a person to program the desired actions for different situations. However, several difficulties stand in the way. Firstly, a robot using the basic reinforcement learning framework might require an extremely long time to converge to the right track and to acquire the right actions. In our framework, learning from demonstration is used to reduce the search space needed to achieve adequate control, and it does not need any programming of the desired behaviour. The demonstration provides domain knowledge that gives hints for performing the desired task and reduces the knowledge base significantly. Through trial and error the knowledge base can be improved further [18,19]. Secondly, reinforcement learning is usually applied to tasks where the state of the environment is completely observable, but in real-world tasks much of the knowledge is incomplete and inaccurate. The online exploration and offline learning policy explores the unknown environment with the help of the generalized control field and tackles the uncertainty by selecting the appropriate control action that carries the maximum experience, i.e. the maximum Q-value in the effective window, in achieving the desired goal. The trajectory generated in this way changes the control field and tackles the uncertainty in the environment.
Ideally, for a real robot, any learning algorithm should be feasible and efficient enough to perform the complex computations, keeping the robot's internal configuration in mind. A robot has limited memory and limited computational ability. Almost all cases of practical interest involve far more state-observation pairs than could possibly be entered into memory, and the robot is typically unable to perform enough computation per time step to fully use them. This is especially true for real-world experience, which consists of a large number of action-observation pairs that the robot must memorize and learn at each step. A good approach is therefore to compute and store online only a limited number of relevant data, and to learn the policy offline. In the proposed algorithm the old data can be reused and new experiences can be collected, which are characteristic of the task. The key benefits are experience replay and the assessment of the performance of previous experiences. Another important step is the generalization of the learned policy; a policy learned in this way keeps the learning continuous. This kind of learning is generally called learning from experience. For a robot controller it is also important to reduce the real-time computation. The robot takes actions based on the generated control field only. Hence learning is made continuous and the computational overhead is also reduced.
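The paper does not prescribe a particular data structure for this; one possible minimal sketch of a bounded online experience store with offline replay is given below, where the injected update_fn stands in for the Q-update of (6).

```python
from collections import deque

class ExperienceStore:
    """Keep only a bounded number of (s, a, r, s') tuples online and replay them
    offline, in the spirit of online exploration with offline policy learning."""

    def __init__(self, capacity=500):
        self.buffer = deque(maxlen=capacity)     # oldest experiences are discarded first

    def record(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def replay_offline(self, update_fn, passes=3):
        # Re-assess the stored experiences several times once the trial has ended.
        for _ in range(passes):
            for s, a, r, s_next in self.buffer:
                update_fn(s, a, r, s_next)
```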
On the other hand, the exploration is done online. The robot decides between exploration and exploitation based on the number of times an action has been chosen. The robot generally chooses the action with the highest state-action value with high probability, which is known as exploitation. If the robot oscillates in a particular region, exploitation is stopped and control switches to exploration to avoid the oscillation and find a new path. During exploration the selection of the next state is made at random. This approach has the major advantage of not being influenced by the limited knowledge only, i.e. it avoids local optima. During exploration the robot can take control actions that were never taken before and hence explores more of the unobserved environment. A balance between exploration and exploitation is necessary, because exploitation helps the robot to avoid uncertainty and reach the goal, whereas exploration avoids oscillations and unnecessary iterations. This methodology therefore offers a very practical way to learn any new skill quickly. The robot can easily be retrained to adapt to environment changes, improving its performance all the time.
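The switching criterion is described only qualitatively; one hedged way it could be realized, using visit counts and a simple oscillation test (both illustrative assumptions), is sketched below.

```python
import random
from collections import Counter

def select_action(state, q_values, actions, visit_count, recent_states, osc_window=8):
    """Exploit the action with the highest state-action value most of the time, but
    switch to random exploration when the robot oscillates in a small region.
    visit_count is assumed to be a collections.Counter of state visits."""
    visit_count[state] += 1

    # Oscillation test: if one cell dominates the recent history, stop exploiting.
    recent = Counter(recent_states[-osc_window:]).most_common(1)
    if recent and recent[0][1] > osc_window // 2:
        return random.choice(actions)            # exploration: pick the next move at random

    # Exploitation, with occasional exploration for rarely visited states.
    eps = 1.0 / (1.0 + visit_count[state])
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))
```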
The proposed algorithm can be summarized as follows:
1) Build the control field by demonstrating the task, i.e.
initializing the weights in the hidden layer with the dem-
onstrated data points and imitating the relevant knowl-
edge required to achieve the task.
2) Based on the demonstrated control field, using in-
verse distance weight interpolation, generalize the field
using (7).
3) Transfer the generalized control field to the robot
and do some trials to achieve the goal. Do some explora-
tion also. If all the trials are successful in achieving the
goal then the demonstrated control field is perfect and
does not need any changes. If in the trials the robot hits or senses an obstacle at some point, then give the reward $R_{uncertainty}$ to that point using (4) and (5) and move towards the point that has the next highest Q-value without any obstacle. The next highest Q-value represents the next most experienced and favourable point heading towards the goal. Then update the Q-values using (6). This will generate a new path to reach the goal. After the goal has been reached, generalize the control field offline using (7). The control field will be influenced by the updated points. This will reproduce an updated control field.
4) After reaching the goal, assign the final reward $R_{goal}$
to the robot using (4) and (5). This reward represents the
successful completion of the desired task avoiding all the
obstacles, and is given offline. Update all the Q-values of
the updated control trajectory generated in step 3 using
(6).
5) Generalize the control field with the new updates
using (7). The updated field represents the new control
field. Repeat Steps 3 to 5 until the robot learns the
desired task.
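Putting steps 1)-5) together, a high-level sketch of the learning loop is given below. Every callable passed in (demonstrate, generalize_field, run_trial, update_q_offline) is a placeholder standing in for the corresponding step above, not an implementation taken from the paper.

```python
def learn_control_field(env, demonstrate, generalize_field, run_trial, update_q_offline,
                        max_trials=20):
    """Steps 1)-5) wired together; all callables are injected placeholders."""
    prototypes = demonstrate(env)                # 1) demonstration -> hidden layer centres
    field = generalize_field(prototypes)         # 2) initial generalized control field

    for _ in range(max_trials):
        # 3) run a trial; online rewards and Q-based detours happen inside run_trial
        trajectory, reached_goal, used_reinforcement = run_trial(env, field, prototypes)
        if reached_goal and not used_reinforcement:
            return field                         # no corrections were needed: field learned
        # 4) assign the final offline reward R_goal and update the trajectory's Q-values (6)
        update_q_offline(prototypes, trajectory, reached_goal)
        # 5) regenerate the control field from the Q-weighted prototypes (7)
        field = generalize_field(prototypes)
    return field
```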
4. Simulation Results
In simulation we built a discrete workspace with a dimension of 20 × 20. The workspace contains a goal and obstacles, represented by 'G' and by rectangles, respectively. The robot has a 3 × 3 sensing neighbourhood, which means it can sense obstacles in its 8 neighbouring cells. The arrows represent the direction of each point in the field. The simulation is based on the algorithm proposed in Section 3. The robot was initially given a demonstration of reaching the goal, from which the initial control field was built (Figure 2). The simulation considered two different tasks with different uncertain complexities but the same initial control field.
Task 1: The robot had to avoid an obstacle and reach
the goal.
Task 2: The robot had to avoid obstacles and pass
through the door and reach the goal.
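For reference, a minimal sketch of such a discrete workspace and its 3 × 3 sensing window is shown below; the obstacle layout is arbitrary and not the one used in the reported experiments.

```python
import numpy as np

GRID = np.zeros((20, 20), dtype=int)             # 20 x 20 discrete workspace
GRID[8:12, 10] = 1                               # an arbitrary rectangular obstacle (1 = occupied)
GOAL = (18, 18)                                  # cell marked 'G'

def sensed_obstacles(pos):
    """Return the occupied cells among the 8 neighbours of pos (3 x 3 sensing window)."""
    r, c = pos
    hits = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) == (0, 0):
                continue
            rr, cc = r + dr, c + dc
            if 0 <= rr < GRID.shape[0] and 0 <= cc < GRID.shape[1] and GRID[rr, cc] == 1:
                hits.append((rr, cc))
    return hits

print(sensed_obstacles((9, 9)))                  # detects the obstacle cells in column 10
```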
In both cases the robot completed the task in a few trials, avoiding all the uncertainties. The final control fields for task 1 and task 2 are shown in Figures 3 and 4.
The learning history of the controller for both cases is shown in Figures 5-7. The broken-line curve and the solid-line curve represent the first and second tasks, respectively.
Figure 2. Demonstrated control field

Figure 3. Final control field of task 1

Figure 4. Final control field of task 2

In Figure 6 it can be observed that the number of hidden layer neurons increases with the number of trials. The increase in hidden layer neurons indicates an increase in the number of
centers, or prototype rules, in the network. This clearly indicates the learning in each trial. Figure 7 plots the total number of reinforcements used in each trial. The total reinforcement can be defined as the sum of the immediate reinforcement the learner receives until the robot reaches the goal and the offline reinforcement the learner receives after reaching the goal. It is clear from Figure 7 that after trial 3 the controller did not use any reinforcements to complete the tasks. The controller needed only the first three trials to learn a new control law that avoids uncertainty in both tasks. To check the reliability of the control field we chose different starting points; the robot was successful in completing the tasks from all of them.
Figure 5. Number of steps taken to reach the goal in each trial

Figure 6. Number of hidden layer neurons used in each trial

Figure 7. Number of reinforcements given in each trial

Another observation is that the number of reinforcements received (Figure 7) and the number of steps taken (Figure 5) in trial 1 of task 2 (solid-line curve) are very high. This indicates that in task 2 the maximum exploration was done in trial 1, whereas in task 1 exploration increased with the trials; with an increase in exploration the reinforcement also increases. Furthermore, it was during the first three trials that the robot received reinforcements (Figure 7) and that the network grew sharply (Figure 6).
5. Conclusions
This paper presented a learning strategy for generating a control field for a mobile robot in an unknown and uncertain environment, which integrates learning, generalization, exploration and offline computation into a unified architecture. After learning, a robot can approach the goal by following the control field. The important lessons learned from the implementation include: 1) imitation of a very accurate and exact action sequence is not necessary [15]; 2) prior knowledge is required to plan a model of the task and support rapid learning; 3) generalization is improved by improving the learning policies; 4) a simple method like inverse distance interpolation is adequate to generalize the task; 5) offline learning is an important method for real-time applications, since it avoids large online computations; 6) online exploration is required to explore other possibilities of performing a task and to improve the quality of learning; 7) a balance between exploration and exploitation improves the learning policies, which reduces the learning time significantly; 8) the robot can learn from a few demonstrations, but this affects the learning speed. The major disadvantage of this method is its use of position information, which is not always accurate on real robots because of inaccurate sensing and the rotational or translational errors caused by slippage between the robot's wheels and the floor.
REFERENCES
[1] M. N. Nicolescus and M. J. Mataric, “Natural Methods
for Robot Task Learning: Instructive Demonstrations,
Generalization and Practice,” In Proceedings of the Sec-
ond International Joint Conference on Autonomous Agents
and Multi-Agent Systems, Melbourne, Australia, July
14-18, 2003, pp. 241-248.
[2] M. N. Nicolescus and M. J. Mataric, “Learning and Interacting in Human-Robot Domains,” IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, Vol. 31, No. 5, 2001, pp. 419-430.
[3] G. Hailu, “Symbolic Structures in Numeric Reinforcement
for Learning Optimum Robot Trajectory,” Robotics and
Automation Systems, Vol. 37, No. 1, 2001, pp. 53-68.
[4] D. C. Bentivegna, C. G. Atkeson and G. Cheng, “Learn-
ing Tasks from Observation and Practice,” Robotics and
Automation Systems, Vol. 47, No. 2-3, 2004, pp. 163-169.
[5] G. Hailu and G. Sommer, “Learning by Biasing,” IEEE
International Conference on Robotics and Automation,
Leuven, Belgium, 1998, pp. 16-21.
[6] J. R. Millan, “Reinforcement Learning of Goal Directed
Obstacle Avoiding Reaction Strategies in an Autonomous
Mobile Robot,” Robotics and Autonomous Systems, Vol.
15, No. 4, 1995, pp. 275-299.
[7] A. Johannet and I. Sarda, “Goal-Directed Behaviours by
Reinforcement Learning,” Neurocomputing, Vol. 28, No.
1-3, 1990, pp. 107-125.
[8] P. M. Bartier and C. P. Keller, “Multivariate Interpolation
to Incorporate Thematic Surface Data Using Inverse Dis-
tance Weighting (IDW),” Computers & Geosciences, Vol.
22, No. 7, 1996, pp. 795-799.
[9] H. Friedrich, M. Kaiser and R. Dillmann, “What Can
Robots Learn from Humans,” Annual Reviews in Control,
Vol. 20, 1996, pp. 167-172.
[10] S. Thrun, “An Approach to Learning Mobile Robot Navi-
gation,” Robotics and Automation Systems, Vol. 15, 1995,
pp. 301-319.
[11] M. Kasper, G. Fricke, K. Steuernagel and E. von
Puttkamer, “A Behaviour Based Learning Mobile Robot
Architecture for Learning from Demonstration,” Robotics
and Automation Systems, Vol. 34, No. 2-3, 2001, pp.
153-164.
[12] W. Ilg and K. Berns, “A Learning Architecture Based on
Reinforcement Learning for Adaptive Control of the
Walking Machine LAURON,” Robotics and Automation
Systems, Vol. 15, No. 4, 1995, pp. 321-334.
[13] H. Friedrich, M. Kaiser and R. Dillmann, “PBD-The Key
to Service Robot Programming,” AAAI Technical Report
SS-96-02, American Association for Artificial Intelli-
gence, America, 1996.
[14] H. Friedrich, M. Kaiser and R. Dillmann, “Obtaining
Good Performance from a Bad Teacher,” In Programming
by Demonstration vs. Learning from Examples Workshop
at ML’95, Tahoe, 1995.
[15] R. Dillmann, M. Kaiser and A. Ude, “Acquisition of Ele-
mentary Robot Skills from Human Demonstration,” In
International Symposium on Intelligent Robotics Systems,
Pisa, Italy, 1995, pp. 185-192.
[16] R. S. Sutton and A. G. Barto, “Reinforcement Learning:
An Introduction,” The MIT Press Cambridge, Massachu-
setts, 1998.
[17] B. Bakker, V. Zhumatiy, G. Gruener and J. Schmidhuber,
“A Robot that Reinforcement-Learns to Identify and
Memorize Important Previous Observations,” Proceed-
ings IEEE/RSJ International Conference on Intelligent
Robots and Systems, Las Vegas, USA, October 27-31,
2003, pp. 430-435.
[18] A. Billard, Y. Epars, S. Calinon, S. Schaal and G. Cheng,
“Discovering Optimal Imitation Strategies,” Robotics and
Automation Systems, Vol. 47, No. 2-3, 2004, pp. 69-77.
[19] S. Russell and P. Norvig, “Artificial Intelligence: A Mod-
ern Approach,” Second Edition, Prentice Hall, New Jersey,
USA, 2002.