Intelligent Control and Automation, 2011, 2, 214-225
doi:10.4236/ica.2011.23026 Published Online August 2011 (http://www.SciRP.org/journal/ica)
Copyright © 2011 SciRes. ICA
Motor Learning Based on the Cooperation of Cerebellum
and Basal Ganglia for a Self-Balancing Two-Wheeled Robot
Xiaogang Ruan, Jing Chen, Lizhen Dai
School of Electronic Information and Control Engineering, Beijing University of Technology, Beijing, China
E-mail: chenjing0828@139.com
Received May 23, 2011; revised May 30, 2011; accepted June 7, 2011
Abstract
A novel motor learning method based on the cooperation of the cerebellum and basal ganglia is presented for the behavior learning of an agent. The motor learning method derives from the principles of the CNS and the operant learning mechanism, and it depends on the interactions between the basal ganglia and cerebellum. The whole learning system is composed of an evaluation mechanism, an action selection mechanism and a tropism mechanism. In the beginning, the learning signals come not only from the Inferior Olive but also from the Substantia Nigra. With the cerebellum as a supervisor, the learning speed is increased and the number of failures is reduced. Convergence can be guaranteed in the sense of entropy. With the proposed motor learning method, a motor learning system for a self-balancing two-wheeled robot has been built, using RBF neural networks as the actor and the evaluation function approximator. The simulation experiments show that the proposed motor learning system achieves a better learning effect, so the motor learning based on the coordination of the cerebellum and basal ganglia is effective.
Keywords: Motor Learning, Cerebellum, Basal Ganglia, Operant Learning, Self-Balancing Two-Wheeled
Robot
1. Introduction
Many skills and behaviors of humans and animals are formed and developed gradually in the self-learning and self-organization process of the nervous system. It is important to understand and imitate the internal learning mechanism of the human and animal nervous system, and giving such a mechanism to machines is an important issue for control science, artificial intelligence and robotics. In contrast with passive, receptive machine learning, autonomous learning is an active process of self-knowledge and independent learning, and it is the basis of an agent's long-term learning and development. The interaction of psychology, neurophysiology and machine learning theory has directly promoted the production and development of autonomous learning theory and methods.
Related research on neurophysiological theories indicates that, during sensorimotor learning, the cerebral cortex, basal ganglia and cerebellum work in parallel and in unique ways [1-3]. A rough behavior adapted to the environment can be learnt through the basal ganglia loop by reinforcement learning [4], and this rough behavior can be refined by supervised learning in the cerebellum [5]. The cerebral cortex, utilizing unsupervised Hebbian learning [6,7] and driven by input from the basal ganglia and cerebellum, learns through practice to perform these operations quickly and accurately. Studies show that the basal ganglia are the medium of behavior selection, and nuclei in the basal ganglia, such as the striatum, globus pallidus and substantia nigra, play the role of action selection [8-11]. Bernard W. Balleine et al., from the psychology and brain research institute of the University of California, have studied the integrative function of the basal ganglia in operant learning [12], which indicates that the dorsal striatum in the basal ganglia plays an important role in the operant conditioning learning process. From the above introduction, it can be seen that the human brain is a complex system, and the basal ganglia and cerebellum in the central nervous system both play an important role in motor learning. For motor learning, two different mechanisms can operate to make use of sensory information to correct motor errors. The first is on-line error correction, and the second is trial-to-trial learning; in the latter case, errors from one
trial influence the motor output on the next trial. These two learning mechanisms are related to the cerebellum and the basal ganglia, respectively.
However, the motor control and learning mechanisms of the cerebellum and basal ganglia seem to have been studied separately. For example, the cerebellum works as an adaptive filter [13,14], and feedback error learning in the cerebellum was proposed by Kawato [15,16]. An important class of system-level BG models is known as Actor-Critic or RL-based models [17,18]. In 2008, a contracting basal ganglia model was proposed for the action selection mechanism [19]. Although Doya [20] pointed out the complementary roles of the cerebellum and basal ganglia, how to implement such cooperation has not been made very specific. This paper is dedicated to the coordinated learning mechanism between the cerebellum and basal ganglia.
In the aspect of learning control theory, Ye, Yung and Wang proposed a hybrid algorithm aimed at the obstacle avoidance problem of robots [21]. They proposed a neural fuzzy system with mixed coarse-learning and fine-learning phases. In the first phase, a supervised learning method is used to determine the membership functions for the input and output variables simultaneously. After sufficient training, fine learning is applied, which employs a reinforcement learning algorithm to fine-tune the membership functions. Meng Joo Er et al. also applied a hybrid learning approach to obstacle avoidance [22]. In the simulation environment, preliminary supervised learning was executed using dynamic fuzzy neural networks, and on the real robot the neuro-fuzzy controller is capable of re-adapting in a new environment: after carrying out the learning phase on a simulated robot, the controller is implemented on the real robot. The hybrid learning method shows the coordination between supervised learning and reinforcement learning.
In the aspect of learning algorithm research, Rosenstein and Barto combined supervised learning with actor-critic learning and proposed a supervised actor-critic architecture [23]. This type of learning had also been suggested by Clouse and by Benbrahim [24,25]. In this paper, the supervised actor-critic algorithm is associated with the interactions between the cerebellum and basal ganglia. At the same time, operant learning is considered in the design of the motor learning system as the action selection mechanism of the basal ganglia. The example of motor learning in a self-balancing robot is presented to show the benefits of using the proposed motor learning method.
There are many kinds of two-wheeled self-balancing robots, such as the wheeled inverted pendulum, the self-balancing wheelchair, JOE and nBot [26-29]. All of them have the ability to keep themselves balanced while running forward. For balancing control, researchers have done a lot of work on such robots. An adaptive output recurrent cerebellar model articulation controller has been utilized to control wheeled inverted pendulums (WIPs) [30]. Grasser et al. [28] developed a dynamic model for designing a mobile inverted pendulum controller using a Newtonian approach and a linearization method. In [26], a dynamic model of a WIP was created with the wheel motor torques as input, accounting for nonholonomic no-slip constraints; vehicle pitch and position were stabilized using two controllers. Jung and Kim [31] created a mobile inverted pendulum using neural network (NN) control combined with a proportional-integral-derivative controller. However, there are few reports on autonomous learning based on neurophysiological theory. The acquisition of the balancing skill is the result of the central nervous system. How to imitate the brain function and construct a motor learning method for a balancing robot is the main target of this paper.
This paper is organized as follows. Section 2 addresses motor learning in the human central nervous system and shows the important role of the cerebellum and basal ganglia in the CNS. Section 3 introduces the operant learning method, which is an important working style in motor learning and is closely related to the mechanism of the basal ganglia. Section 4 is the main part of this paper: a motor learning method with the interaction of the cerebellum and basal ganglia is proposed, including the evaluation mechanism, the action selection mechanism and the tropism mechanism, and the convergence analysis is derived in the sense of entropy. Finally, in Section 5, the control structure is designed with the proposed method for a self-balancing two-wheeled robot, and comparative experiments are conducted to show the effectiveness of the proposed method. Section 6 concludes.
2. Motor Learning in CNS
An important feature of the central nervous system (CNS) in its control of movement is the capability of motor learning. Motor learning is the process of improving motor skills and the smoothness and accuracy of movements. It is important for calibrating simple movements such as reflexes, as the parameters of the body and environment change over time. For higher mammals, and especially humans, supervised learning and reinforcement learning are probably the most important classes of motor learning. In the CNS, two auxiliary monitoring systems are the cerebellum and the ganglion nuclear group in the forebrain area. The cerebral cortex, cerebellum and basal ganglia cooperate with each other and form a complex motor control system; Figure 1 shows their relations in the CNS [32]. The cerebellum specializes in supervised learning,
Figure 1. Schematic of the relationships in the central nervous system (CNS). Cerebral Cortex (CC); Basal Ganglia (BG); Cerebellum (CB); Brainstem (BS); Spinal Cord (SC). In the CNS, the BG and CB play an important role in motor control and learning. The thalamus relays sensation, spatial sense and motor signals to the cerebral cortex.
and its training signals derive from the climbing fiber input from the inferior olive, as shown in Figure 2. The basal ganglia are involved in reinforcement learning with an operant conditioning role, and the learning signals are dopaminergic signals from the Substantia Nigra. The neural circuit of the basal ganglia is shown in Figure 3.
3. Operant Learning
3.1. Related Work on Operant Learning
In psychology, an operant is a class of behavior that produces consequences by operating (i.e., acting) upon the environment. Operant conditioning (OC) is a technique of behavior modification through reinforcement and punishment: the consequence of a behavior influences its occurrence and form. Research on operant conditioning theory [34] was started in 1938 by Skinner, a professor of psychology. Operant conditioning and classical conditioning [35] are the two main ways of associative learning, and all animals, including humans, possess both. Operant conditioning is distinguished from classical conditioning in that it deals with the modification of operant behavior. Operant conditioning reflects the relation between a behavior and its outcome, and learning with OC theory is called operant learning (i.e., instrumental learning). Recently, researchers have applied OC theory to robot learning and control and have done plenty of experimental studies. For example, Björn Brembs et al. [36] have devoted themselves to research on operant conditioning in flies (Drosophila) and snails; "pure" operant conditioning and parallel operant conditioning at the flight simulator were studied. Chengwei Yao et al. [37] applied OC theory to emotion development and presented an emotion development agent model based on the OCC model and operant conditioning.
Figure 2. Working principle of cerebellar cortex. Climbing
fibers carry the error signals to modify the PF-PC (Parallel
fibers-Purkinje cells) synapse through LTD (long-term
depression) [33].
Figure 3. Neural circuit of the basal ganglia. Substantia nigra pars compacta (SNc); substantia nigra pars reticulata (SNr); internal segment of the globus pallidus (GPi); external segment of the globus pallidus (GPe); subthalamic nucleus (STN); dopamine (DA); arrows denote excitatory and inhibitory connections. DA from the substantia nigra is fed back to the striatum as the learning signal.
In [38], Kazuko Itoh et al. in Japan built a behavior model of humanoid robots based on operant conditioning, where the robot could select and output its behavior according to its behavior potential and the evaluation of the behavior; the robot WE-4RII learned the behavior of shaking hands with people using its right hand. Balleine et al. [12] studied the integrative function of the BG in operant conditioning, which suggested that the dorsal striatum in the BG plays an important part in operant learning. That is why OC theory
is considered in the research of the BG mechanism. The principle of OC is presented in the following subsection.
3.2. The Principle of Operant Conditioning Theory
Operant conditioning emphasizes that a behavior's consequence influences its occurrence. The difference between operant learning and supervised learning lies in the feedback signal obtained from the environment, which is an evaluation signal in the former and an error signal in the latter. There are three main elements of learning control based on OC theory: the probabilistic action selection mechanism, the evaluation mechanism and the tropism mechanism. Figure 4 shows the learning control mechanism based on OC theory. The probabilistic action selection mechanism is the main part of OC learning; it is used to update the action selection policy.
4. CB-BG-Based Motor Learning Scheme
4.1. CB-BG-Based Motor Learning Control Structure
According to the working modes of the cerebellum and basal ganglia, a CB-BG-based motor learning scheme is presented, which exploits a supervised actor-critic learning structure, as shown in Figure 5.

The coordinated factor ω is used to calculate the combined action a, which is obtained through the following formula:

a = ω a^S + (1 − ω) a^E    (1)

Here a^E, the exploratory action, is obtained from an artificial neural network together with a probabilistic action selection mechanism; a^S, the supervised action, is obtained from a traditional feedback controller, which provides intention and solution for the action networks; a, a simple weighted sum of the actions given by the component policies, is sent to the environment.
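To make the combination concrete, the following minimal Python sketch shows how Equation (1) can be evaluated; the names (combined_action, a_S, a_E, omega) are illustrative and not part of the original formulation, and ω is taken as the weight on the supervised action as reconstructed above.

```python
import numpy as np

def combined_action(a_S, a_E, omega):
    """Blend the supervised and exploratory actions as in Eq. (1):
    a = omega * a_S + (1 - omega) * a_E, with omega the cerebellar weight."""
    return omega * np.asarray(a_S) + (1.0 - omega) * np.asarray(a_E)

# Example: supervisor suggests 1.2 N*m, explorer suggests 0.4 N*m, omega = 0.7
a = combined_action(1.2, 0.4, 0.7)   # 0.7*1.2 + 0.3*0.4 = 0.96
```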
4.2. Algorithm Description
1) Evaluation Networks with the BG mechanism: The evaluation function V(s) can be updated with the TD error [39]:

δ_t = r_t + γ V(s_{t+1}) − V(s_t)

z_t is the eligibility trace of the evaluation network weights, which is computed by the linear difference equation defined as

z_t = γ λ z_{t−1} + ∂V(s_t)/∂W    (2)

The updating equation of the evaluation networks based on the eligibility trace is

w_{t+1} = w_t + α δ_t z_t    (3)
Figure 4. Learning control mechanism based on OC theory.
Figure 5. CB-BG-based motor learning control structure. BG: Basal Ganglia; CB: Cerebellum; IO: Inferior Olive; SN: Substantia Nigra; CF: Climbing Fibre; AN: Action Networks, which are constituted by the interaction of BG and CB; the Evaluation Mechanism belongs to the BG. The solid lines are the data flow, while the dashed lines represent the learning updating algorithm. The dopaminergic signal (δ) coming from the SN and fed back to the striatum in the BG is the learning signal for the action network in the BG part. Another learning signal is the supervised error e_S coming from the Inferior Olive (IO), in correspondence with the learning mechanism of the cerebellum. The two parts of the learning signals are coordinated by the coordinated factor ω. The Probabilistic Action Selection (PAS) mechanism expresses the operant learning in the BG. a^E is the exploratory action; a^S is the supervised action; r represents the reward signal after executing the combined action; the state s is the input of the action and evaluation networks.
Here α > 0 is a positive constant that expresses the learning step of the evaluation networks; 0 ≤ γ ≤ 1 is the discount factor of the reward; 0 ≤ λ ≤ 1 is the decay factor of the eligibility trace. If δ_t > 0 (i.e., the state is better than expected), the estimated critic value V(s) is increased. In contrast, if δ_t < 0 (i.e., the state is worse than expected), the estimated value V(s) in that state is decreased.
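As an illustration of Equations (2) and (3), the sketch below implements one TD(λ) critic step for a linear value function V(s) = wᵀφ(s); the feature map φ and the function name are assumptions made for the example, not part of the original text.

```python
import numpy as np

def critic_step(w, z, phi_t, phi_t1, r_t, alpha, gamma, lam):
    """One evaluation-network update with an eligibility trace.
    delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)              (TD error)
    z_t     = gamma*lam*z_{t-1} + dV(s_t)/dw = ... + phi_t (Eq. (2))
    w_{t+1} = w_t + alpha*delta_t*z_t                      (Eq. (3))"""
    delta = r_t + gamma * np.dot(w, phi_t1) - np.dot(w, phi_t)
    z = gamma * lam * z + phi_t
    w = w + alpha * delta * z
    return w, z, delta
```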
2) Action Networks with the interaction of CB and BG: In Figure 5, the actor policy π_A(s) is a mapping from state to action and can be given by a function approximator with the parameter vector θ:

a^A = π_A(s, θ)    (4)

The random action selection strategy for a^E obeys a Boltzmann-Gibbs probability distribution [40]:

ρ(a^E) = (1/Z) exp(−E(a^E)/(K_B T))    (5)

Here E(a^E) = (a^E − a^A)², T > 0 is the thermodynamic temperature, K_B = 1.38 × 10⁻²³ J/K is the Boltzmann constant, exp(−E(a^E)/(K_B T)) is the Boltzmann factor, and Z is the partition function:

Z = ∫ exp(−E(a^E)/(K_B T)) da^E = √(π K_B T)    (6)
After each state transition, these parameters are modified according to the following rule. It includes two parts: Δθ^CB is realized through the function of the cerebellum, and Δθ^BG corresponds to the basal ganglia part:

Δθ = (1 − ω) Δθ^BG + ω Δθ^CB    (7)

In order to compute the supervised learning update Δθ^CB, the supervisory error can be defined for each observed state as follows:

E_S = (1/2) [π_S(s) − π_A(s)]²    (8)
The weight update of the action networks in the cerebellum part is

Δθ^CB = −β ∂E_S/∂θ = β (a^S − a^A) ∂π_A(s)/∂θ    (9)

The actor network update in the BG part applies a kind of approximated policy gradient estimation algorithm:

Δθ^BG = β δ (a^E − a^A) ∂π_A(s)/∂θ    (10)
β > 0 is the learning step of the action networks; 0 ≤ ω ≤ 1 weighs between two kinds of gradient information: one is related to the second evaluation signal δ (also called the internal return), and the other is related to the supervised error (a^S − a^A). In order to express the importance of the reward signal, the internal return signal δ is introduced into the cerebellar item β (a^S − a^A) ∂π_A(s)/∂θ, so we obtain the improved weight updating method:

Δθ = β δ [(1 − ω)(a^E − a^A) + ω (a^S − a^A)] ∂π_A(s)/∂θ    (11)
The improved supervised actor network weight-updating algorithm shows that the TD error δ modulates the supervisory error (a^S − a^A). That is to say, one "trusts" the critic more than the supervisor, in which case one should view the supervisor as an additional source of exploration. That is the coordination between the cerebellum and basal ganglia. When the internal return δ is small, the actor network loses the ability to learn through the supervisor, whereas in the unimproved algorithm this learning ability depends only on the change of the coordinated factor ω.
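Under the same linear-in-features assumption used for the critic (a^A = θᵀφ(s), so ∂π_A(s)/∂θ = φ(s)), the improved update of Equation (11) can be sketched in Python as follows; the grouping of the two error terms follows the reconstruction above, and the names are illustrative.

```python
import numpy as np

def actor_step(theta, phi, a_A, a_E, a_S, delta, beta, omega):
    """Improved supervised actor update, Eq. (11):
    theta <- theta + beta*delta*[(1-omega)*(a_E - a_A) + omega*(a_S - a_A)]*phi,
    so the TD error delta modulates both the exploratory and supervisory errors."""
    err = (1.0 - omega) * (a_E - a_A) + omega * (a_S - a_A)
    return theta + beta * delta * err * phi
```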
3) Tropism Mechanism: In [41], a temperature parameter defined by the state values and the frequency of visiting states is attached to each representative state-action pair. Following this idea, the probabilistic tropism can be changed through the temperature T, which depends on the critic value V(s). The bigger the estimated value V(s), the better the action tropism and the smaller the temperature; conversely, the smaller V(s), the worse the tropism and the higher the temperature, and the system is in a much more stochastic situation. The tropism rule is:

T = k₁ / (1 + exp(k₂ V(s_t)))    (12)

Here k₁ > 0 and k₂ > 0 are constants, and k₁ is defined as the Simulated Annealing Factor (SAF), which can be changed during learning; V(s_t) is the critic value of the current state.
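As a brief illustration, the tropism rule (12) can be written in Python as below; the default values of k1 and k2 are only examples.

```python
import math

def tropism_temperature(V, k1=10.0, k2=0.1):
    """Tropism rule of Eq. (12): T = k1 / (1 + exp(k2 * V(s_t))).
    A larger critic value V gives a lower temperature, hence less
    stochastic (more certain) action selection, and vice versa."""
    return k1 / (1.0 + math.exp(k2 * V))

# With k1 = 10 and k2 = 0.1: V = 0 gives T = 5.0, V = 40 gives T of about 0.18.
```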
4) Interaction Mechanism of the cerebellum and basal ganglia: The coordinated factor ω expresses the weight of cerebellar supervisory learning in the whole learning process. It can be changed and reflects the roles of the cerebellum and basal ganglia in different motor learning stages. The following exponential method can be used, as in literature [42] and as shown in Figure 6:

ω(t) = max{0, 1 − exp[(t − n₂)/n₃]}, n₃ > 0    (13)

where t counts the learning steps and n₂, n₃ are constants.
5) CB-BG-based Motor Learning Algorithm with the RBF Networks: A radial basis function (RBF) network is an artificial neural network that uses radial basis functions as activation functions; it was proposed by J. Moody and C. Darken in the late 1980s. Its output is a linear combination of radial basis functions. RBF networks imitate the locally tuned and mutually overlapping receptive-field structure of neural networks in the human brain.
Figure 6. CB coordination factor.
The radial basis function can be used for the coding of a CMAC. The conventional CMAC uses constant binary or triangular receptive-field basis functions; however, derivative information is not preserved. To acquire derivative information of the input and output variables, Chiang and Lin developed a CMAC with a differentiable Gaussian receptive-field basis function and provided convergence analysis results for this network [43]. As mentioned above, we approximate the action and evaluation functions by RBF neural networks in our proposed algorithm.

The three-layer topological structure of the RBF networks is shown in Figure 7 and is introduced in detail as follows.
The first layer is the input layer, which represents the input space. Each neuron corresponds to an input variable, given as x = [x₁, x₂, ..., x_n]ᵀ.

The second layer is the hidden layer, in which every node is a Gaussian function. The vector formed by the outputs of the hidden layer is Φ = [Φ₁, Φ₂, ..., Φ_{n_B}]ᵀ, where n_B is called the resolution; a bigger n_B gives a higher resolution.

Φ_k = exp( −Σ_{i=1}^{n} (x_i − c_{ki})² / (2 b_{ki}²) )    (14)

where c_k = [c_{k1}, c_{k2}, ..., c_{kn}]ᵀ is the center vector and b_k = [b_{k1}, b_{k2}, ..., b_{kn}]ᵀ, k = 1, 2, ..., n_B, is the base width vector.

The third layer is the output layer, which has one node and is used for approximating the unknown nonlinear function. The weight vector is v = [v₁, v₂, ..., v_{n_B}]ᵀ, and the output is

y_out = Σ_{i=1}^{n_B} v_i Φ_i = vᵀΦ    (15)
For the action and evaluation networks, the same RBF neural network structure is used; the output weight vectors of the action and evaluation networks are represented by θ and w, respectively.
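The forward pass of this network (Equations (14)-(15)) can be sketched as follows; the 7 x 7 grid of centres and the width of 0.5 are illustrative values for two balance inputs, not prescriptions from the original.

```python
import numpy as np

def rbf_forward(x, centers, widths, v):
    """RBF network of Eqs. (14)-(15).
    x: input vector (n,); centers: (n_B, n) Gaussian centres c_k;
    widths: (n_B, n) base widths b_k; v: (n_B,) output weights.
    Returns y_out = sum_k v_k * Phi_k(x) and the hidden vector Phi."""
    phi = np.exp(-np.sum((x - centers) ** 2 / (2.0 * widths ** 2), axis=1))
    return float(np.dot(v, phi)), phi

# Example: 2 inputs and a 7 x 7 grid of n_B = 49 hidden nodes.
grid = np.linspace(-3, 3, 7)
centers = np.array([[c1, c2] for c1 in grid for c2 in grid])  # shape (49, 2)
widths = np.full_like(centers, 0.5)
v = np.zeros(49)
y_out, phi = rbf_forward(np.array([0.17, 0.0]), centers, widths, v)
```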
Figure 7. Architecture of an RBFN.
The process of the CB-BG-based motor learning algorithm with the RBF networks can then be summarized as follows.

CB-BG-based Motor Learning Algorithm:
Step 1: Initialization: initialize the learning steps α and β, the discount factor γ ∈ [0, 1], and the eligibility trace decay factor λ ∈ [0, 1]; initialize all weight parameters θ and w arbitrarily; let trytime = 0.
Step 2: Repeat (for each episode):
1) z₀ = 0 (reinitialize the eligibility trace); choose an initial state s₀.
2) Repeat (for each step of the episode):
a) For the current state s_t, calculate a^A = π_A(s_t) and a^S = π_S(s_t); select the exploratory action a^E according to the Boltzmann distribution; set the coordinated factor ω; compute the combined action a and the eligibility trace z_t = γ λ z_{t−1} + ∂V(s_t)/∂W.
b) Execute the action a and receive the immediate reward r_t; observe the new state s_{t+1}; calculate the TD error δ_t = r_t + γ V(s_{t+1}) − V(s_t); update the weights through w_{t+1} = w_t + α δ_t z_t and θ_{t+1} = θ_t + β δ_t [(1 − ω)(a^E − a^A) + ω (a^S − a^A)] ∂π_A(s_t)/∂θ.
c) t ← t + 1;
until the termination condition of the episode is satisfied.
Until the desired number of episodes has been investigated.
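The following Python sketch puts the steps above together as one training loop. It is an illustrative rendering under several assumptions: the environment interface (env.reset()/env.step()), the supervisor pi_S, the feature map phi and the coordinated-factor schedule omega_fn are placeholders, and kB is treated as a tunable exploration scale rather than the physical Boltzmann constant quoted in the text.

```python
import numpy as np

def cbbg_learn(env, pi_S, phi, omega_fn, n_features, episodes,
               alpha=0.1, beta=0.1, gamma=0.8, lam=0.8,
               k1=10.0, k2=0.1, kB=1.0):
    """Illustrative CB-BG supervised actor-critic loop (Section 4.2).
    phi(s): state -> RBF feature vector; pi_S(s): supervised (cerebellar)
    action; omega_fn(t): coordinated-factor schedule; env has a gym-like
    reset()/step(a) -> (s', r, done) interface (assumed here)."""
    w = np.zeros(n_features)      # evaluation (critic) weights
    theta = np.zeros(n_features)  # action (actor) weights
    for _ in range(episodes):
        z = np.zeros(n_features)  # eligibility trace, reinitialised per episode
        s, done, t = env.reset(), False, 0
        while not done:
            f = phi(s)
            V = float(np.dot(w, f))
            a_A = float(np.dot(theta, f))            # actor action, Eq. (4)
            a_S = pi_S(s)                            # supervised action
            T = k1 / (1.0 + np.exp(k2 * V))          # tropism rule, Eq. (12)
            a_E = np.random.normal(a_A, np.sqrt(kB * T / 2.0))  # exploration, Eq. (5)
            omega = omega_fn(t)                      # coordinated factor
            a = omega * a_S + (1.0 - omega) * a_E    # combined action, Eq. (1)
            s1, r, done = env.step(a)
            delta = r + gamma * float(np.dot(w, phi(s1))) - V   # TD error
            z = gamma * lam * z + f                             # Eq. (2)
            w = w + alpha * delta * z                           # Eq. (3)
            theta = theta + beta * delta * ((1.0 - omega) * (a_E - a_A)
                                            + omega * (a_S - a_A)) * f  # Eq. (11)
            s, t = s1, t + 1
    return theta, w
```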
4.3. Algorithm Convergence Analysis
Theorem 1: The action a^E obeys the Boltzmann-Gibbs probability distribution

ρ(a^E) = (1/Z) exp(−E(a^E)/(K_B T))

where E(a^E) = (a^E − a^A)² and Z = ∫ exp(−E(a^E)/(K_B T)) da^E = √(π K_B T). The total probability over the real line is one.

Proof: The total probability over the real line can be computed by the following integral:

∫ ρ(a^E) da^E = (1/Z) ∫ exp(−E(a^E)/(K_B T)) da^E = (1/√(π K_B T)) √(π K_B T) = 1
Theorem 2: Along with learning, the entropy of the system,

E = −∫ ρ(a^E) log₂ ρ(a^E) da^E,

decreases gradually, and the system changes from uncertainty to certainty over time.

Proof:
Literature [44] lists the entropy of several continuous probability distributions; the entropy of a continuous system that obeys a normal distribution is E = (1/2) log₂(2πeσ²). The Boltzmann-Gibbs probability distribution in this paper is consistent with the normal distribution N(a^A, K_B T/2); that is to say, the mean is a^A and the variance is σ² = K_B T/2. So the entropy of the continuous system is

E = (1/2) log₂(π e K_B T)    (16)
Along with the learning progress, V(s) approximates the largest sum of discounted rewards. According to the tropism mechanism of formula (12), and by setting suitable parameters, exploration and exploitation can be kept coordinated, and the action policy learnt at last satisfies π* = arg max_π E_π[V(s)]. The learning algorithm makes the function V(s) develop toward its maximum. From the tropism mechanism, we can find that T decreases as V increases, so the system transforms from uncertainty to certainty along with learning.

The entropy of a discrete system has good properties, such as non-negativity and invariance under coordinate transformations. Different from the discrete case, the entropy of a continuous system, also called differential entropy, does not have these properties. When T < 1/(πeK_B) ≈ 8.4855 × 10²¹ K, we have E < 0, i.e., the entropy can be negative, as shown in Figure 8. Therefore, the proposed method is convergent in the sense of entropy, and the system changes from uncertainty to certainty.
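The threshold temperature below which the differential entropy (16) becomes negative can be checked numerically with a few lines of Python (a sketch; K_B is the value quoted in Section 4).

```python
import math

K_B = 1.38e-23  # Boltzmann constant, J/K, as used in Section 4

def entropy_bits(T):
    """Differential entropy of Eq. (16): E = 0.5 * log2(pi * e * K_B * T) bits."""
    return 0.5 * math.log2(math.pi * math.e * K_B * T)

T_zero = 1.0 / (math.pi * math.e * K_B)   # temperature at which E = 0
# T_zero is about 8.49e21 K; for T below this value the entropy is negative.
```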
5. Simulation Experiment and Analyses
5.1. Self-Balancing Two-Wheeled Robot
The self-balancing two-wheeled robot is a complex nonlinear under-actuated system with a human-like self-balancing skill, so researching the cognition mode of such a robot has important theoretical significance. Its schematic structure is shown in Figure 9. The parameters are defined in Table 1.
The dynamic mathematical model is

M(q) q̈ + N(q, q̇) = E T    (17)

where q = [θ_b, θ_l, θ_r]ᵀ and T = [τ_l, τ_r]ᵀ; the matrices M(q), N(q, q̇) and E are given in the Appendix.
Figure 8. Entropy changing with temperature T.
Figure 9. System structure of robot.
Table 1. Definition of parameters.

Parameter | Description | Value
m_l, m_r /kg | mass of the left and the right wheel | 1
M /kg | mass of the intermediate body | 10
R /m | radius of the wheels | 0.15
d /m | distance between O_l and O_r | 0.44
L /m | distance between O and G | 0.4
J_l = J_r = J_w /kg·m² | moment of inertia of each wheel about its axis | 0.01125
J_y /kg·m² | moment of inertia of the robot about the y-axis | 1.5417
J_z /kg·m² | moment of inertia about the z-axis | 0.5893
g /(m/s²) | acceleration of gravity | 9.8
f_l, f_r /(N·m/(rad/s)) | frictional coefficients of the two wheels with the ground | 0.1, 0.1
f_w /(N·m/(rad/s)) | frictional coefficient of the wheel axes | 0.1
x /m | forward displacement of the robot | variable
v /(m/s) | forward velocity of the robot | variable
θ_b /rad | inclination angle of the intermediate body | variable
θ̇_b /(rad/s) | inclination angular velocity of the intermediate body | variable
δ /rad, δ̇ /(rad/s) | yaw angle and yaw angular velocity of the robot | variable
θ_r, θ_l /rad | rotation angles of the right and left wheels | variable
θ̇_l, θ̇_r /(rad/s) | rotation angular velocities of the left and right wheels | variable
τ_l /N·m | torque provided by the left motor | [−5, 5]
τ_r /N·m | torque provided by the right motor | [−5, 5]
According to the definition of an under-actuated system, the affine nonlinear system can be formulated as q̈ = f₁(q, q̇, t) + f₂(q, q̇, t) u, and if the condition rank[f₂(q, q̇, t)] < dim(q) is satisfied (i.e., the number of actuators is less than the number of degrees of freedom, DOF), the system is under-actuated. For the two-wheeled self-balancing robot, rank[M⁻¹(q)E] = 2 < 3, so it is an under-actuated system. Left-multiplying Equation (17) by the transformation matrix A gives A M(q) q̈ + A N(q, q̇) = A E T, where

A = [1 1 1; 0 1 0; 0 0 1]

Then we can get

M̄(q) q̈ + N̄(q, q̇) = Ē T    (18)

where M̄(q) = A M(q), N̄(q, q̇) = A N(q, q̇), and Ē = A E = [0 1 0; 0 0 1]ᵀ.
Equation (18) can be divided into

M̄₁₁ q̈₁ + M̄₁₂ q̈₂ + N̄₁(q, q̇) = 0
M̄₂₁ q̈₁ + M̄₂₂ q̈₂ + N̄₂(q, q̇) = τ    (19)

where q₁ ∈ ℝ^l, q₂ ∈ ℝ^m, q ∈ ℝ^n, l + m = n; q₁ denotes the passive joints and q₂ denotes the actuated joints; N̄ denotes the Coriolis force and gravity terms. The inertia matrix is partitioned as

M̄(q) = [M̄₁₁ M̄₁₂; M̄₂₁ M̄₂₂]

where M̄ is positive definite, and M̄₁₁ and M̄₂₂ are also positive definite. Through the controllability analysis, we can conclude that the system is controllable. So it is important to note that an under-actuated system is not necessarily an uncontrollable system.
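A small numerical check of the under-actuation condition is sketched below: with the transformation matrix A given above and the input matrix E of the Appendix (whose signs are part of the reconstruction), the rank of Ē = AE is 2, which is less than the three degrees of freedom.

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
E = np.array([[-1.0, -1.0],   # reaction of both motor torques on the body
              [ 1.0,  0.0],   # left-wheel torque
              [ 0.0,  1.0]])  # right-wheel torque

E_bar = A @ E                                 # [[0, 0], [1, 0], [0, 1]]
print(np.linalg.matrix_rank(E_bar), "< 3")    # 2 < 3: fewer actuators than DOF
```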
5.2. Simulation Experiment
Applying the motor learning method proposed in Section 4, we carried out a simulation study of balance skill learning on a self-balancing two-wheeled robot; the bionic learning control structure is illustrated in Figure 10. There are five main parts in the control system: receptor, afferent nerve, nerve centre, efferent nerve and effector. The CB-BG-based motor learning method proposed in Section 4 is included in the nerve centre. DAS denotes the data acquisition system.
According to the sliding mode control idea, we define the reward function and the sliding surface H as follows:

H = {x : s(x) = 0}    (20)

s(t) = a₁ θ_b(t) + a₂ θ̇_b(t)    (21)

We can obtain the reward information according to the change of the sliding surface: after executing an action, if s(t) moves closer to zero, a reward is given; conversely, a penalty is given. Here 0 denotes reward and −1 denotes penalty. This can be formulated as

r_t = 0 (reward) if s²(t) < s²(t − 1), and r_t = −1 (penalty) otherwise.    (22)
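A minimal Python sketch of this sliding-surface reward follows; the coefficients a1 and a2 and the function name are illustrative.

```python
def sliding_reward(theta_b, dtheta_b, s_prev, a1=1.0, a2=1.0):
    """Reward of Eqs. (20)-(22): s(t) = a1*theta_b(t) + a2*dtheta_b(t);
    return 0 (reward) if s(t)^2 shrank relative to the previous step,
    i.e. the state moved towards the sliding surface H, else -1 (penalty)."""
    s = a1 * theta_b + a2 * dtheta_b
    r = 0.0 if s * s < s_prev * s_prev else -1.0
    return r, s
```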
The robot state signals are detected, and the input signals are normalized and sent to the CNS, which includes the cerebellum and basal ganglia. The coordinated factor ω is used in the form of the exponential decay of formula (13). The parameters are chosen as follows, where T is the sampling time and n_θb, n_θ̇b are the normalizing factors for the angle and the angular velocity, respectively:

T = 10 ms; n = 2; n_B = 49; k₁ = 10; k₂ = 0.1; α = β = 0.1; γ = 0.8; λ = 0.8; n₂ = 500; n₃ = 200; a₁ = a₂ = 1; n_θb = n_θ̇b = π/54; c₁ = c₂ = [−3, −2, −1, 0, 1, 2, 3]ᵀ; b₁ = b₂ = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]ᵀ; w, θ ∈ ℝ^{49×1}; a^S ∈ [−5, 5].

The initial inclination angle is θ_b(0) = 0.1744 rad.
1) The first experiment: two runs are compared, one with cerebellum coordination and one without, and the resulting inclination angle θ_b is observed.
Figure 10. The bionic motor learning structure for self-
balancing two-wheeled robot.
From the results, we can conclude that the robot can learn to choose suitable actions intelligently and keep its balance with both methods. The difference is that, with the coordination of the cerebellum, the robot achieves the expected result with zero failures, as shown in Figure 11(a). With the learning method using only the basal ganglia mechanism, however, the robot failed eight times and obtained the correct action on the ninth trial; the corresponding curve and data analysis chart are shown in Figures 11(b) and (c).
2) The second experiment: after learning with cerebellum coordination for one trial, the robot can keep its balance through operant learning only. The Simulated Annealing Factor (SAF) k₁ is adjusted and the change of the inclination angle is observed.

The annealing process is as follows: in the beginning, the SAF is k₁ = 10; on the 11th learning turn, k₁: 10 → 5; on the 16th turn, k₁: 5 → 1; on the 21st turn, k₁: 1 → 0.5; and on the 23rd turn, k₁: 0.5 → 0.1.
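Written as code, this annealing of the SAF is a simple step schedule over the (1-indexed) learning turns; the sketch below only restates the values given above.

```python
def saf_schedule(turn):
    """Simulated Annealing Factor k1 per 1-indexed learning turn:
    10 for turns 1-10, 5 for 11-15, 1 for 16-20, 0.5 for 21-22,
    and 0.1 from the 23rd turn onwards."""
    if turn < 11:
        return 10.0
    if turn < 16:
        return 5.0
    if turn < 21:
        return 1.0
    if turn < 23:
        return 0.5
    return 0.1
```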
The average temperature and entropy for every turn have been calculated, as shown in Figure 12. It can be seen that the entropy decreases with learning. The balance learning results and the inclination angle curves are illustrated in Figure 13, and the convergence results at the beginning and at the end are contrasted in Figure 14. From the experimental results, convergence can be seen as the temperature decreases.

From the above two experiments, it can be seen that, using the proposed CB-BG-based motor learning method, the self-balancing robot can learn the balancing skill efficiently. By decreasing the temperature, the learning results become better and better, and the decreasing entropy also indicates that the system develops from uncertainty to certainty.
6. Conclusions
For the behavior learning of an agent, a novel motor learning method based on the cooperation of the cerebellum and basal ganglia has been presented. The evaluation mechanism, tropism mechanism, action selection mechanism and the learning algorithm with the cerebellum coordination mechanism have been proposed. In the beginning, the learning signals come from two parts (the Inferior Olive and the Substantia Nigra); at last, the learning signals come only from the Substantia Nigra. Through analysis, the algorithm is convergent in the sense of entropy. With a self-balancing two-wheeled robot as the experimental plant, a motor learning control system has been constructed for the robot with the proposed method, in which RBF neural networks are used to approximate the
Figure 11. Results of comparative experiments. (a) Simu-
lation result with the proposed motor learning method; (b)
Simulation result with the traditional actor-critic method;
(c) Data analysis with unimproved method.
Figure 12. Temperature and entropy changing with learning.
Figure 13. Changing of the inclination angle θ_b with the decreasing of temperature. The entropy at the 2nd, 10th, 11th, 16th, 21st and 23rd temperature decreases is 4.5409, 4.5403, 3.5372, 1.2160, 0.2178 and −2.1094 bits, respectively.
Figure 14. Convergence process of learning.
action and evaluation networks, the sliding mode method is used to design the reward function, and the exponential decay factor is used to shape the cerebellar effect over the whole motor learning process. The comparison of simulation experiments shows that the proposed method improves on the learning algorithm with only the basal ganglia mechanism in terms of learning speed and number of failures. After many rounds of learning and temperature decreasing, the certainty of the system is eventually reinforced along with the decrease of entropy; the oscillation disappears, and the learning speed and accuracy are further improved. The principal contribution of this paper is a neurophysiologically based motor learning method with the coordination of the cerebellum and basal ganglia, on which a central nervous system for a self-balancing two-wheeled robot can be constructed.
7. Acknowledgements
We acknowledge support from the National Natural Science Foundation of China (No. 61075110), China's 863 Program (No. 2007AA04Z226), the Key Project (KM200810005016) of the S&T Plan of the Beijing Municipal Commission of Education, and the Beijing Natural Science Foundation (No. 4102011).
8. References
[1] J. C. Houk and S. P. Wise, “Distributed Modular Archi-
tectures Linking Basal Ganglia, Cerebellum, and Cere-
Bral Cortex: Their Role in Planning and Controlling Ac-
tion,” Cerebral Cortex, Vol. 5, No. 2, 1995, pp. 95-110.
doi: 10.1093/cercor/5.2.95
[2] X. Lu, O. Hikosaka and S. Miyachi, “Role of Monkey
Cerebellar Nuclei in Skill for Sequential Movement,”
Journal of Neurophysiology, Vol. 79, No. 5, 1998, pp.
2245-2254.
[3] K. Doya, “What are the Computations of the Cerebellum,
the Basal Ganglia and the Cerebral Cortex?” Neural Net-
works, Vol. 12, No. 7-8, 1999, pp. 961-974.
doi:10.1016/S0893-6080(99)00046-5
[4] J. C. Houk, “Agents of the mind,” Biological Cybernetics,
Vol. 92, No. 6, 2005, pp. 427-437.
doi:10.1007/s00422-005-0569-8
[5] E. I. Knudsen, “Supervised Learning in the Brain,” The
Journal of Neuroscience, Vol. 14, No. 7, 1994. pp. 3985-
3997.
[6] T. V. P. Bliss and G. L. Collingridge, “A Synaptic Model of Memory: Long-Term Potentiation in the Hippocampus,” Nature, Vol. 361, 1993, pp. 31-39.
doi:10.1038/361031a0
[7] S. E. Hua and J. C. Houk, “Cerebellar Guidance of Pre-
motor Network Development and Sensorimotor Learn-
ing,” Learning and Memory, Vol. 4, No. 1, 1997, pp.
63-76. doi:10.1101/lm.4.1.63
[8] B. Girard, N. Tabareau, Q. C. Pham, A. Berthoz and J. J. Slotine, “Where Neuroscience and Dynamic System Theory Meet Autonomous Robotics: A Contracting Basal Ganglia Model for Action Selection,” Neural Networks, Vol. 21, No. 4, 2008, pp. 628-641.
doi:10.1016/j.neunet.2008.03.009
[9] R. Bogacz and K. Gurney. “The Basal Ganglia and Cor-
tex Implement Optimal Decision Making between Alter-
native Actions,” Neural Computation, Vol. 19, No. 2,
2007, pp. 442-477. doi:10.1162/neco.2007.19.2.442
[10] J. W. Mink, “The Basal Ganglia: Focused Selection and Inhibition of Competing Motor Programs,” Progress in Neurobiology, Vol. 50, No. 4, 1996, pp. 381-425.
doi:10.1016/S0301-0082(96)00042-1
[11] M. X. Cohen and M. J. Frank, “Neurocomputational
Models of Basal Ganglia Function in Learning, Memory
and Choice,” Behavioural Brain Research, Vol. 199, No.
1, 2009, pp. 141-156. doi:10.1016/j.bbr.2008.09.029
[12] B. W. Balleine, M. Liljeholm and S. B. Ostlund, “The
Integrative Function of the Basal Ganglia in Instrumental
Conditioning,” Behavioural Brain Research, Vol. 199,
No. 1, 2009, pp. 43-52.
doi:10.1016/j.bbr.2008.10.034
[13] P. Dean and J. Porrill, “Adaptive-Filter Models of the
Cerebellum: Computational Analysis,” The Cerebellum,
Vol. 7, No. 4, 2008, pp. 567-571.
doi:10.1007/s12311-008-0067-3
[14] P. Dean, J. Porrill, C. F. Ekerot and H. Jörntell, “The Cerebellar Microcircuit as an Adaptive Filter: Experimental and Computational Evidence,” Nature Reviews Neuroscience, Vol. 11, No. 1, 2010, pp. 30-43.
doi:10.1038/nrn2756
[15] M. Kawato, “Feedback-Error-Learning Neural Network
for Supervised Motor Learning,” Elsevier, Amsterdam,
1990.
[16] M. Kawato and H. Gomi, “A Computational Model of
Four Regions of the Cerebellum Based on Feeback-Error-
Learning,” Biological Cybernetics, Vol. 68, No, 2, 1992,
pp. 95-103.
doi:10.1007/BF00201431
[17] D. Joel, Y. Niv and E. Ruppin, “Actor-Critic Models of
the Basal Ganglia: New Anatomical and Computational
Perspectives,” Neural Network, Vol. 15, No. 4-6, 2002,
pp. 535-547. doi:10.1016/S0893-6080(02)00047-3
[18] M. Khamassi, L. Lachèze, B. Girard, A. Berthoz and A.
Guillot, “Actor-Critic Models of Reinforcement Learning
in the Basal Ganglia: from Natural to Artificial Rats,”
Adaptive Behavior, Vol. 13, No. 2, 2005, pp. 131-148.
doi:10.1177/105971230501300205
[19] B. Girard, N. Tabareau, Q. C. Phama, A. Berthoz and J. J.
Slotine, “Where Neuroscience and Dynamic System
Theory Meet Autonomous Robotics: A Contracting Basal
Ganglia Model for Action Selection,” Neural Networks,
Vol. 21, No. 4, 2008, pp. 628-641.
doi:10.1016/j.neunet.2008.03.009
[20] K. Doya, “Complementary Roles of Basal Ganglia and
Cerebellum in Learning and Motor Control,” Current
Opinion in Neurobiology, Vol. 10, No. 6, 2000, pp.
732-739. doi:10.1016/S0959-4388(00)00153-7
[21] C. Ye, N. H. C. Yung and D. W. Wang, “A Fuzzy Con-
troller with Supervised Learning Assisted Reinforcement
Learning Algorithm for Obstacle Avoidance,” IEEE
Transactions on Systems, Man, and Cybernetics, Part B:
Cybernetics, Vol. 33, No. 1, 2003, pp. 17-27.
doi:10.1109/TSMCB.2003.808179
[22] M. J. Er and C. Deng, “Obstacle Avoidance of a Mobile
Robot Using Hybrid Learning Approach,” IEEE Trans-
actions on Industrial Electronics, Vol. 52, No. 3, 2005,
pp. 898-905. doi:10.1109/TIE.2005.847576
[23] M. T. Rosenstein and A. G. Barto, “Learning and Ap-
proximate Dynamic Programming: Scaling Up to the
Real World,” IEEE Press and John Wiley & Sons, Inc.,
New York, 2004.
[24] J. A. Clouse and P. E. Utgoff, “A Teaching Method for Reinforcement Learning,” Proceedings of the Ninth International Conference on Machine Learning, San Francisco, 1992, pp. 92-101.
[25] H. Benbrahim, and J. A. Franklin, “Biped Dynamic
Walking Using Reinforcement Learning,” Robotics and
Autonomous Systems, Vol. 22, No. 3-4, 1997, pp. 283-
302. doi:10.1016/S0921-8890(97)00043-2
[26] K. Pathak, J. Franch and S. K. Agrawal, “Velocity and
Position Control of a Wheeled Inverted Pendulum by
Partial Feedback Linearization,” IEEE Transactions on
Robotics, Vol. 21, No. 3, 2005, pp. 505-513.
doi:10.1109/TRO.2004.840905
[27] A Blankespoor and R Roemer, “Experimental Verifica-
tion of the Dynamic Model for a Quarter Size
Self-Balancing Wheelchair,” Proceedings of the 2004
American Control Conference, Boston, 2004, pp. 488-
492.
[28] F Grasser, A. D’Arrigo, S. Colombi and A. C. Rufer,
“JOE: A Mobile, Inverted Pendulum,” IEEE Transac-
tions on Industrial Electronics, Vol. 49, No. 1, 2002, pp.
107-114. doi:10.1109/41.982254
[29] D. P. Anderson, “NBot Balancing Robot, a Two Wheel
Balancing Robot,” 2003.
http://www.geology.smu.edu/~dpa-ww/robo/nbot/index.
html.
[30] C.-H. Chiu, “The Design and Implementation of a
Wheeled Inverted Pendulum Using an Adaptive Output
Recurrent Cerebellar Model Articulation Controller,”
IEEE Transactions on Industrial Electronics, Vol, 57, No.
5, 2010, pp. 1814-1822. doi:10.1109/TIE.2009.2032203
[31] S. Jung and S. S. Kim, “Control Experiment of a
Wheel-Driven Mobile Inverted Pendulum Using Neural
Network,” IEEE Transactions on Control Systems Tech-
nology, Vol. 16, No. 2, 2008, pp. 297-303.
doi:10.1109/TCST.2007.903396
[32] L. Q. Han and X. Y. Tu, “Study of Artificial Brain Based
on Multi-Centrum Self-Coordination Mechanism,” Sci-
ence Press, Beijing, 2009.
[33] C. Chen and R. F. Thompson, “Temporal Specificity of
Long-Term Depression in Parallel Fiber - Purkinje Syn-
apses in Rat Cerebellar Slice,” Learning and Memmory,
Vol. 2, No. 3-4, 1995, pp. 185-198.
doi:10.1101/lm.2.3-4.185
[34] B. F. Skinner, “The Behavior of Organisms,” Apple-
ton-Century-Crofts, New York, 1938.
[35] I. P. Pavlov, “Conditioned Reflexes,” Oxford University
Press, Oxford, 1927.
[36] B. Brembs and W. Plendl, “Double Dissociation of PKC and AC Manipulations on Operant and Classical Learning in Drosophila,” Current Biology, Vol. 18, No. 15, 2008, pp. 1168-1171.
doi:10.1016/j.cub.2008.07.041
[37] C. W. Yao and G. C. Chen, “A Emotion Development
Agent Model Based on OCC Model and Operant Condi-
tioning,” 2001 International Conferences on Info-Tech
and Info-Net Proceedings, Beijing, 2001, pp. 246-250.
[38] K. Itoh, H. Miwa, M. Matsumoto, M. Zecca, H. Takanobu, S. Roccella, M. C. Carrozza, P. Dario and A. Takanishi, “Behavior Model of Humanoid Robots Based on Operant Conditioning,” Proceedings of the 2005 5th IEEE-RAS International Conference on Humanoid Robots, Tsukuba, 2005, pp. 220-225.
[39] J. S. Leng, L. Jain and C. Fyfe, “Convergence Analysis on Temporal Difference Learning,” International Journal of Innovative Computing, Information and Control, Vol. 5, No. 4, 2009, pp. 913-922.
[40] X. G. Ruan, “Neural Computational Science: Simulation
Brain Function at the Cellular Level,” National Defense
Industry Press, Beijing, 2006.
[41] M. Murata and S. Ozawa, “A Reinforcement Learning
Model Using Deterministic State-Action Sequences,” In-
ternational Journal of Innovative Computing, Informa-
tion and Control, Vol. 6, No. 2, 2010, pp. 577-590.
[42] J. Randlov, A. G. Barto and M. T. Rosenstein, “Combin-
ing Reinforcement Learning with a Local Control Algo-
rithm,” Proceedings of the Seventeenth International
Conference on Machine Learning, Vol. 1, No. 4, 2000, pp.
775-782.
[43] C. T. Chiang and C. S. Lin, “CMAC with General Basis
Functions,” Neural Network, Vol. 9, No. 7, 1996, pp.
1199-1211. doi:10.1016/0893-6080(96)00132-3
[44] A. V. Lazo and P. Rathie, “On the Entropy of Continuous
Probability Distributions,” IEEE Transactions on Infor-
mation Theory, Vol. 24, No. 1, 1978, pp. 120-122.
doi:10.1109/TIT.1978.1055832
Appendix

The detailed dynamic model of the self-balancing two-wheeled robot is as follows:

M(q) q̈ + N(q, q̇) = E T    (17)

where

M(q) = [m₁₁ m₁₂ m₁₃; m₂₁ m₂₂ m₂₃; m₃₁ m₃₂ m₃₃], N(q, q̇) = [n₁; n₂; n₃], E = [−1 −1; 1 0; 0 1]

m₁₁ = M L² + J_y
m₁₂ = m₁₃ = m₂₁ = m₃₁ = (1/2) M L R cos θ_b
m₂₂ = m_l R² + J_w + (1/4) M R² + (R²/d²)(M L² sin²θ_b + J_z)
m₂₃ = m₃₂ = (1/4) M R² − (R²/d²)(M L² sin²θ_b + J_z)
m₃₃ = m_r R² + J_w + (1/4) M R² + (R²/d²)(M L² sin²θ_b + J_z)

n₁ = −M g L sin θ_b − (1/2) M L² δ̇² sin 2θ_b + f_w (2 θ̇_b − θ̇_l − θ̇_r)
n₂ = (R²/d²) M L² θ̇_b (θ̇_l − θ̇_r) sin 2θ_b − (1/2) M L R θ̇_b² sin θ_b + f_l θ̇_l − f_w (θ̇_b − θ̇_l)
n₃ = (R²/d²) M L² θ̇_b (θ̇_r − θ̇_l) sin 2θ_b − (1/2) M L R θ̇_b² sin θ_b + f_r θ̇_r − f_w (θ̇_b − θ̇_r)