Control Task for Reinforcement Learning with Known Optimal Solution for Discrete and Continuous Actions

doi:10.4236/jilsa.2009.11002

Paper Menu >>

Journal Menu >>

J. Intelligent Learning Systems & Applications, 2009, 1, 28-41

Published Online November 2009 in SciRes (www.SciRP.org/journal/jilsa)

Control Task for Reinforcement Learning with Known

Optimal Solution for Discrete and Continuous Actions

Michael C. RÖTTGER1, Andreas W. LIEHR2

Service-Group Scientific Information Processing, Freiburg Materials Research Center, Freiburg, Germany.

Email: 1roettm@gmx.de, 2Liehr@fmf.uni-freiburg.de

Received November 10th; 2008; revised January 29th, 2009; accepted January 30th, 2009.

ABSTRACT

The overall research in Reinforcement Learning (RL) concentrates on discrete sets of actions, but for certain real-world

problems it is important to have methods which are able to find good strategies using actions drawn from continuous

sets. This paper describes a simple control task called direction finder and its known optimal solution for both discrete

and continuous actions. It allows for comparison of RL solution methods based on their value functions. In order to

solve the control task for continuous actions, a simple idea for generalising them by means of feature vectors is

presented. The resulting algorithm is applied using different choices of feature calculations. For comparing their

performance a simple measure is introduced.

Keywords: comparison, continuous actions, example problem, reinforcement learning, performance

1. Introduction

In Reinforcement Learning (RL), one solves optimal

control problems without knowledge of the underlying

system’s dynamics on the basis of the following perspec-

tive: An agent which is aware of the current state of its

environment, decides in favour of a particular action.

The performance of the action results in a change of the

agent’s environment. The agent notices the new state,

receives a reward, and decides again. This process is

repeated over and over and may be terminated by reach-

ing a terminal state. In the course of time the agent learns

from its experience by developing a strategy which

maximises its estimated total reward.

There are various RL methods for searching optimal

policies. For testing, many authors refer to standard con-

trol tasks in order to enable comparability; others define

their own test-beds or use a specific real-world problem

they want to solve. Popular control tasks used for refer-

ence are e.g. the mountain car problem [1,2,3], the in-

verted pendulum problem [4], the cart-pole swing-up

task [5,6,7] and the acrobot [2,8,9].

In most papers, algorithms are compared by their ef-

fectiveness in terms of maximum total rewards of the

policies found or the number of learning cycles, e.g.

episodes, needed to find a policy with an acceptable

quality. Sometimes the complexity of the algorithms is

also taken into account, because a fast or simple method

is more likely to be usable in practice than a very good

but slow or complicated one. All these criteria allow for

some kind of black-box testing and can give a good es-

timation whether an algorithm is suitable for a certain

task compared to other RL approaches. However, they

are not very good at giving a hint where to search for

causes why an algorithm cannot compete with others or

why, contrary to expectations, no solution is found. The

possible reasons for such kinds of failure are manifold,

starting from bad concepts to erroneous implementa-

tions.

Many ideas in RL are based on representing the

agent’s knowledge by value functions, so their compari-

son among several runs with different calculation

schemes or with known optimal solutions gives more

insight into the learning process and supports systematic

improvements of concepts or implementations.

One control task suitable for this purpose is the double

integrator (e.g. [10]), which has an analytically known

optimal solution for a bounded continuous control vari-

able (the applied force) and the corresponding

state-value function V(s) can be calculated. Since this

produces a bang-bang-like result, the solution for a con-

tinuous action variable does not differ from a dis-

crete-valued control.

In this paper, we present a simple control task, which

MICHAEL C. RÖTTGER, ANDREAS W. LIEHR29

has a known optimal solution for both discrete and con-

tinuous actions, with the latter allowing for better cumu-

lative rewards. This control task is a simplified variant of

the robot path finding problem [11,12]. Since its state is

described by real-valued numbers, it is useful to calibrate

new algorithms working on continuous state spaces be-

fore tackling more difficult control tasks in order to rule

out the teething problems. It allows for a detailed com-

parison of RL solutions in terms of the state-value func-

tion V and action-value function Q and may serve for

didactic purposes when discussing solution methods for

discrete and continuous actions.

In real-world problems, the state space is often un-

countable and infinitely large, e.g. when the state is de-

scribed by real numbers. For this kind of tasks, simple

tables cannot be used as memory for saving all possible

values of V(s) and Q(s, a). A common way to handle this

problem is the usage of function approximation, which

introduces some inaccuracy to the system. Knowing an

exact solution may help to develop heuristics to find ap-

propriate parameters approximating the optimal solution

well.

The paper is structured as follows: Section 2 intro-

duces the notation describing RL problems and their so-

lutions. The proposed control task and its optimal solu-

tion are presented in Section 3. To solve this problem for

discrete and continuous actions by means of RL we ap-

plied a well-known algorithm combined with feature

vectors for actions. This is discussed in Section 4. In

Section 5 we present simulation results for discrete and

continuous actions. Section 6 concludes the paper with a

summary and an outlook.

2. Notation

In terms of RL, a control task is defined for a specific

environment, in which an agent has to reach an objective

by interacting with the environment. The learning proc-

ess is organised in one or more episodes. An episode

starts with an initial state s0. It can either last forever

(continuing task) or end up with a terminal or absorbing

state ST after performing T actions (episodic task). At

time t, the agent becomes aware of the current state st of

the environment and decides in favour of an action at.

Performing this action results in a new state st+1 and the

agent receives a reward rt+1. For an episodic task, the

sequence of states, rewards and actions is denoted as

0121

011 22

(,)(,)( ,).

aaaa

rsrsr s



 

Figure 1 shows the information exchange between the

agent and its environment.

A deterministic policy



is a mapping from states



S to actions a



 : S  A.

In general, for a given state s, the action space A(s)

also depends on the state s, because in some states some

actions may not be available. Here, the state should be

representable as a tuple of n real numbers, so Sn.

The action space is considered in two variants: A finite

set 1

={ ,,}

aa, whose elements will be denoted as

discrete actions, and alternatively an infinite set Am,

whose elements are called continuous actions. For the

control task presented in Section 3, both cases are dis-

cussed. In general and for real-world control tasks, also

mixed action spaces having both discrete and continuous

actions are imaginable.

Figure 1. RL’s perspective of control tasks as described in [13]. The diagram shows the communication of the agent with the

environment.

30 MICHAEL C. RÖTTGER, ANDREAS W. LIEHR

In order to handle episodic and continuing tasks with

the same definitions for value functions, we follow the

notation of [13]: Episodic tasks with terminal state sT can

be treated as continuing tasks remaining at the absorbing

state without any additional reward: sT+k≡sT and rT+k≡0

for .



=1,2,3,k

Using this notion, for all states sS, the state-value

function

()= =

tk t

VsErs s















 (1)

concerning a given policy π holds the information, which

total discounted reward can be expected when starting in

state s and following the policy π. The quantity [0,1]





is called discount factor. For <1



the agent is myopic

to a certain degree, which is useful to rank earlier re-

wards higher than latter ones or to limit the cumulative

reward if , i.e. no absorbing state is reached.

=T

In analogy to the value of a given state, one can also

assign a value to a pair (s, a) of state

S and action

A(s). The action-value function a

(, )=,

tt t

QsaErs sa a























(2)

concerning a policy π is the estimated discounted cumu-

lative reward of an agent in state s, which decides to per-

form the action a and then follows π.

For the deterministic episodic control task proposed in

this paper we have chosen =1



, so the Equations (1)

and (2) simply reduce to finite sums of rewards.

In order to maximise the agent’s estimated total re-

ward, it is useful to define an ordering relation for poli-

cies: A policy π is “better than” a policy π', if

:()()VsV ss





 

S. (3)

An optimal policy complies with





 (4)

and is associated with the optimal value functions V

* and

Q* , which are unique for all possible optimal policies:

*(,):=(, ),QsaQ sa



 (5)

():=()=(,).

max

VsV sQsa



 (6)

Given the optimal action value function Q* and a current

state s, an optimal action a* can be found by evaluating

()=argmax(, ).

as Qsa

(7)

3. The Directing Problem and Its Solution

3.1 The Control Task

The directing problem is a simplified version of the path

finding problem [11,12] without any obstacles. The

agent starts somewhere in a rectangular box given by





max max

=(x,y)|0 xx0 yy. S (8)

Its real-valued position s=(x, y) is the current state of

the system.

The objective of the agent is to enter a small rectan-

gular area in the middle of the box (Figure 2). This target

area is a square with centre (cx,cy) and a fixed side length

of 2dT. Each position in the target area, including its

boundary, corresponds to a terminal state.

In order to reach this target area, the agent takes one or

more steps with a constant step size of one. Before each

step, it has to decide, which direction to take. If the

resulting position is outside the box, the state will not be

changed at all. So given the state s = (x, y), the action a =

, and using the abbreviation

(, )=(cos ,sin),xy xy





 (9 )

the next state =( ,)





max max

(, )if(00),

(, )=(, )otherwise.

xyx xyy

xy xy

 



 



 

(10)

In terms of RL, the choice of is the agent’s action

which is always rewarded with r ≡ -1 regardless of the

resulting position. By setting the discount factor =1



the total reward is equal to the negative number of steps

needed to reach the target area.

For the analytical solution and for the simulation re-

sults, xmax = ymax = 5, dT =1

2 and cx = cy= 2.5 have been

chosen.

The optimal solution for this problem, although it is

obvious to a human decision maker, has two special fea-

tures:

 In general, the usage of continuous angles allows

for larger total rewards than using only a subset of

discrete angles and

MICHAEL C. RÖTTGER, ANDREAS W. LIEHR31

Figure 2. The direction finder. The state of the agent is its real-valued position somewhere in the simulation domain. Its

objective is to enter the shaded target area by choosing a direction before each constant-sized step. This choice may be

completely free (continuous actions) or limited to a finite number of four angles (discrete actions). The reward after each step

is -1, so an optimal policy would minimise the total number of steps.

 the state value function V

* and Q* can be expressed

analytically, so intermediate results of reinforce-

ment learning algorithms can be compared to the

optimal solution with arbitrary precision.

In the following, the solutions for discrete and con-

tinuous actions are described.

3.2 Discrete Actions

For a finite set

(11)

of four discrete directions, which can be coded as

0,, ,













using angles in radians, the state value for

an optimal policy for an arbitrary position (x, y) is given

discr (, )=(())(()).

Vxy hgxhgy   (12)

The symbol denotes the ceiling function, which

returns the smallest integer equal to or greater than a

given number, and the terms and

are abbreviations for



(())

hg x(())

hgy

()=| |

xxcd (13)

()=| |

yyT

yycd (14)

0for 0

()=zfor 0

hz z







 (1 5)

The value of a terminal state is zero, since no step is

needed.

Figure 3 shows a plot of the optimal state-value func-

tion for discrete actions. Due to the limited set of direc-

tions, the agent needs four steps when starting in a corner

of the domain.

3.3 Continuous Actions

If arbitrary angles [0,2 ]





are allowed, i.e. continu-

ous actions, the optimal state value is given by:

cont(,) =(())(()).

Vxyhgxhgy











 (16)

As for discrete actions, the helper functions defined in

(13), (14), and (15) have been used.

Figure 4 shows a plot of the optimal state-value func-

tion for continuous actions. Contrary to the case of dis-

crete actions, the agent only needs three steps at the most

to enter the target area. Except for the set of possible

directions, the task specification is the same as for dis-

crete actions.

32 MICHAEL C. RÖTTGER, ANDREAS W. LIEHR

Figure 3. Optimal state value function V *(x, y) for four discrete actions → , ↑ , ← and ↓ . Because of 1



 it is equal to the

negative number of steps needed in order to enter the target square in the centre, so V

*(x, y)





4, 3,1,02, . xmax = ymax =

5, dT =1

2 and cx = cy= 2.5.

Figure 4. Optimal state value function V

*(x, y) for continuous actions, i.e. [0, 2[





. The state value is equal to the negative

number of steps needed to enter the target square in the centre, so V

*(x, y)





3, 2,1,0 . xmax = ymax = 5, dT =1

2 and cx = cy=

2.5.

MICHAEL C. RÖTTGER, ANDREAS W. LIEHR33

Figure 5. Action values for all states when choosing direction → (or =0) as first action and following an opti- mal policy

with discrete actions afterwards. The corresponding state value function is depicted in Figure 3. Values rank from −5 to 0.

Figure 6. Action values for all states when choosing direction → (or =0) as first action and following an optimal policy for

continuous actions afterwards. The according state value function is depicted in Figure 4. Values rank from −4 to 0.

34 MICHAEL C. RÖTTGER, ANDREAS W. LIEHR

3.4 Action Values

The problem as defined above is deterministic. With

given and

(, )Vxy

=1



, the evaluation of the corre-

sponding action value function is as simple as calculating

the state value of the subsequent state (x', y') by means of

(9) and (10), except for absorbing states:

Q*(x,y,) (17)

0 if(x,y)isterminal,

=( ,) 1otherwise,







Vxy

where the next state (x', y') is defined by (10). Its optimal

state value V *(x', y') is computed through evaluation of

(12) for discrete actions or (16) for continuous actions.

For the direction →, or =0, the action value func-

tion Q* (x, y, →) is depicted in Figure 5 for four discrete

angles (→, ↑, ←,↓ ) and in Figure 6 for continuous an-

gles. Disregarding the target area both plots can be de-

scribed by V

* (s)－1 shifted to the left with a respective

copy of the values for to .

]3,4]x]4,5]x

4. Algorithm for Searching Directions with

Reinforcement Learning

In the following an algorithm is presented which is capa-

ble of solving the direction finder problem for discrete

and continuous actions. It is based on a linear approxi-

mation of the action value function Q and is often

used and well-known for discrete actions. Note, that we

have slightly changed the notation in order to apply ideas

from state generalisation to the generalisation of actions,

which enables us to handle continuous actions.



4.1 Handling of Discrete and Continuous Actions

When searching for a control for tasks with a small finite

number of actions (discrete actions), it is possible to as-

sign an individual approximation function (, )()



QsaQs

ble action. When using linear approxima-

tion, this is often implemented by using a fixed feature

vector for every state s independent of an action a and a

separate weight vector θa for every action:

to every possi

()= 





(s) (18)

For a very large set of possible actions, e.g. if the ac-

tion is described by (dense) real-valued numbers drawn

from an interval (continuous actions), this approach

reaches its limits. In addition, the operation(, )

maxaQsa



becomes very expensive when implemented as brute

force search over all possible actions and can hardly be

performed within a reasonable time.

There are many ideas for dealing with continuous ac-

tions in the context of Reinforcement Learning. Exam-

ples are the wirefitting approach [14], the usage of an

incremental topology preserving map building an aver-

age value from evaluating different discrete actions [15]

and the application of neural networks (e.g. [16]).

To the best of our knowledge, so far no method has

been shaped up as the algorithm of choice; so we want to

contribute another idea which is rather simple and has

been successfully applied to the direction finder problem.

The basic idea, which is more a building block than a

separate method, is not only to use feature vectors for

state generalisation, but also to calculate separate feature

vectors for actions. We combine state and action features

in such a manner that each combination of feature com-

ponents for a state and for an action uses a separate

weight. If one has a combined feature



(s,a), the linear

approximation can be expressed as

(, )=Qsa







(s,a) (19)

where each weight, a component of θ, refers to a combi-

nation of certain sets of states and actions. This raises the

question how to build a combined feature vector



(s,a).

Here, two possibilities are considered.

Having a state feature and an action fea-

ture , with components

and (20)

(21)

an obvious way of combination would be to simply con-

catenate (CC) them:

(22)

(23)

Concerning linear approximation, this would lead to a

separation in weights for states and actions which is not

suitable for estimating Q(s, a) in general, because the

coupling of states and actions is lost:

MICHAEL C. RÖTTGER, ANDREAS W. LIEHR35

(24)

Alternatively, one can form a matrix product (MP) of

the feature vectors which results in a weight for every

combination of components of the vectors and

building a new feature vector:

(25)

with =1,,, =1,,.

injn

The resulting matrix has been reshaped to a vector by

concatenating the rows in order to stick with the termi-

nology of a feature vector. This kind of feature assem-

bling has been used for the results in the following sec-

tions. From now on, the tag MP is left out for better

readability.

With this kind of combined feature vector, the known

case of discrete actions with separate weights for each

action can be expressed by means of (19). This is done

by choosing the action’s feature vector as

(26 )

with and

for =i

aa and 0 otherwise. This is equivalent to the

usage of binary features with only one feature present,

the latter corresponds to the action a.

For continuous actions, all methods to calculate fea-

ture vectors for states can also be considered to be used

for actions. One has to choose some appropriate way to

generalise every action by a feature vector.

There are several ideas for procedures to compute a

feature vector. One of the most popular is the usage of tile

coding in a Cerebellar Model Articulation Controller

(CMAC) architecture, e.g. successfully applied for con-

tinuous states in [2]. A comparison of radial-based fea-

tures (RBF) with CMAC for function approximation is

given in [17]. In the following, we present different kinds

of state and action features used for the direction finder.

4.2 State Features

One of the simplest feature types are binary features. For

states of the direction finder, we used the square-shaped

areas

(2 7)

for with centres

,=1, ,20ij 

2121

=,.

(28)

c







For binary state features we evaluate

(29 )

Here s denotes the tuple (x, y). Row by row, the com-

ponents of (29) are combined to a feature vector

Similarly, for radial-based state features we used

(30)

Again, the components are combined to a feature vec-

tor which is additionally normalised afterwards:

(31)

4.3 Action Features

For the set of discrete actions Equation (26) results in a

binary feature vector

(32)

For continuous actions a, i.e. for arbitrary angles

[0,2 ]





, we use the sixteen components

(33)

with and

=1, ,16l. These compo-

nents decay quite fast and can be regarded as zero when

used at a distance of 2π from the centers of the

Gaussians:

(34)

This is used to build feature vectors accounting for the

periodicity

(, , )=(, ,2),QxyQxy







which should be met by the approximation (, ,)Qxy



.

In order to approximately achieve this, we combine the

Gaussians as

36 MICHAEL C. RÖTTGER, ANDREAS W. LIEHR

(35)

and define a normalised action feature vector

(36)

which is finally used in (25) as .

Using Equations (34), (35), (25) and (19) it follows that

Feature vectors for continuous angles built like this are

henceforth called cyclic radial-based. In order to have a

second variant of continuous action features at hand, we

directly combine the components defined in (33)

to a feature vector and normalise it in a way equivalent

to (36). These features are denoted with the term ra-

dial-based and are not cyclic.

4.4 Sarsa( λ ) Learning

For the solution of the direction finder control task, the

gradient-descent Sarsa(λ) method has been used in a

similar way as described in [13] for a discrete set of ac-

tions when accumulating traces. The approximation

of Q is calculated on the basis of (19) using (23) for



(s,a)

with features as described in Subsections 4.2 and 4.3.

The simulation started with =0 . There were no

random actions and the step-size parameter



tθ



for the

learning process has been held constant at 0.5=



Since the gradient of is



(s,a), the update rule can

be summarised as



111

=(,;)(,;

ttttt ttt

rQsaQsa









θθ)



(s)

=rt+1+



tT(



(st+1,at+1)-(st,at)) (37)

ttt



θθδe (38)

et+1=et+



(st+1,at+1) (39)

The vector et holds the eligibility traces and is initial-

ised with zeros for t = 0 like θt. An episode stops when

the agent encounters a terminal state or when 20 steps

are taken.

4.5 Action Selection

Having an approximation (s,a) at hand, one has to

provide a mechanism for selecting actions. For our ex-

periments with the direction finder, we simply use the

greedy selection



()=argmax(, ).

Qsa



 (40)

For discrete actions, π(s) can be found by comparing

the action values for all possible actions. For continuous

actions in general, the following issues have to be con-

sidered:

1) standard numerical methods for finding extrema

normally search for a local extremum only,

2) the global maximum could be located at the

boundary of (s), and

3) the search routine operates on an approximation

(s,a) of Q(s,a).

For the results of this paper, the following approach

has been taken: The first and the second issue are faced

by evaluating (, , )Qxy



 at 100 uniformly distributed

actions besides =0 and =2π, taking the best candi-

date in order to start a search for a local maximum. The

latter has been done with a Truncated-Newton method

using a conjugate-gradient (TNC), see [18]. In general,

in the case of multidimensional actions , the problem

of global maxima at the boundary of

(s) may be tack-

led by recursively applying the optimisation procedure

on subsets of (s) and fixing the appropriate compo-

nents of a at the boundaries of the subsets. The third is-

sue, the approximation, cannot be compassed, but allows

for some softness in the claim for accuracy, which may

be given as an argument to the numerical search proce-

dure.

5. Simulation Results

In the following, the direction finder is solved by apply-

ing variants of the Sarsa(λ) algorithm, which differ in the

choice of features for states and actions. Each of them is

regarded as a separate algorithm whose effectiveness

concerning the direction finder is expressed by a single

number, the so-called detour coefficient. In order to

achieve this, the total rewards obtained by the agent are

related to the optimal state values V *(s0), where s0 is the

randomly chosen starting state varying from episode to

episode.

MICHAEL C. RÖTTGER, ANDREAS W. LIEHR37

5.1 Detour Coefficient

Let be a set of episodes. We define a measure called

detour coefficient



() by

(41)

Here, Re is the total (discounted) reward obtained by

the agent in episode e and is the respective initial

state. The quantity |

| denotes the number of episodes in

. In our results the ||=1000 last non-trivial episodes

have been taken into account to calculate the detour co-

efficient by means of (41), whereas non-trivial means

that the episodes start with a non-terminal state, so the

agent has to decide for at least one action. For V * one has

to choose from Equations (12) and (16) according to the

kind of actions used.

The smaller



(), the more effective a learning pro-

cedure is to be considered. For the direction finder, the

value of



() can be interpreted as the average number

of additional steps per episode the RL agent needs in

comparison to an omniscient agent. It always holds that



( ), because the optimal solution cannot be out-

performed.

0

The detour coefficient depends on the random set of

starting states of the episodes e, therefore we con-

sider



() as an estimator and assign a standard error for

the mean by evaluating

(42)

with .

=*()

eVs R





For the following results, the set of randomly chosen

starting states is the same for the algorithms compared to

each other.

5.2 Discrete Actions

For discrete actions, we have solved the direction finder

problem for two different settings, with binary state fea-

tures according to (29) and with radial-based state fea-

tures according to (30) and (31). The other parameters

were: 10000 episodes, 0.5=



, 0.3



, max 20T



Calculating the detour coefficient (41) for the last 1000

episodes not having a terminal initial state we observe



()=0 for binary state features. This means, these epi-

sodes have an optimal outcome. Repeating the simula-

tion with radial-based state features, there were six epi-

sodes of the last 1000 non-trivial episodes which in total

took 12 steps more than needed, so



( ).

The respective approximation Qx

0.012 0.005

, )(,y





(Qx



after 10000

episodes for radial-based state features is shown in Fig-

ure 7a as image plot which is overlayed by a vector plot

visualising the respective greedy angles. An illustration

of the difference *(, )max

Vxy ,, )y



 for sample

points on a grid is depicted in Figure 7b. Due to (6) such

a plot can give a hint, where the approximation is inac-

curate. Here the discontinuities of V * stand out.

Both kinds of features are suitable in order to find an

approximation from which a good policy can be

deduced, but the binary features perform perfectly in the

given setting, which is not surprising because the state

features are very well adapted to the problem.



5.3 Continuous Actions

For continuous actions, which means [0, 2[



, solu-

tions of the direction finder problem have been obtained

for four different combinations of state and action fea-

tures. The feature vectors have been constructed as de-

scribed in Subsections 4.2 and 4.3. The results are sum-

marized in Table 1.

For all simulations the following configuration has

been used: 0.5=



, 0.3





, maximum episode time

max 20T



, and 20000 episodes. The values of



() and



() have been calculated as discussed in Subsection

5.1. Like for binary state features and discrete actions,

the last 1000 non-trivial episodes did not always result in

an optimal total reward, so



()>0 for every algorithm.

Our conclusion with respect to these results for the given

configuration and control task is:

 Radial-based state features are more effective than

binary state features.

 When using radial-based state features, cyclic ra-

dial-based action features outperform radial-based

features lacking periodicity.

Table 1. Results for continuous actions for different

combinations of state and action features:

bin (binary), rb (radial-based), crb (cyclic radial-based).

state featuresaction features η(ε) sη(ε)

bin rb 0.109 0.011

bin crb 0.115 0.013

rb rb 0.090 0.011

rb crb 0.063 0.009

38 MICHAEL C. RÖTTGER, ANDREAS W. LIEHR

(a)

(b)

Figure 7. Results for learning with discrete actions, calculated by performing 10000 episodes, using radial-based state

features. Possible directions: . For parameters, see Subsection 5.2. For these plots, the value functions and

the policy have been evaluated on 40x40 points on a regular grid in the state space. (a) Discrete greedy actions

φ{,,,}∈

argmax

gφ

=Q(x,

)

 and their values 

Q(x . The angles corresponding to the greedy actions are denoted by

arrowheads. (b) Difference between

, y,

(, )



φ)

and (

maxφQx,y,)



. The largest deviations are at the lines of discontinuity of

(, )



. Compare to Figure 3.

MICHAEL C. RÖTTGER, ANDREAS W. LIEHR39

(a)

(b)

Figure 8. Results for learning with continuous actions, calculated by performing 20000 episodes, using radial- based state

features and cyclic radial-based action features. In principal, all directions are selectable: . For parameters see

Subsection 5.3. Like in Figure 7, for the sampling of a regular grid of 40x40 points has been used. (a)

Continuous greedy actions

[∈φ0,2π[

)(

maxφQx,y,φ

argmax

gφ

φ=Q(x,

,φ)

 and their values 

Q(x

(, )



, y,φ). The angles corresponding to the greedy

actions are denoted by arrowheads. (b) Difference between

and ()



Qx,y,

maxφ

. The largest deviations are located

near the discontinuities of (,)



. Compare with Figrue 4.

40 MICHAEL C. RÖTTGER, ANDREAS W. LIEHR

Figure 8 illustrates the function ),,(

max



yxQ



and

its comparison with V*(x, y) for radial-based state fea-

tures and cyclic radial-based action features after 20000

episodes. Monitoring this difference during the learning

process helps to understand where difficulties for the

approximation scheme or the action selection arise. Here,

like for discrete actions, the discontinuities of

stand out. In order to get additional information, one can

add an estimation procedure for resulting in an ap-

proximation , e.g. by using the TD(

(, )y





) algorithm

[13,19], which runs independently in parallel to the

learning process. Inspecting the evolution of supports

an assessment about the covering of the state space.



6. Conclusions and Outlook

The direction finder is a control task in the domain of

real-valued state spaces which allows choosing either

discrete or continuous actions. For both cases, the opti-

mal value functions and are known analytically

and can be compared with their approximations in detail.

The optimal policy for continuous actions is not

bang-bang-like and generally results in at least as good

or better total rewards than the optimal policy used for

four discrete actions. In general, the application of fine

discretisations of the action space in order to find a con-

tinuous policy is not an option. In the case of the direc-

tion finder, one could think about using

discrete angles because the solution should converge to

the solution for optimal angles, but the mere effort to

find a greedy action increases linearly with the number

of actions. In general, problems with continuous actions

cannot be solved in a satisfying way by simply refining

the methods for discrete actions but require new ideas.

The development of these ideas is supported by evident

control tasks like the direction finder.

8,16,32, 

Together with a measure like the detour coefficient

(41), the direction finder can serve as a tool for testing

new RL algorithms working on real-valued state spaces

for both discrete and continuous actions allowing for

intermediate comparison of the approximations and

with the known optimal value functions and .

The detour coefficient can also be used in combination

with any other control task for which the optimal state

value function is known.



*

In order to solve the control task for continuous ac-

tions, we have introduced feature vectors for actions in

order to approximate . This may serve as a

building block in other Reinforcement Learning algo-

rithms.

(, )Qsa



The proposed algorithm has been applied to the direc-

tion finder control task for discrete and continuous ac-

tions using different combinations of features. In the case

of continuous actions, we are able to infer from the de-

tour coefficient that certain kinds of features are more

suitable than others.

There are many improvements which can be applied to

the search algorithm. So far, when searching for

, the linearity of together with the

knowledge of the feature calculation scheme has not

been taken into account. This may result in faster and

more accurate optimisation. In addition, it would be in-

teresting to use the detour coefficient to compare the

algorithm with completely different approaches for con-

tinuous actions or other kinds of action selection.

argmax(,)

aQsa

Q



Based on the known optimal solution, a debugging

procedure can be defined, which helps to ensure that the

basic capabilities of a new algorithm are present. As-

suming the existence of an interface which allows

switching between control tasks without changing the

implementation of the algorithm, the implementation can

be debugged with the presented direction finder control

problem whose solution is exactly known in detail.

Such a debugging procedure can be composed of five

subsequent steps:

1) Implement the proposed control task, the direction

finder, along with functions to calculate *

V(s) and

Q(s,a).

2) Prove the action selection capabilities by applying

it to *

Qand compare the outcome of several runs with

V(s0).

3) Prove the approximation abilities by “setting” the

current approximation Q

 of the action value function

to *

Qas closely as possible, e.g. by using minimal

least- squares.

4) Prove the stability of the algorithm by starting with

the approximation of *

Q from the previous step, per-

forming several learning cycles and testing for simi-

larity of *(,) and *

V(s).

maxaQsa

5) Prove the convergence of the algorithm by starting

with sparse or no previous knowledge and searching a

“good” approximation Q

(s,a) for *

Q(s,a).

These checks can be evaluated by comparing

maxaQ*(s,a) or with V*(s) or using a meas-

ure like the detour coefficient as defined in (39). If an

max(,)

aQsa



MICHAEL C. RÖTTGER, ANDREAS W. LIEHR41

algorithm fails to pass this procedure, it will probably

fail in more complex control tasks.

The direction finder is an example problem suitable

for the testing of algorithms which are able to work on

two-dimensional continuous state spaces, but there are

more use cases, e.g. with more dimensions of state

and/or action spaces. A collection of similar control

problems with known optimal solutions (e.g. more com-

plex or even simpler tasks) together with a debugging

procedure can comprise a toolset for testing and com-

paring various Reinforcement Learning algorithms and

their configurations respectively.

7. Acknowledgment

We would like to thank Prof. J. Honerkamp for fruitful

discussions and helpful comments.

REFERENCES

[1] J. A. Boyan and A. W. Moore, “Generalization in rein-

forcement learning: Safely approximating the value func-

tion,” in Advances in Neural Information Processing Sys-

tems 7, The MIT Press, pp. 369-376, 1995.

[2] R. S. Sutton, “Generalization in reinforcement learning:

Successful examples using sparse coarse coding,” in Ad-

vances in Neural Information Processing Systems, edited

by David S. Touretzky, Michael C. Mozer, and Michael

E. Hasselmo, The MIT Press, Vol. 8, pp. 1038-1044,

1996.

[3] W. D. Smart and L. P. Kaelbling, “Practical reinforce-

ment learning in continuous spaces,” in ICML’00: Pro-

ceedings of the Seventeenth International Conference on

Machine Learning, Morgan Kaufmann Publishers Inc., pp.

903-910, 2000.

[4] M. G. Lagoudakis, R. Parr, and M. L. Littman, “Least-

squares methods in reinforcement learning for control,”

in Proceedings of Methods and Applications of Artificial

Intelligence: Second Hellenic Conference on AI, SETN

2002, Thessaloniki, Greece, Springer, pp. 752-752, April

11-12, 2002.

[5] K. Doya, “Reinforcement learning in continuous time and

space,” Neural Computation, Vol. 12, pp. 219-245, 2000.

[6] P. Wawrzy ński and A. Pacut, “Model-free off-policy

reinforcement learning in continuous environment,” in

Proceedings of the INNS-IEEE International Joint Con-

ference on Neural Networks, pp. 1091-1096, 2004.

[

7] J. Morimoto and K. Doya, “Robust reinforcement learn-

ing,” Neural Computation, Vol. 17, pp. 335-359, 2005.

[8] G. Boone, “Efficient reinforcement learning: Model-

based acrobot control,” in International Conference on

Robotics and Automation, pp. 229-234, 1997.

[9] X. Xu, D. W. Hu, and X. C. Lu, “Kernel-based least

squares policy iteration for reinforcement learning,”

IEEE Transactions on Neural Networks, Vol. 18, pp.

973-992, 2007.

[10] J. C. Santamaría, R. S. Sutton, and A. Ram, “Experiments

with reinforcement learning in problems with continuous

state and action spaces,” Adaptive Behavior, Vol. 6, pp.

163-217, 1997.

[11] J. D. R. Millán and C. Torras, “A reinforcement connec-

tionist approach to robot path finding in non-maze-like

environments,” Machine Learning, Vol. 8, pp. 363-395,

1992.

[12] T. Fukao, T. Sumitomo, N. Ineyama, and N. Adachi,

“Q-learning based on regularization theory to treat the

continuous states and actions,” in the 1998 IEEE Interna-

tional Joint Conference on Neural Networks Proceedings,

IEEE World Congress on Computational Intelligence,

Vol. 2, pp. 1057-1062, 1998.

[13] R. S. Sutton and A. G. Barto, “Reinforcement learning:

An introduction of adaptive computation and machine

learning,” The MIT Press, March 1998.

[14] L. C. Baird and A. H. Klopf, “Reinforcement learning

with high-dimensional, continuous actions,” Technical

Report, WL-TR-93-1147, Wright-Patterson Air Force

Base Ohio: Wright Laboratory, 1993.

[15] J. D. R. Millán, D. Posenato, and E. Dedieu, “Continu-

ous-action q-learning,” Machine Learning, Vol. 49, pp.

247-265, 2002.

[16] H. Arie, J. Namikawa, T. Ogata, J. Tani, and S. Sugano.

“Reinforcement learning algorithm with CTRNN in con-

tinuous action space,” in Proceedings of Neural Informa-

tion Processing, Part 1, Vol. 4232, pp. 387-396, 2006.

[17] R. M. Kretchmar and C. W. Anderson, “Comparison of

CMACS and radial basis functions for local function ap-

proximators in reinforcement learning,” in International

Conference on Neural Networks, pp. 834-837, 1997.

[18] R. Dembo and T. Steihaug, “Truncated-Newton algo-

rithms for large-scale unconstrained optimization,”

Mathematical Programming, Vol. 26, pp.190-212, 1983.

[19] R. S. Sutton, “Learning to predict by the methods of

temporal differences,” Machine Learning, Vol. 3, pp.

9-44, 1988.