Communications and Network, 2013, 5, 86-89
doi:10.4236/cn.2013.51B020 Published Online February 2013 (http://www.scirp.org/journal/cn)
Topological Order Value Iteration Algorithm for Solving
Probabilistic Planning*
Xiaofei Liu1, Mingjie Li2, Qingxin Nie3
1Polytechnic School, SanYa University, Sanya, China
2School of Foundation Courses, SanYa University, Sanya, China
3Department of Computer Science, DunHua Vocational-technical School, Dunhua, China
Received 2012
ABSTRACT
AI researchers typically formulate probabilistic planning problems under uncertainty as Markov Decision Processes (MDPs). Value Iteration is an inefficient algorithm for MDPs because it puts the majority of its effort into backing up the entire state space, which turns out to be unnecessary in many cases. Many approaches have been proposed to overcome this problem; among them, LAO*, LRTDP and HDP are state-of-the-art ones. All of these use reachability analysis and heuristics to avoid some unnecessary backups. However, none of them fully exploits the graphical features of MDPs or uses those features to derive the best backup sequence over the state space. We introduce an improved algorithm named Topological Order Value Iteration (TOVI) that circumvents the problem of unnecessary backups by detecting the structure of MDPs and backing up states according to topological sequences. The experimental results demonstrate the effectiveness and excellent performance of our algorithm.
Keywords: Probabilistic Planning; Markov Decision Processes; Dynamic Programming; Value Iteration
1. Introduction
In recent years, intelligent planning has developed into an important branch of artificial intelligence research, and planning under uncertainty in particular has attracted increasing attention. Among the many research approaches, probabilistic methods describe uncertain information more accurately, so they have been studied widely and their solution methods have gradually matured. Probabilistic planning uses probability distributions to describe the uncertainty of the initial world state and of the effects of actions. In 2004, IPC-4 (the 2004 International Planning Competition) added dedicated probabilistic planning domains, which shows that research on probabilistic planning is an important topic in the field of intelligent planning.
Markov decision processes (MDPs) are a model for representing probabilistic planning problems. Value iteration and policy iteration are two fundamental dynamic programming algorithms for solving MDPs [1]. However, these two algorithms are sometimes inefficient: they spend too much time backing up states, often redundantly. Recently, several types of algorithms have been proposed to solve MDPs efficiently. The first type uses reachability information and heuristic functions to omit some unnecessary backups, such as RTDP [2], LAO* [3], LRTDP [4] and HDP [5]. The second uses approximation methods to simplify the problems. The third aggregates groups of states of an MDP by features, represents them as factored MDPs and solves the factored MDPs. Factored MDPs are often exponentially simpler, but the strategies to solve them are tricky; sLAO* [6] and sRTDP [7] are examples. One can also use prioritization to decrease the number of inefficient backups; faster dynamic programming [8] and ranking policies in discrete Markov decision processes [9] are two recent examples.
In this paper we propose an improvement of the value iteration algorithm named Topological Order Value Iteration, which combines the first and the last technique. It decomposes an MDP into strongly connected components, orders the components topologically, and then applies value iteration to the components in that order. This prevents the computation of a large number of useless states and arranges the useful states in an orderly fashion, so backups are performed in the best order and only when necessary. Topological Order Value Iteration is itself not a heuristic algorithm, but it can efficiently make use of existing heuristic functions to initialize value functions.
2. Background
*Project supported by the Sanya City Centre Scientific and Technological Cooperation Foundation (Grant No. 2011YD44).
2.1. Markov Decision Processes
AI researchers typically use MDPs to formulate probabilistic planning problems. An MDP is defined as a four-tuple <S, A, T, C>, where S is a discrete set of states, A is a finite set of all applicable actions, T is the transition matrix describing the domain dynamics, and C denotes the cost of action transitions. The agent executes its actions in discrete time steps called stages. At each stage, the system is in one distinct state s ∈ S. The agent can pick any action a from the set of applicable actions Ap(s) ⊆ A, incurring a cost C(s, a). The action takes the system to a new state s' stochastically, with probability Ta(s'|s).
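To make the four-tuple concrete, here is a minimal Python sketch of one possible encoding of such an MDP; the states, actions, probabilities and costs are invented purely for illustration and are not taken from the paper.

```python
# Hypothetical encoding of an MDP <S, A, T, C> with plain dictionaries.
# T[s][a] maps each successor s' to the probability Ta(s'|s);
# C[s][a] is the cost of applying action a in state s.
S = {"s0", "s1", "goal"}
G = {"goal"}                                   # sink goal states
T = {
    "s0":   {"move": {"s1": 0.8, "s0": 0.2}, "wait": {"s0": 1.0}},
    "s1":   {"move": {"goal": 0.9, "s0": 0.1}},
    "goal": {},                                # no actions leave a sink goal
}
C = {
    "s0": {"move": 1.0, "wait": 0.5},
    "s1": {"move": 1.0},
    "goal": {},
}
Ap = {s: set(T[s]) for s in S}                 # applicable actions Ap(s)
```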
The horizon of an MDP is the number of stages for which costs are accumulated. There is a set of sink goal states G ⊆ S; reaching one of them terminates the execution. To solve the MDP we need to find an optimal policy π: S → A, a probabilistic execution plan that reaches a goal state with the minimum expected cost. Any optimal policy must satisfy the following system of Bellman equations.
The value function of a policy π is defined as:
$$V^{\pi}(s) = C(s,\pi(s)) + \gamma \sum_{s' \in S} T_{\pi(s)}(s' \mid s)\, V^{\pi}(s'), \qquad \gamma \in [0,1] \qquad (1)$$
and the optimal value function is defined as:
$$V^{*}(s) = \min_{a \in Ap(s)} \Big[ C(s,a) + \gamma \sum_{s' \in S} T_{a}(s' \mid s)\, V^{*}(s') \Big], \qquad \gamma \in [0,1] \qquad (2)$$
2.2. Dynamic Programming
Most optimal MDP algorithms are based on dynamic programming. Its usefulness was first demonstrated by a simple yet powerful algorithm named value iteration [10]. Value iteration first initializes the value function arbitrarily. Its basic idea is to iteratively update the value function of every state until the values converge; in each iteration, the value function is updated according to Equation (2). We call one such update a Bellman backup. The Bellman residual of a state s is defined as the difference between the value functions of s in two consecutive iterations. The Bellman error is defined as the maximum Bellman residual over the state space. When this Bellman error is less than some threshold value, we conclude that the value functions have converged sufficiently.
The main drawback of the value iteration algorithm is that, in each iteration, the value function of every state is updated, which is often unnecessary. Firstly, some states are backed up before their successor states, and this type of backup is often fruitless. Secondly, different states converge at different rates; when only a few states have not yet converged, we may only need to back up a subset of the state space in the next iteration.
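The following minimal Python sketch implements plain value iteration as just described: Bellman backups per Equation (2), terminating when the Bellman error drops below a threshold. The tiny example MDP and all helper names are hypothetical, not part of the paper.

```python
def value_iteration(states, Ap, T, C, goals, delta=1e-6, gamma=1.0):
    """Repeat Bellman backups (Equation (2)) over all states until the
    Bellman error drops below the threshold delta."""
    V = {s: 0.0 for s in states}                 # arbitrary initialization
    while True:
        error = 0.0
        for s in states:
            if s in goals or not Ap[s]:
                continue                         # sink goal states keep value 0
            new_v = min(
                C[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                for a in Ap[s]
            )
            error = max(error, abs(new_v - V[s]))  # Bellman residual of s
            V[s] = new_v
        if error < delta:                        # Bellman error small enough
            return V

# Hypothetical three-state example: "move" eventually reaches the goal.
S = {"s0", "s1", "goal"}
G = {"goal"}
T = {"s0": {"move": {"s1": 0.8, "s0": 0.2}},
     "s1": {"move": {"goal": 0.9, "s0": 0.1}},
     "goal": {}}
C = {"s0": {"move": 1.0}, "s1": {"move": 1.0}, "goal": {}}
Ap = {s: set(T[s]) for s in S}
print(value_iteration(S, Ap, T, C, G))           # roughly {'s0': 2.5, 's1': 1.25, 'goal': 0.0}
```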
3. Topological Order Value Iteration
We have studied the sequence of state backups according to an MDP's graphical structure, which is an intrinsic property of an MDP and potentially decides the complexity of solving it [11]. Our first observation is that states and their value functions are causally related. If, in an MDP M, a state s' is a successor of state s after applying action a, then V(s) depends on V(s'). For this reason, we want to back up s' ahead of s. This causal relation is transitive.
Topological Order Value Iteration solves an MDP by using the problem's graphical structure wisely. Given an MDP, TOVI first builds a directed reachability graph Gsr, which has one vertex per state s ∈ S. A directed edge from vertex s1 to s2 exists if there is an action a such that Ta(s2|s1) > 0. TOVI then finds all the strongly connected components of Gsr and the topological order of these components. It then solves every connected component individually, by value iteration, in their topological order. Figure 1 shows the graphical representation of one simple MDP that has 7 states and 12 actions. In the figure, successors of probabilistic actions are connected by an arc. For simplicity, transition probabilities Ta and costs C(s, a) are omitted. Using TOVI, we can divide the MDP into two connected components, C1 and C2. Based on the remaining actions, C1 and C2 can be subdivided into three and two smaller components, respectively. By decomposing an MDP into smaller components, TOVI can converge much faster than VI.
We use Kosaraju's algorithm to detect the topological order of the strongly connected components of a directed graph [12]. Note that Bonet and Geffner used Tarjan's algorithm to detect strongly connected components in their solver [5], but they do not use the topological order of these components to systematically back up each component of an MDP. Kosaraju's algorithm is simple to implement and its time complexity is only linear in the number of states, so even when the state space is large, the overhead of ordering the state backup sequence is acceptable. Our experimental results also demonstrate that this overhead is well compensated by the computational gain.
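A minimal sketch of Kosaraju's two-pass algorithm on an ordinary adjacency-list graph is shown below; it returns the strongly connected components already in a topological order of the component graph (source components first). The demo graph, and the function and variable names, are invented for illustration.

```python
from collections import defaultdict

def kosaraju_scc(vertices, edges):
    """Kosaraju's two-pass algorithm. Returns the strongly connected
    components in a topological order of the condensation (source
    components first, sink components last). Linear in |V| + |E|."""
    adj, radj = defaultdict(list), defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)              # transposed graph for the second pass

    # Pass 1: iterative DFS on G, record vertices by increasing finish time.
    visited, order = set(), []
    for root in vertices:
        if root in visited:
            continue
        stack = [(root, iter(adj[root]))]
        visited.add(root)
        while stack:
            node, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:
                order.append(node)     # node finished
                stack.pop()
            elif nxt not in visited:
                visited.add(nxt)
                stack.append((nxt, iter(adj[nxt])))

    # Pass 2: DFS on the transposed graph in decreasing finish time;
    # each tree discovered is one strongly connected component.
    visited, components = set(), []
    for root in reversed(order):
        if root in visited:
            continue
        comp, stack = [], [root]
        visited.add(root)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nxt in radj[node]:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append(nxt)
        components.append(comp)
    return components

# Hypothetical demo graph: {a, b} form one cycle, {c, d} another, goal is a sink.
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d"), ("d", "c"), ("d", "goal")]
print(kosaraju_scc({"a", "b", "c", "d", "goal"}, edges))
```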
Figure 1. A simplified MDP and its set of strongly connected components.
The pseudocode of TOVI is shown in Algorithm 1. We first use Kosaraju's algorithm to find the set of strongly connected components C in the graph Gsr and their topological order. Note that each c ∈ C maps to a set of states in M. We then use value iteration to solve each c. Since there are no cycles among these components, we only need to solve each of them once. A minimal code sketch of this top-level loop is given below, followed by the full pseudocode.
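The following Python sketch mirrors only the top-level TOVI loop: build the reachability graph, decompose it into strongly connected components, and run value iteration on one component at a time, goal-side components first so that each component's successors are already converged. It assumes the kosaraju_scc helper and the dictionary MDP encoding sketched earlier, and it omits the Search/Back-up action-elimination refinement that appears in Algorithm 1.

```python
def tovi(states, Ap, T, C, goals, delta=1e-6, gamma=1.0):
    """Sketch of the basic TOVI loop: decompose the reachability graph into
    strongly connected components, then run value iteration on one component
    at a time, goal-side components first (reverse topological order).
    Assumes kosaraju_scc(vertices, edges) from the earlier sketch."""
    # Reachability graph Gsr: edge s -> s' iff some action reaches s' with prob > 0.
    edges = [(s, s2) for s in states for a in Ap[s]
             for s2 in T[s][a] if T[s][a][s2] > 0]
    components = kosaraju_scc(states, edges)     # source components first

    V = {s: 0.0 for s in states}
    for comp in reversed(components):            # sink (goal-side) components first
        while True:                              # plain value iteration on this component
            error = 0.0
            for s in comp:
                if s in goals or not Ap[s]:
                    continue
                new_v = min(
                    C[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                    for a in Ap[s]
                )
                error = max(error, abs(new_v - V[s]))
                V[s] = new_v
            if error < delta:
                break                            # this component has converged
    return V
```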
Algorithm 1 Topological Order Value Iteration
1:  Input: an MDP <S, A, T, C>

VI(S': a set of states, δ)
2:  while (true)
3:      for each state s ∈ S'
4:          V(s) ← min_{a ∈ Ap(s)} [ C(s, a) + Σ_{s' ∈ S} Ta(s'|s) V(s') ]
5:      if (Bellman error is less than δ)
6:          return

Scc(MDP M)   (Kosaraju's algorithm)
7:  build the graph Gsr
8:  compute the strongly connected components of Gsr, order them by topological order C1, ..., Ck
9:  for c ← 1 to k do
10:     solve component Cc by value iteration VI(Cc, δ)

Search(s)
11: if s ∉ G then mark s as visited
12:     a ← argmin_a Q(s, a)
13:     for every unvisited successor s' of action a do
14:         Search(s')
15:     Back-up(s)

Back-up(s)
16: for each action a do
17:     Q(s, a) ← C(s, a) + Σ_{s' ∈ S} Ta(s'|s) Vl(s')
18:     if Q(s, a) > Vu(s) then eliminate a from Ap(s)
19: Vl(s) ← min_{a ∈ Ap(s)} Q(s, a)
20: Vu(s) ← min_{a ∈ Ap(s)} [ C(s, a) + Σ_{s' ∈ S} Ta(s'|s) Vu(s') ]
4. Experiment
We tested Topological Order Value Iteration and compared its running time against value iteration (VI), LAO* and LRTDP. All the algorithms are coded in C, properly optimized, and run on the same Intel Core 2 Duo E7400 2.80 GHz processor with 4 GB of main memory. The operating system is Linux 2.6.15 and the compiler is gcc 3.3.4.
We use seven MDP test domains for our experiments: Mountain Car, Single Arm Pendulum, Double Arm Pendulum, Wetfloor, and three domains from the 2006 International Planning Competition: Drive, Elevators and TireWorld. The performance of the different algorithms in these test domains is listed in Table 1. All running times are in seconds. BC size denotes the size of the biggest connected component, and "-" means that the algorithm failed to solve the problem within 5 minutes.
Table 1. Running time of different algorithms in various domains.

Test Domain   VI Time   LAO* Time   LRTDP Time   TOVI BC size   TOVI Time
MCar100       1.91      1.23        2.81         7,799          0.68
MCar300       229.70    93.91       117.23       71,751         23.22
MCar700       -         216.01      -            390,191        233.98
SAP100        9.39      1.81        2.58         9,999          2.37
SAP300        -         32.4        -            89,999         44.2
SAP500        -         130.7       -            -              -
WF200         -         11.22       22.08        39,999         10.58
WF400         -         98.97       97.73        159,999        90.78
DAP10         51.45     3.04        1.02         9,454          0.75
DAP20         -         144.12      32.68        150,489        21.95
Drive         -         -           -            75,840         74.70
Ele(p13)      -         227.53      -            1053           58.46
Ele(p15)      -         27.53       -            1053           14.59
TireW(5)      0.14      0.01        1.26         23             2.26
TireW(6)      0.16      0.01        1.44         618,448        48.81
The experimental results show that the TOVI algorithm outperforms the other three algorithms on most domains. It converges quickly because it only updates states along the appropriate backup sequence, avoiding a large number of useless state computations. However, in the Single Arm Pendulum and TireWorld test domains, grouping the state graph and ordering the connected components take a larger share of the overall running time, so TOVI performs worse there than LAO* and LRTDP.
5. Conclusions
We have introduced and analyzed a probabilistic planning MDP solver, Topological Order Value Iteration, that studies the dependence relations among the value functions of the state space and uses these relations to decide the sequence in which states are backed up. The algorithm is based on the idea that different MDPs have different graphical structures, and that the graphical structure of an MDP intrinsically determines the complexity of solving it. We notice that no current solvers detect this information and use it to guide state backups; thus, they solve MDPs of the same problem size but with different graphical structures with almost the same strategies. In this sense, they are not "intelligent". Topological Order Value Iteration is proposed to address this problem. It is guaranteed to find the optimal solution of a Markov decision process by solving its components sequentially.
Topological Order Value Iteration is also a flexible algorithm, which can use initial state information and apply reachability analysis. Our results show that TOVI is extremely useful for MDPs with many connected components. The complexity of TOVI does not grow as quickly as that of other algorithms as the number of layers increases,
which shows that TOVI is very suitable for solving
MDPs with layered structures.
REFERENCES
[1] S. Y. Yan, M. H. Yin, W. X. Gu and X. F. Liu, "Research and Advances on Probabilistic Planning," CAAI Transactions on Intelligent Systems, Vol. 1, 2008, pp. 9-22.
[2] A. Barto, S. Bradke and S. Singh, “Learning to Act using
Real-time Dynamic Programming,” Artificial Intelligence,
Vol. 72, 1995, pp. 81-138.
doi:10.1016/0004-3702(94)00011-O
[3] E. Hansen and S. Zilberstein, "LAO*: A Heuristic Search Algorithm that Finds Solutions with Loops," Artificial Intelligence, Vol. 129, 2001, pp. 35-62. doi:10.1016/S0004-3702(01)00106-0
[4] B. Bonet and H. Geffner, “Labeled RTDP: Improving the
Convergence of Real-time Dynamic Programming,” Pro-
ceedings of 13th ICAPS, 2003, pp. 12-21.
[5] B. Bonet and H. Geffner, “Faster Heuristic Search Algo-
rithms for Planning with Uncertainty and Full Feedback,”
Proceedings of IJCAI-03, 2003, pp. 1233-1238.
[6] C. Guestrin, D. Koller, R. Parr and S. Venkataraman, "Efficient Solution Algorithms for Factored MDPs," Journal of Artificial Intelligence Research, Vol. 19, 2003, pp. 399-468.
[7] Z. Feng and E. Hansen, "Symbolic Heuristic Search for Factored Markov Decision Processes," Proceedings of AAAI-02, 2002, pp. 44-50.
[8] P. Dai, Mausam and D. S. Weld, "Focused Topological Value Iteration," The Nineteenth International Conference on Automated Planning and Scheduling (ICAPS-09), 2009, pp. 82-89.
[9] P. Dai and J. Goldsmith, “Ranking Policies in Discrete
Markov Decision Processes,” Annals of Mathematics and
Artificial Intelligence, Vol. 59, 2010, pp. 107-123.
doi:10.1007/s10472-010-9216-8
[10] M. L. Puterman, "Markov Decision Processes: Discrete Stochastic Dynamic Programming," Wiley-Interscience, 2005.
[11] M. Littman, T. Dean and P. Kaelbling, “On the Complex-
ity of Solving Markov Decision Problems,” In Proceed-
ings of UAI-95, 1995, pp. 394-402.
[12] T. H. Cormen, C. E. Leiserson, R. L. Rivest and C. Stein, "Introduction to Algorithms," Second Edition, The MIT Press, 2001.