American Journal of Computational Mathematics, 2011, 1, 183-188
doi:10.4236/ajcm.2011.13021 Published Online September 2011 (http://www.SciRP.org/journal/ajcm)
Conditional Value-at-Risk for Random Immediate
Reward Variables in Markov Decision Processes
Masayuki Kageyama, Takayuki Fujii, Koji Kanefuji, Hiroe Tsubaki
The Institute of Statistical Mathematics, Tokyo, Japan
E-mail: kageyama@ism.ac.jp
Received May 17, 2011; revised August 10, 2011; accepted August 22, 2011
Abstract
We consider risk minimization problems for Markov decision processes. From the standpoint of making the risk of the random reward variable at each time as small as possible, a risk measure is introduced using the conditional value-at-risk of the random immediate reward variables in a Markov decision process. Under this risk measure criterion, the risk-optimal policies are characterized by optimality equations for the discounted and average cases. As an application, inventory models are considered.
Keywords: Markov Decision Processes, Conditional Value-at-Risk, Risk Optimal Policy, Inventory Model
1. Introduction
As a measure of risk for income or loss random variables,
the variance has been commonly considered since Mark-
owitz work [1]. The variance has the shortcoming that it
does not approximately account for the phenomenon of
“fat tail” in distribution functions. In recent years, many
risk measures have been generated and analyzed by an
economically motivated optimization problem, for ex-
ample, value at risk, conditional value-at-risk
[2,3], coherent risk of measure [4-6], convex
risk of measure [7,8] and its applications [9,10].
@VR

@CV R
On the other hand, much research on risk has been carried out by many authors [11-15] in the framework of Markov decision processes (MDPs, for short). In [11,16], risk control for the random total reward in MDPs is discussed. In sequential decision making under uncertain circumstances, it may be better to minimize the total risk over the infinite horizon while controlling the risk at each time. For example, in multiperiod inventory and production problems, we often want to order according to an ordering policy that minimizes the total risk over all periods while also making the risk at each time as small as possible.
In this paper, with the above motivation in mind, we introduce a new risk measure for each policy using the conditional value-at-risk of the random immediate reward variables; the optimization is then carried out under this risk measure criterion in the discounted and average cases, respectively. As an application, an inventory model is considered. In the remainder of this section, we establish the notation that will be used throughout the paper and define the problem with the new risk measure.
A Borel set is a Borel subset of a complete separable metric space. For a Borel set $X$, $\mathcal{B}_X$ denotes the $\sigma$-algebra of Borel subsets of $X$. For Borel sets $X$ and $Y$, let $\mathcal{P}(X)$ and $\mathcal{P}(X \mid Y)$ be the sets of all probability measures on $X$ and of all conditional probability measures on $X$ given $Y$, respectively. The product of $X$ and $Y$ is denoted by $X \times Y$. Let $\mathbb{R}$ be the set of real numbers. Let $I$ be a random income (or reward) variable on some probability space $(\Omega, \mathcal{B}, P)$ and $F_I(x)$ the distribution function of $I$, i.e., $F_I(x) = P(I \le x)$. We define the inverse function $F_I^{-1}(p)$ $(0 \le p \le 1)$ by
$$F_I^{-1}(p) = \inf\{x : F_I(x) \ge p\}.$$
Then, the conditional value-at-risk at a level $\alpha \in (0,1)$ of $I$, $CVaR@\alpha(I)$, is defined (cf. [2,3]) by
$$CVaR@\alpha(I) = -\frac{1}{1-\alpha}\int_0^{1-\alpha} F_I^{-1}(p)\,\mathrm{d}p. \qquad (1)$$
We note that $CVaR@\alpha(I)$ is specified depending only on the law of the random variable $I$. For any Borel set $X$, the set of all bounded and Borel measurable functions on $X$ will be denoted by $B(X)$.
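As a numerical illustration of (1), a minimal Python sketch follows; it integrates the empirical quantile function over $(0, 1-\alpha]$, and the normal income distribution and all parameter values are assumptions chosen only for illustration.

```python
import numpy as np

def cvar(sample, alpha):
    """CVaR@alpha of an income sample, following definition (1):
    -(1/(1-alpha)) * integral_0^{1-alpha} F_I^{-1}(p) dp,
    i.e. minus the average of the worst (1-alpha)-fraction of incomes."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    k = (1.0 - alpha) * n          # probability mass taken from the lower tail
    m = int(k)                     # sample points lying fully inside the tail
    tail = x[:m].sum() + (k - m) * (x[m] if m < n else 0.0)
    return -tail / k

rng = np.random.default_rng(0)
income = rng.normal(loc=1.0, scale=2.0, size=100_000)   # assumed example
print(cvar(income, alpha=0.9))   # minus the mean of the worst 10% of incomes
```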
A Markov decision process is a controlled dynamical system defined by a six-tuple
$$(S, A, \{A(x) : x \in S\}, Q, r, \nu),$$
where the Borel sets $S$ and $A$ are the state and action spaces, respectively; $A(x)$ is a non-empty Borel subset of $A$ which denotes the set of feasible actions when the system is in state $x \in S$; $Q \in \mathcal{P}(S \mid S \times A)$ is the law of motion; $r \in B(S \times A \times S)$ is an immediate reward function; and $\nu \in \mathcal{P}(S)$ is an initial state distribution.
Throughout this paper, we suppose that the set
$$K := \{(x,a) \in S \times A : a \in A(x),\ x \in S\}$$
is in $\mathcal{B}_{S \times A}$. The sample space is the product space $\Omega = (S \times A)^\infty$, such that the projections $X_t, \Delta_t$ on the $t$-th factors $S, A$ describe the state and the action at the $t$-th time of the process $(t \ge 0)$.
Let $\Pi$ denote the set of all policies; i.e., for $\pi = (\pi_0, \pi_1, \ldots) \in \Pi$, $\pi_t \in \mathcal{P}(A \mid (S \times A)^t \times S)$ with $\pi_t(A(x_t) \mid x_0, a_0, \ldots, a_{t-1}, x_t) = 1$ for all $(x_0, a_0, \ldots, a_{t-1}, x_t) \in (S \times A)^t \times S$ $(t \ge 0)$. If there is a Borel measurable function $f : S \to A$ with $f(x) \in A(x)$ for all $x \in S$ such that $\pi_t(\{f(x_t)\} \mid x_0, a_0, \ldots, a_{t-1}, x_t) = 1$ for all $t \ge 0$, the policy $\pi = (\pi_0, \pi_1, \ldots)$ is called stationary. Such a policy will be denoted by $f$. Let $H_0 := X_0$ and $H_t := (X_0, \Delta_0, \ldots, \Delta_{t-1}, X_t)$ $(t \ge 1)$. For any $\pi = (\pi_0, \pi_1, \ldots) \in \Pi$, we assume that
$$\Pr(\Delta_t \in D_1 \mid H_t) = \pi_t(D_1 \mid H_t) \qquad (2)$$
and
$$\Pr(X_{t+1} \in D_2 \mid H_t, \Delta_t = a, X_t = x) = Q(D_2 \mid x, a) \qquad (3)$$
for $x \in S$, $a \in A(x)$, $D_1 \in \mathcal{B}_A$, $D_2 \in \mathcal{B}_S$ and $t \ge 0$. Then, for any $\pi \in \Pi$ and initial state distribution $\nu \in \mathcal{P}(S)$, the probability measure $P_\nu^\pi$ is given on $\Omega$ in an obvious way. If not specified otherwise, $P_\nu^\pi$ is denoted by $P^\pi$, suppressing $\nu$.
We want to minimize the total reward risk while making the risk at each time as small as possible. So, using $CVaR@\alpha$ for the random reward variable $r(X_{t-1}, \Delta_{t-1}, X_t)$ at time $t$, a risk measure $\rho(\pi \mid r)$ for the random reward stream $\{r(X_{t-1}, \Delta_{t-1}, X_t) : t = 1, 2, \ldots\}$ will be defined in the discounted or average case as follows. With some abuse of notation, we denote by $(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1}, \Delta_{t-1})$ the conditional distribution of $r(X_{t-1}, \Delta_{t-1}, X_t)$ given $(H_{t-1}, \Delta_{t-1})$ $(t \ge 1)$. Also, $E^\pi$ is the expectation operator w.r.t. $P^\pi$, and $\beta$ $(0 < \beta < 1)$ is a discount factor.
a) The discounted case:
$$\rho_{DS}(\pi \mid r) := (1-\beta) \sum_{t=1}^{\infty} \beta^{t-1} E^\pi\big[CVaR@\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1}, \Delta_{t-1}\big)\big]. \qquad (4)$$
b) The average case:
$$\rho_{AV}(\pi \mid r) := \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E^\pi\big[CVaR@\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1}, \Delta_{t-1}\big)\big]. \qquad (5)$$
For the family of random reward streams $\{(r(X_{t-1}, \Delta_{t-1}, X_t) : t = 1, 2, \ldots) : r \in B(S \times A \times S)\}$, $\rho_{DS}(\pi \mid \cdot)$ and $\rho_{AV}(\pi \mid \cdot)$ have the same properties as coherent risk measures (cf. [4]), which is shown in the following proposition.
Proposition 1.1. For any $\pi \in \Pi$, $\rho_{DS}(\pi \mid \cdot)$ and $\rho_{AV}(\pi \mid \cdot)$ have the following properties 1) - 4), where $\rho$ stands for either of them:
1) (Monotonicity) If $r_1 \le r_2$ with $r_1, r_2 \in B(S \times A \times S)$, then $\rho(r_1) \ge \rho(r_2)$.
2) (Translation invariance) For $r \in B(S \times A \times S)$ and $c \in \mathbb{R}$, $\rho(r + c) = \rho(r) - c$.
3) (Homogeneity) For $r \in B(S \times A \times S)$ and $\lambda > 0$, $\rho(\lambda r) = \lambda \rho(r)$.
4) (Convexity) For $r_1, r_2 \in B(S \times A \times S)$ and $0 \le \lambda \le 1$, $\rho(\lambda r_1 + (1-\lambda) r_2) \le \lambda \rho(r_1) + (1-\lambda) \rho(r_2)$.
Proof. Notice that, for each $t \ge 1$,
$$\psi_t(r) := CVaR@\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1}, \Delta_{t-1}\big)$$
satisfies the properties 1) - 4) for $r \in B(S \times A \times S)$. For 4) with $\rho = \rho_{AV}(\pi \mid \cdot)$, suppose that $r_1, r_2 \in B(S \times A \times S)$ and $0 \le \lambda \le 1$. Then, we have that
$$\begin{aligned}
\rho_{AV}(\pi \mid \lambda r_1 + (1-\lambda) r_2)
&= \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E^\pi\big[\psi_t(\lambda r_1 + (1-\lambda) r_2)\big] \\
&\le \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E^\pi\big[\lambda \psi_t(r_1) + (1-\lambda) \psi_t(r_2)\big] \quad (\text{from the convexity of } CVaR@\alpha) \\
&\le \lambda \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E^\pi\big[\psi_t(r_1)\big] + (1-\lambda) \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E^\pi\big[\psi_t(r_2)\big] \\
&= \lambda \rho_{AV}(\pi \mid r_1) + (1-\lambda) \rho_{AV}(\pi \mid r_2),
\end{aligned}$$
which implies 4) with $\rho = \rho_{AV}(\pi \mid \cdot)$. The other assertions in Proposition 1.1 are easily proved. This completes the proof.
For $r \in B(S \times A \times S)$ and $(x,a) \in K$, the conditional distribution function $D_r(\cdot \mid x,a)$ is defined by
$$D_r(y \mid x,a) := Q\big(\Gamma(y \mid x,a,r) \mid x,a\big), \qquad (6)$$
where $\Gamma(y \mid x,a,r) := \{z \in S : r(x,a,z) \le y\}$.
Lemma 1.2. For any $\pi \in \Pi$, it holds that
$$\begin{aligned}
&E^\pi\big[CVaR@\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1}, \Delta_{t-1}\big)\big] \\
&\quad= E^\pi\big[CVaR@\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid X_{t-1}, \Delta_{t-1}\big)\big] \\
&\quad= E^\pi\Big[-D_r^{-1}(1-\alpha \mid X_{t-1}, \Delta_{t-1}) + \frac{1}{1-\alpha}\int_S \big(r(X_{t-1}, \Delta_{t-1}, y) - D_r^{-1}(1-\alpha \mid X_{t-1}, \Delta_{t-1})\big)^- \, Q(\mathrm{d}y \mid X_{t-1}, \Delta_{t-1})\Big],
\end{aligned} \qquad (7)$$
where $(x)^- := \max(-x, 0)$ and $D_r^{-1}(p \mid x,a) := \inf\{y : D_r(y \mid x,a) \ge p\}$.
Proof. From the Markov property (3), it follows that
$$P^\pi\big(r(X_{t-1}, \Delta_{t-1}, X_t) \le y \mid H_{t-1}, \Delta_{t-1}\big) = P^\pi\big(r(X_{t-1}, \Delta_{t-1}, X_t) \le y \mid X_{t-1}, \Delta_{t-1}\big).$$
Thus,
$$E^\pi\big[CVaR@\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1}, \Delta_{t-1}\big)\big] = E^\pi\big[CVaR@\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid X_{t-1}, \Delta_{t-1}\big)\big].$$
By the representation formula for $CVaR@\alpha$ (cf. [2,3]), the second equality of (7) holds, which completes the proof.
The value functions of the discounted and average cases are defined respectively by
$$\rho_{DS}(r) := \inf_{\pi \in \Pi} \rho_{DS}(\pi \mid r) \quad \text{and} \quad \rho_{AV}(r) := \inf_{\pi \in \Pi} \rho_{AV}(\pi \mid r). \qquad (8)$$
A policy $\pi^*$ is called discounted or average risk-optimal, respectively, if $\rho_{DS}(r) = \rho_{DS}(\pi^* \mid r)$ or $\rho_{AV}(r) = \rho_{AV}(\pi^* \mid r)$.
2. Risk-Optimization
In this section, using $CVaR@\alpha$ for a random reward variable as in (1), we define a new immediate reward function by which the theory of MDPs becomes readily applicable. Moreover, sufficient conditions are given for the existence of discounted or average risk-optimal policies.
2.1. Another Representation of Risk Measures
In this subsection, another representation of $\rho_{DS}$ and $\rho_{AV}$ is given.
For any $r \in B(S \times A \times S)$, the corresponding immediate reward function $\bar r \in B(S \times A)$ is defined by
$$\bar r(x,a) := -D_r^{-1}(1-\alpha \mid x,a) + \frac{1}{1-\alpha}\int_S \big(r(x,a,y) - D_r^{-1}(1-\alpha \mid x,a)\big)^- \, Q(\mathrm{d}y \mid x,a) \qquad (9)$$
for each $x \in S$ and $a \in A(x)$. Then, we have the following, which shows that the original problem with $r$ is equivalent to a new problem with $\bar r$.
Theorem 2.1. It holds that, for any $\pi \in \Pi$,
1) $\displaystyle \rho_{DS}(\pi \mid r) = (1-\beta)\, E^\pi\Big[\sum_{t=0}^{\infty} \beta^{t}\, \bar r(X_t, \Delta_t)\Big]$,
2) $\displaystyle \rho_{AV}(\pi \mid r) = \limsup_{T \to \infty} \frac{1}{T}\, E^\pi\Big[\sum_{t=0}^{T-1} \bar r(X_t, \Delta_t)\Big]$.
Proof. By Lemma 1.2, it holds that for any $\pi \in \Pi$,
$$E^\pi\big[CVaR@\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1}, \Delta_{t-1}\big)\big] = E^\pi\big[\bar r(X_{t-1}, \Delta_{t-1})\big] \quad (t \ge 1).$$
So, observing the definitions of $\rho_{DS}$ and $\rho_{AV}$ in (4) - (5), 1) and 2) follow, as required.
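For a model with finitely many states and actions, the transformed reward (9) is a one-step computation: locate the quantile $D_r^{-1}(1-\alpha \mid x,a)$ of the one-step reward distribution and average the shortfall below it. The following sketch tabulates $\bar r$; the three-state, two-action model is an assumed example, and by Theorem 2.1 the risk measures (4) - (5) can then be evaluated or optimized by standard MDP techniques applied to $\bar r$.

```python
import numpy as np

def rbar(Q, r, alpha):
    """Transformed reward (9):
    rbar(x,a) = -v + (1/(1-alpha)) * E[(r(x,a,Y) - v)^-],
    with v = D_r^{-1}(1-alpha|x,a), (u)^- = max(-u,0), Y ~ Q(.|x,a)."""
    nS, nA, _ = Q.shape
    out = np.empty((nS, nA))
    for x in range(nS):
        for a in range(nA):
            q, rew = Q[x, a], r[x, a]
            order = np.argsort(rew)            # one-step reward values, sorted
            cum = np.cumsum(q[order])          # D_r along the sorted values
            v = rew[order][np.searchsorted(cum, 1.0 - alpha)]   # quantile
            out[x, a] = -v + (q * np.maximum(v - rew, 0.0)).sum() / (1.0 - alpha)
    return out

# assumed 3-state, 2-action example
rng = np.random.default_rng(1)
Q = rng.dirichlet(np.ones(3), size=(3, 2))   # Q[x,a,y] = Q({y}|x,a)
r = rng.normal(size=(3, 2, 3))               # r[x,a,y] = r(x,a,y)
print(rbar(Q, r, alpha=0.9))
```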
2.2. The Discounted Case
Here, we derive the optimality equation for the discounted case, which characterizes a discounted risk-optimal policy. To this end, we need the following Assumption A.
Assumption A. The following 1) - 4) hold:
1) $A$ is compact and $A(x)$ is closed for each $x \in S$.
2) $r \in B(S \times A \times S)$ and $r(x,a,y)$ is continuous in $(x,a,y) \in S \times A \times S$.
3) $Q\big(\bar\Gamma(y \mid x,a,r) \mid x,a\big) = 0$ for each $(x,a) \in K$ and $y \in \mathbb{R}$, where $\bar\Gamma(y \mid x,a,r) := \{z \in S : r(x,a,z) = y\}$.
4) $Q(\cdot \mid x,a)$ is strongly continuous in $(x,a) \in K$, i.e., for any $v \in B(S)$, $\int_S v(y)\, Q(\mathrm{d}y \mid x,a)$ is continuous in $(x,a) \in K$.
Lemma 2.2. Suppose that Assumption A holds. Then, $D_r^{-1}(\alpha \mid x,a)$, with $D_r$ defined in (6), is continuous in $(x,a) \in K$ for each $\alpha \in (0,1)$.
Proof. Let $\Gamma_0(y \mid x,a,r) := \{z \in S : r(x,a,z) < y\}$. First, we prove that $D_r^{-1}(\alpha \mid x,a)$ is lower semi-continuous in $(x,a) \in K$. To this end, it suffices to show that $D := \{(x,a) \in S \times A : D_r^{-1}(\alpha \mid x,a) \le d\}$ is closed for any $d \in \mathbb{R}$. We observe that $(x,a) \in D$ iff for any $\varepsilon > 0$ there exists $y$ such that $Q(\Gamma(y \mid x,a,r) \mid x,a) \ge \alpha$ and $y \le d + \varepsilon$. Now, let a sequence $\{(x_n, a_n) : n = 1, 2, \ldots\}$ be such that $(x_n, a_n) \in D$ and $(x_n, a_n) \to (x,a)$ as $n \to \infty$ with $(x,a) \in K$. Then, for any $\varepsilon > 0$, there exists a sequence $\{y_n\}$ with
$$Q\big(\Gamma(y_n \mid x_n, a_n, r) \mid x_n, a_n\big) \ge \alpha \quad \text{and} \quad y_n \le d + \varepsilon. \qquad (10)$$
Since $r \in B(S \times A \times S)$, there is no loss of generality in assuming that $y_n \to y$ as $n \to \infty$ for some $y$. Obviously it holds from Assumption A 2) that
$$\limsup_{n \to \infty} \Gamma(y_n \mid x_n, a_n, r) \subset \Gamma(y \mid x,a,r). \qquad (11)$$
We show that
$$\liminf_{n \to \infty} \Gamma(y_n \mid x_n, a_n, r) \supset \Gamma_0(y \mid x,a,r). \qquad (12)$$
For any $z \in \Gamma_0(y \mid x,a,r)$, $r(x,a,z) < y$, so that there exist $\varepsilon_1, \varepsilon_2 > 0$ such that $r(x,a,z) < y - \varepsilon_1 - \varepsilon_2$. Therefore, from Assumption A 2) and the convergence assumptions there exists $N$ for which $z \in \Gamma(y_n \mid x_n, a_n, r)$ for $n \ge N$, which implies (12). Thus, by the general convergence theorem (cf. [17]) together with (11) and (12), we have that
$$\limsup_{n \to \infty} Q\big(\Gamma(y_n \mid x_n, a_n, r) \mid x_n, a_n\big) \le Q\big(\Gamma(y \mid x,a,r) \mid x,a\big)$$
and
$$\liminf_{n \to \infty} Q\big(\Gamma(y_n \mid x_n, a_n, r) \mid x_n, a_n\big) \ge Q\big(\Gamma_0(y \mid x,a,r) \mid x,a\big).$$
By Assumption A 3), it holds that
$$\lim_{n \to \infty} Q\big(\Gamma(y_n \mid x_n, a_n, r) \mid x_n, a_n\big) = Q\big(\Gamma(y \mid x,a,r) \mid x,a\big).$$
Thus, together with (10), we get $Q(\Gamma(y \mid x,a,r) \mid x,a) \ge \alpha$ and $y \le d + \varepsilon$, which shows that $D$ is closed. Similarly, we can prove that $\bar D := \{(x,a) \in S \times A : D_r^{-1}(\alpha \mid x,a) \ge d\}$ is closed for any $d$, i.e., $D_r^{-1}(\alpha \mid x,a)$ is upper semi-continuous in $(x,a) \in K$. This shows that $D_r^{-1}(\alpha \mid x,a)$ is continuous in $(x,a) \in K$, as required.
We are now in a position to state the main theorem for the discounted case.
Theorem 2.3. Suppose that Assumption A holds. Then:
1) The value function $\rho_{DS}(r)$ is given by
$$\rho_{DS}(r) = \int_S h_{DS}(r)(x)\, \nu(\mathrm{d}x), \qquad (13)$$
where $h_{DS}(r) \in B(S)$ is the unique solution to the optimality equation of the discounted case,
$$h_{DS}(r)(x) = \min_{a \in A(x)} \Big\{ (1-\beta)\, \bar r(x,a) + \beta \int_S h_{DS}(r)(y)\, Q(\mathrm{d}y \mid x,a) \Big\} \quad \text{for } x \in S. \qquad (14)$$
2) There exists a measurable function $f : S \to A$ with $f(x) \in A(x)$ for each $x \in S$ such that $f(x)$ attains the minimum in (14), and the stationary policy $f$ is discounted risk-optimal.
Proof. By Lemma 2.2, $D_r^{-1}(1-\alpha \mid x,a)$ is continuous in $(x,a) \in K$. Thus, from the definition (9) of $\bar r(x,a)$ and Assumption A 4), we observe that $\bar r(x,a)$ is continuous in $(x,a) \in K$. Thus, applying the theory of discounted MDPs (cf. Theorem 4.2.3 in [18]), the assertions of Theorem 2.3 follow. This completes the proof.
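For a finite model, (14) can be solved by standard value iteration, since its right-hand side is a contraction with modulus $\beta$ in the supremum norm. A minimal sketch follows (reusing the conventions of the earlier sketch; the usage line at the end is an assumed example), with $\rho_{DS}(r)$ then obtained from (13).

```python
import numpy as np

def solve_discounted(rbar_xa, Q, beta, tol=1e-10):
    """Value iteration for the optimality equation (14):
    h(x) = min_a { (1-beta)*rbar(x,a) + beta*sum_y Q(y|x,a)*h(y) }."""
    h = np.zeros(rbar_xa.shape[0])
    while True:
        rhs = (1.0 - beta) * rbar_xa + beta * (Q @ h)   # shape (nS, nA)
        h_new = rhs.min(axis=1)
        if np.max(np.abs(h_new - h)) < tol:
            return h_new, rhs.argmin(axis=1)   # value h and optimal policy f
        h = h_new

# e.g. h, f = solve_discounted(rbar(Q, r, 0.9), Q, beta=0.95)
# rho_DS(r) = h @ nu for an initial distribution nu, per (13)
```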
2.3. The Average Case
In order to obtain the optimality equation for the average case, we assume that Assumption B below holds, which guarantees the ergodicity of the process.
Assumption B. There exists a number $\gamma \in (0,1)$ such that
$$\sup_{x, x' \in S,\ a \in A(x),\ a' \in A(x')} \big\| Q(\cdot \mid x,a) - Q(\cdot \mid x',a') \big\| \le 2\gamma, \qquad (15)$$
where $\|\cdot\|$ denotes the variation norm for signed measures.
One sufficient condition for Assumption B to hold, easily checked in applications, is the following (cf. [19,20]).
Assumption B'. There exists a measure $\bar\nu$ on $\mathcal{B}_S$ with $\bar\nu(S) > 0$ such that
$$Q(D \mid x,a) \ge \bar\nu(D) \quad \text{for all } D \in \mathcal{B}_S \text{ and } (x,a) \in K. \qquad (16)$$
Theorem 2.4. Suppose that Assumptions A and B hold. Then, there exists $v \in B(S)$ such that
$$\rho_{AV}(r) + v(x) = \min_{a \in A(x)} \Big\{ \bar r(x,a) + \int_S v(y)\, Q(\mathrm{d}y \mid x,a) \Big\} \quad \text{for } x \in S. \qquad (17)$$
Moreover, there is an average risk-optimal stationary policy $f$ such that $f(x) \in A(x)$ minimizes the right-hand side of (17).
Proof. We have already shown that $\bar r(x,a)$ is continuous in $(x,a) \in K$. So, applying the theory of average-reward MDPs (cf. Corollary 3.6 in [19]), Theorem 2.4 follows, as required.
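Under the ergodicity guaranteed by Assumption B, (17) can be solved for a finite model by relative value iteration, whose span seminorm contracts by (15). A minimal sketch follows, with the same conventions as the sketches above.

```python
import numpy as np

def solve_average(rbar_xa, Q, tol=1e-10, max_iter=100_000):
    """Relative value iteration for the average optimality equation (17):
    g + v(x) = min_a { rbar(x,a) + sum_y Q(y|x,a)*v(y) },
    where g approximates the optimal average risk rho_AV(r)."""
    v = np.zeros(rbar_xa.shape[0])
    g = 0.0
    for _ in range(max_iter):
        w = (rbar_xa + Q @ v).min(axis=1)   # one dynamic-programming step
        g, w = w[0], w - w[0]               # split off the gain estimate
        if np.max(np.abs(w - v)) < tol:
            v = w
            break
        v = w
    f = (rbar_xa + Q @ v).argmin(axis=1)    # average risk-optimal policy
    return g, v, f
```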
3. An Application to an Inventory Model
We consider a single-item inventory model with a finite capacity $C < \infty$, in which the demands $\{\xi_t\}_{t=0}^\infty$ in successive periods are i.i.d. with distribution function $\Phi$ on $[0, \infty)$, which has a continuous density $\varphi$ w.r.t. the Lebesgue measure. The state and action spaces are $S = A = [0, C]$, and the set of admissible actions in state $x \in S$ is $A(x) = [0, C - x]$. The state $X_t$ denotes the stock level at the beginning of period $t$, and the action $\Delta_t$ is the quantity ordered (and immediately supplied) at the beginning of period $t$. Putting $Y_t = \min(X_t + \Delta_t, \xi_t)$, the amount sold during period $t$ $(t = 0, 1, 2, \ldots)$, the system equation is given as follows:
$$X_{t+1} = X_t + \Delta_t - Y_t = (X_t + \Delta_t - \xi_t)^+. \qquad (18)$$
The transition probability $Q(\cdot \mid x,a)$, for any Borel subset $B$ of $S$, becomes
$$Q(B \mid x,a) = \int_0^\infty 1_B\big((x + a - y)^+\big)\, \varphi(y)\, \mathrm{d}y. \qquad (19)$$
Also, the immediate reward $r \in B(S \times A \times S)$ is given as
$$r(x,a,y) = p(x + a - y) - ca - h(x + a),$$
where $x + a - y$ is the amount sold during the period, $p > 0$ is the unit sale price, $c > 0$ the unit production cost, and $h > 0$ the unit holding cost. Several lemmas are needed for the risk analysis. Let $Y \ge 0$ be a random variable with the given demand distribution $\Phi$ and, for $x \ge 0$, let $Y \wedge x := \min(Y, x)$.
Lemma 3.1. For $\alpha \in (0,1)$, $CVaR@\alpha(Y \wedge x)$ is given as
$$CVaR@\alpha(Y \wedge x) = \begin{cases} CVaR@\alpha(Y) & \text{if } 1 - \alpha \le \bar p, \\[4pt] \dfrac{\bar p\, CVaR@_{1-\bar p}(Y) - (1 - \alpha - \bar p)\, x}{1-\alpha} & \text{if } 1 - \alpha > \bar p, \end{cases} \qquad (20)$$
where $\bar p = \Phi(x)$.
Proof. Recall that
$$CVaR@\alpha(Y \wedge x) = -\frac{1}{1-\alpha}\int_0^{1-\alpha} F_{Y \wedge x}^{-1}(p)\,\mathrm{d}p.$$
Since $F_{Y \wedge x}^{-1}(p) = F_Y^{-1}(p)$ if $p < \bar p$ and $F_{Y \wedge x}^{-1}(p) = x$ if $p \ge \bar p$, (20) follows obviously.
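The closed form (20) is easy to check numerically. In the sketch below, the exponential demand distribution is an assumed example: $\Phi(u) = 1 - e^{-\lambda u}$ gives $F_Y^{-1}(p) = -\log(1-p)/\lambda$ and $\int_0^q F_Y^{-1}(p)\,\mathrm{d}p = (q + (1-q)\log(1-q))/\lambda$, so both branches of (20) can be compared with a direct Monte Carlo evaluation.

```python
import numpy as np

def cvar_emp(sample, alpha):
    """Empirical CVaR@alpha: minus the mean of the worst (1-alpha)-fraction."""
    x = np.sort(sample)
    return -x[: int(len(x) * (1.0 - alpha))].mean()

def cvar_min_closed(x, alpha, lam):
    """Lemma 3.1, Eq. (20), for exponential demand Y with rate lam."""
    tail = lambda q: (q + (1.0 - q) * np.log(1.0 - q)) / lam  # int_0^q F^{-1}
    pbar = 1.0 - np.exp(-lam * x)                             # pbar = Phi(x)
    if 1.0 - alpha <= pbar:
        return -tail(1.0 - alpha) / (1.0 - alpha)             # = CVaR@alpha(Y)
    return (-tail(pbar) - (1.0 - alpha - pbar) * x) / (1.0 - alpha)

rng = np.random.default_rng(2)
Y = rng.exponential(scale=2.0, size=1_000_000)                # lam = 0.5
for x in (0.1, 1.0, 5.0):                                     # hits both branches
    print(x, cvar_min_closed(x, 0.9, 0.5), cvar_emp(np.minimum(Y, x), 0.9))
```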
In order to pass to the equivalent MDP, we specify the immediate reward $\bar r \in B(S \times A)$ by
$$\begin{aligned}
\bar r(x,a) &= CVaR@\alpha\big(r(X_t, \Delta_t, X_{t+1}) \mid X_t = x, \Delta_t = a\big) \\
&= CVaR@\alpha\big(p \min(x + a, \xi_t) - ca - h(x + a)\big) \\
&= p\, CVaR@\alpha\big(\min(x + a, \xi)\big) + ca + h(x + a) \\
&= L(x + a) + ca,
\end{aligned} \qquad (21)$$
where $L(u) := p\, CVaR@\alpha(\min(u, \xi)) + hu$, $\xi$ is a generic demand variable, and the third equality follows from the translation invariance and homogeneity properties of $CVaR@\alpha$. The function $L$ defined above is proved below to be a convex function.
Lemma 3.2. The following 1) - 2) hold:
1) $\min(a + b, c + d) \ge \min(a, c) + \min(b, d)$.
2) The function $L(u)$ is convex.
Proof. The proof of 1) is easy, so it is omitted. Noting from 1) that
$$\min\big(\lambda u_1 + (1-\lambda) u_2, \xi\big) \ge \lambda \min(u_1, \xi) + (1-\lambda) \min(u_2, \xi)$$
for any $\lambda \in [0,1]$, we have that
$$\begin{aligned}
CVaR@\alpha\big(\min(\lambda u_1 + (1-\lambda) u_2, \xi)\big)
&= CVaR@\alpha\big(\min(\lambda u_1 + (1-\lambda) u_2, \lambda \xi + (1-\lambda)\xi)\big) \\
&\le CVaR@\alpha\big(\lambda \min(u_1, \xi) + (1-\lambda) \min(u_2, \xi)\big) \\
&\le \lambda\, CVaR@\alpha\big(\min(u_1, \xi)\big) + (1-\lambda)\, CVaR@\alpha\big(\min(u_2, \xi)\big). \qquad (22)
\end{aligned}$$
The two inequalities follow from the monotonicity and the convexity of $CVaR@\alpha$, respectively. This means that $L(u)$ is convex.
To apply Theorems 2.3 and 2.4 to the inventory problem, the following assumption is needed.
Assumption C. It holds that
$$\delta := \int_C^\infty \varphi(y)\,\mathrm{d}y > 0.$$
We can now state the main theorem.
Theorem 3.3. Suppose that Assumption C holds. Then, for each of the discounted and average cases, there exists a constant-level stationary policy $f$ which is risk-optimal; that is, the ordered amount $f(x)$ is
$$f(x) = \begin{cases} \bar x - x & \text{if } x < \bar x, \\ 0 & \text{if } x \ge \bar x, \end{cases} \qquad (23)$$
for some $\bar x \in S$, where the critical level $\bar x$ for each case is given from the corresponding optimality Equation (14) or (17).
Proof. First we verify that 1) - 4) of Assumption A are satisfied. A 1) - A 3) are clearly true by definition. For any $v \in B([0,C])$, from (19) it holds that
$$\int_S v(y)\, Q(\mathrm{d}y \mid x,a) = \int_0^\infty v\big((x + a - y)^+\big)\, \varphi(y)\,\mathrm{d}y,$$
which is continuous in $(x,a) \in K = \{(x,a) : a \in [0, C - x],\ x \in [0, C]\}$ by the dominated convergence theorem, so A 4) holds. Next, we set $\bar\nu(D) := \delta\, 1_D(0)$ for $D \in \mathcal{B}_S$; then assertion (16) in Assumption B' holds, since by Assumption C the stock level drops to $0$ with probability at least $\delta$ regardless of $(x,a)$. Thus, Theorems 2.3 and 2.4 are applicable. Since $\bar r(x,a)$ is convex in $a$, using the result of Iglehart [21] (cf. [22]), it follows that the right-hand sides of the corresponding optimality Equations (14) and (17) are convex in $a \in [0, C - x]$. So, it is easily shown that there exists a risk-optimal policy $f$ of the constant-level type (23) for each case. The proof is complete.
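To make Theorem 3.3 concrete, the following sketch discretizes $S = [0, C]$, builds $\bar r(x,a) = L(x+a) + ca$ from (21) by Monte Carlo, solves (14) by value iteration, and reads off the critical level $\bar x$: the computed order-up-to level is flat at $\bar x$ for $x < \bar x$ and equals $x$ thereafter, as in (23). All parameter values and the exponential demand are assumptions chosen only for illustration.

```python
import numpy as np

# assumed illustration parameters
C, p, c, h, alpha, beta = 10.0, 4.0, 1.0, 0.5, 0.9, 0.95
N = 101
grid = np.linspace(0.0, C, N)                     # discretized S = [0, C]
rng = np.random.default_rng(3)
xi = rng.exponential(scale=3.0, size=200_000)     # i.i.d. demand sample

def cvar_min(u):
    """Monte Carlo CVaR@alpha(min(u, xi))."""
    s = np.sort(np.minimum(u, xi))
    return -s[: int(len(s) * (1.0 - alpha))].mean()

L = np.array([p * cvar_min(u) + h * u for u in grid])   # L(u) of Eq. (21)

# rbar[i, j] = L(z_j) + c*(z_j - x_i) for order-up-to level z_j >= x_i
rbar = np.full((N, N), np.inf)                    # +inf marks infeasible actions
for i in range(N):
    rbar[i, i:] = L[i:] + c * (grid[i:] - grid[i])

# transition (19) on the grid: next stock = (level - demand)^+
Qmat = np.empty((N, N))                           # row j: order-up-to level z_j
for j in range(N):
    nxt = np.maximum(grid[j] - xi, 0.0)
    idx = np.rint(nxt / C * (N - 1)).astype(int)
    Qmat[j] = np.bincount(idx, minlength=N) / len(xi)

hval = np.zeros(N)
for _ in range(2000):                             # value iteration for (14)
    rhs = (1.0 - beta) * rbar + beta * (Qmat @ hval)[None, :]
    hval = rhs.min(axis=1)
order_up_to = grid[rhs.argmin(axis=1)]
print(order_up_to[:10])   # constant at x_bar while x < x_bar, then equal to x
```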
4. Acknowledgements
This study was partly supported by “Development of
methodologies for risk trade-off analysis toward opti-
mizing management of chemicals” funded by New En-
ergy and Industrial Technology Development Organiza-
tion (NEDO).
5. References
[1] H. M. Markowitz, "Portfolio Selection: Efficient Diversification of Investments," Wiley, New York, 1959.
[2] R. T. Rockafellar and S. Uryasev, “Optimization of Con-
ditional Value-at-Risk,” Journal of Risk, Vol. 2, No. 3,
2000, pp. 21-42.
[3] R. T. Rockafellar and S. Uryasev, “Conditional Value-at-
Risk for General Loss Distributions,” Journal of Banking
& Finance, Vol. 26, No. 7, 2002, pp. 1443-1471.
doi:10.1016/S0378-4266(02)00271-6
[4] P. Artzner, F. Delbaen, J. M. Eber and D. Heath, "Coherent Measures of Risk," Mathematical Finance, Vol. 9, No. 3, 1999, pp. 203-228. doi:10.1111/1467-9965.00068
[5] A. Inoue, "On the Worst Conditional Expectation," Journal of Mathematical Analysis and Applications, Vol. 286, No. 1, 2003, pp. 237-247.
[6] S. Kusuoka, “On Law Invariant Coherent Risk Meas-
ures,” Advances in Mathematical Economics, Vol. 3,
Springer, Tokyo, 2001, pp. 83-95.
[7] H. Föllmer and I. Penner, “Convex Measures of Risk and
Trading Constraints,” Finance and Stochastics, Vol. 6,
No. 4, 2002, pp. 429-447. doi:10.1007/s007800200072
[8] H. Föllmer and I. Penner, "Convex Risk Measures and the Dynamics of Their Penalty Functions," Statistics & Decisions, Vol. 24, No. 1, 2006, pp. 61-96.
[9] J. Goto and Y. Takano, "Newsvendor Solutions via Conditional Value-at-Risk Minimization," European Journal of Operational Research, Vol. 179, No. 1, 2007, pp. 80-96. doi:10.1016/j.ejor.2006.03.022
[10] A. Takeda, "Generalization Performance of ν-Support Vector Classifier Based on Conditional Value-at-Risk Minimization," Neurocomputing, Vol. 72, 2009, pp. 2351-2358.
[11] B. Kang and J. A. Filar, "Time Consistent Dynamic Risk Measures," Mathematical Methods of Operations Research, Special Issue in Honor of Arie Hordijk, 2005, pp. 1-19.
[12] Y. Ohtsubo and K. Toyonaga, “Optimal Policy for Mini-
mizing Risk Models in Markov Decision Processes,”
Journal of Mathematical Analysis and Applications, Vol.
271, No. 1, 2002, pp. 66-81.
doi:10.1016/S0022-247X(02)00097-5
[13] Y. Ohtsubo, “Optimal Threshold Probability in Dis-
counted Markov Decision Processes with a Target Set,”
Applied Mathematics and Computation, Vol. 149, No. 2,
2004, pp. 519-532. doi:10.1016/S0096-3003(03)00158-9
[14] D. J. White, “Minimising a Threshold Probability in Dis-
counted Markov Decision Processes,” Journal of Mathe-
matical Analysis and Applications, Vol. 173, No. 2, 1993,
pp. 634-646. doi:10.1006/jmaa.1993.1093
[15] C. Wu and Y. Lin, “Minimizing Risk Models in Markov
Decision Processes with Policies Depending on Target
Values,” Journal of Mathematical Analysis and Applica-
tions, Vol. 231, No. 1, 1999, pp. 47-67.
doi:10.1006/jmaa.1998.6203
[16] A. P. Mundt, “Dynamic Risk Management with Markov
Decision Processes,” Universitätsverlag Karlsruhe, Karl-
sruhe, 2007.
[17] H. L. Royden, “Real Analysis,” 2nd Edition, The Mac-
millan Company, New York, 1968.
[18] O. Hernández-Lerma and J. B. Lasserre, “Discrete-Time
Markov Control Processes, Basic Optimality Criteria,”
Springer-Verlag, New York, 1995.
[19] O. Hernández-Lerma, “Adaptive Markov Control Proc-
esses,” Springer-Verlag, New York, 1989.
[20] M. Kurano, “Markov Decision Processes with a Borel
Measurable Cost Function: The Average Case,” Mathe-
matics of Operations Research, Vol. 11, No. 2, 1986, pp.
309-320.
[21] D. L. Iglehart, "Optimality of (s, S) Policies in the Infinite Horizon Dynamic Inventory Problem," Management Science, Vol. 9, No. 2, 1963, pp. 259-267. doi:10.1287/mnsc.9.2.259
[22] S. M. Ross, "Applied Probability Models with Optimization Applications," Holden-Day, San Francisco, 1970.