American Journal of Computational Mathematics, 2011, 1, 183-188
doi:10.4236/ajcm.2011.13021 Published Online September 2011 (http://www.SciRP.org/journal/ajcm)
Copyright © 2011 SciRes. AJCM

Conditional Value-at-Risk for Random Immediate Reward Variables in Markov Decision Processes

Masayuki Kageyama, Takayuki Fujii, Koji Kanefuji, Hiroe Tsubaki
The Institute of Statistical Mathematics, Tokyo, Japan
E-mail: kageyama@ism.ac.jp
Received May 17, 2011; revised August 10, 2011; accepted August 22, 2011

Abstract

We consider risk minimization problems for Markov decision processes. From the standpoint of making the risk of the random reward variable at each time as small as possible, a risk measure is introduced using conditional value-at-risk for random immediate reward variables in Markov decision processes; under this risk measure criterion, the risk-optimal policies are characterized by the optimality equations for the discounted and average cases. As an application, inventory models are considered.

Keywords: Markov Decision Processes, Conditional Value-at-Risk, Risk Optimal Policy, Inventory Model

1. Introduction

As a measure of risk for income or loss random variables, the variance has been commonly considered since Markowitz's work [1]. The variance has the shortcoming that it does not adequately account for the phenomenon of "fat tails" in distribution functions. In recent years, many risk measures have been proposed and analyzed through economically motivated optimization problems, for example value-at-risk ($VaR$), conditional value-at-risk ($CVaR$) [2,3], coherent risk measures [4-6], convex risk measures [7,8] and their applications [9,10]. On the other hand, much research on risk has been carried out by many authors [11-15] in the framework of Markov decision processes (MDPs, for short). In [11,16], risk control for the random total reward in MDPs is discussed.
In sequential decision making under uncertainty, it may be better to minimize the total risk over the infinite horizon while also controlling the risk at each time. For example, in a multiperiod inventory and production problem, we often want an ordering policy that, while minimizing the total risk over all periods, also makes the risk at each time as small as possible. In this paper, with the above motivation in mind, we introduce a new risk measure for each policy using conditional value-at-risk for random immediate reward variables; the optimization under this risk measure criterion is carried out in the discounted and average cases, respectively. As an application, an inventory model is considered.

In the remainder of this section, we establish the notation that will be used throughout the paper and define the problem with the new risk measure. A Borel set is a Borel subset of a complete separable metric space. For a Borel set $X$, $\mathcal{B}_X$ denotes the $\sigma$-algebra of Borel subsets of $X$. For Borel sets $X$ and $Y$, let $\mathcal{P}(X)$ and $\mathcal{P}(X \mid Y)$ be the sets of all probability measures on $X$ and of all conditional probability measures on $X$ given $Y$, respectively. The product of $X$ and $Y$ is denoted by $X \times Y$. Let $\mathbb{R}$ be the set of real numbers.

Let $I$ be a random income (or reward) variable on some probability space $(\Omega, \mathcal{B}, P)$ and let $F_I(x)$ be the distribution function of $I$, i.e., $F_I(x) = P(I \le x)$. We define the inverse function $F_I^{-1}(p)$ $(0 \le p \le 1)$ by $F_I^{-1}(p) = \inf\{x : F_I(x) \ge p\}$. Then the conditional value-at-risk at level $\alpha \in (0,1)$ of $I$, $CVaR_\alpha(I)$, is defined (cf. [2,3]) by

$$CVaR_\alpha(I) = \frac{1}{1-\alpha} \int_\alpha^1 F_I^{-1}(p)\, dp. \qquad (1)$$

We note that $CVaR_\alpha(I)$ depends only on the law of the random variable $I$. For any Borel set $X$, the set of all bounded Borel measurable functions on $X$ will be denoted by $\mathcal{B}(X)$.
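As a concrete illustration of definition (1) (our own sketch, not part of the original paper), the following Python function evaluates $CVaR_\alpha$ exactly for the empirical distribution of a finite sample, by integrating the empirical quantile function $F_I^{-1}$ over $(\alpha, 1]$:

```python
def cvar(samples, alpha):
    """CVaR_alpha of the empirical distribution of `samples`, per (1):
    (1/(1-alpha)) * integral over (alpha, 1] of the empirical quantile F^{-1}(p)."""
    xs = sorted(samples)
    n = len(xs)
    acc = 0.0
    for k, x in enumerate(xs):
        # The empirical quantile satisfies F^{-1}(p) = xs[k] on the interval (k/n, (k+1)/n].
        lo, hi = max(alpha, k / n), (k + 1) / n
        if hi > lo:
            acc += x * (hi - lo)
    return acc / (1 - alpha)

print(cvar([1, 2, 3, 4], 0.5))   # mean of the worst half {3, 4} -> 3.5
```

For $n$ equally likely values and $\alpha = 1 - k/n$, this reduces to the average of the $k$ largest sample values.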
A Markov decision process is a controlled dynamic system defined by a six-tuple $(S, A, \{A(x) : x \in S\}, Q, r, \nu)$, where the Borel sets $S$ and $A$ are the state and action spaces, respectively; $A(x)$ is a non-empty Borel subset of $A$ which denotes the set of feasible actions when the system is in state $x \in S$; $Q \in \mathcal{P}(S \mid S \times A)$ is the law of motion; $r \in \mathcal{B}(S \times A \times S)$ is an immediate reward function; and $\nu \in \mathcal{P}(S)$ is an initial state distribution.

Throughout this paper, we suppose that the set $K := \{(x,a) \in S \times A : a \in A(x) \text{ for } x \in S\}$ is in $\mathcal{B}_{S \times A}$. The sample space is the product space $\Omega = (S \times A)^\infty$, such that the projections $X_t, \Delta_t$ on the $t$-th factors describe the state and the action at the $t$-th time of the process $(t \ge 0)$.

Let $\Pi$ denote the set of all policies, i.e., for $\pi = (\pi_0, \pi_1, \ldots) \in \Pi$, $\pi_t \in \mathcal{P}(A \mid (S \times A)^t \times S)$ with $\pi_t(A(x_t) \mid x_0, a_0, \ldots, x_t) = 1$ for all $(x_0, a_0, \ldots, x_t) \in (S \times A)^t \times S$ and $t \ge 0$. If there is a Borel measurable function $f : S \to A$ with $f(x) \in A(x)$ for all $x \in S$ such that $\pi_t(\{f(x_t)\} \mid x_0, a_0, \ldots, x_t) = 1$ for all $t \ge 0$, the policy is called stationary; such a policy will be denoted simply by $f$.

Let $H_t = (X_0, \Delta_0, \ldots, X_t)$ $(t \ge 0)$. For any $\pi \in \Pi$, we assume that

$$P(\Delta_t \in D_1 \mid H_t) = \pi_t(D_1 \mid H_t) \qquad (2)$$

and

$$P(X_{t+1} \in D_2 \mid H_t, \Delta_t = a, X_t = x) = Q(D_2 \mid x, a) \qquad (3)$$

for $D_1 \in \mathcal{B}_A$, $D_2 \in \mathcal{B}_S$, $x \in S$, $a \in A(x)$ and $t \ge 0$. Then, for any $\pi \in \Pi$ and initial state distribution $\nu \in \mathcal{P}(S)$, the probability measure $P_\nu^\pi$ is given on $\Omega$ in an obvious way. If not specified otherwise, $\nu$ is suppressed and $P_\nu^\pi$ is written simply as $P^\pi$.

We want to minimize the total reward risk while making the risk at each time as small as possible. So, using $CVaR_\alpha$ for the random reward variable $r(X_{t-1}, \Delta_{t-1}, X_t)$ at time $t$, a risk measure for the random reward stream $\{r(X_{t-1}, \Delta_{t-1}, X_t) : t = 1, 2, \ldots\}$ will be defined in the discounted and average cases as follows. With some abuse of notation, we denote by $CVaR_\alpha(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1})$ the $CVaR_\alpha$ of the conditional distribution of $r(X_{t-1}, \Delta_{t-1}, X_t)$ given $H_{t-1}$. Also, $E^\pi$ is the expectation operator w.r.t. $P^\pi$, and $\beta$ $(0 < \beta < 1)$ is a discount factor.

a) The discounted case:

$$\rho_{DS}^\pi(r) := (1-\beta)\, E^\pi\Big[\sum_{t=1}^\infty \beta^{t-1}\, CVaR_\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1}\big)\Big]. \qquad (4)$$

b) The average case:
$$\rho_{AV}^\pi(r) := \limsup_{T\to\infty} \frac{1}{T}\, E^\pi\Big[\sum_{t=1}^T CVaR_\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1}\big)\Big]. \qquad (5)$$

For the family of random reward streams $\{r(X_{t-1}, \Delta_{t-1}, X_t) : t = 1, 2, \ldots\}$, $r \in \mathcal{B}(S \times A \times S)$, the measures $\rho_{DS}^\pi$ and $\rho_{AV}^\pi$ have the same properties as coherent risk measures (cf. [4]), as shown in the following proposition.

Proposition 1.1. For any $\pi \in \Pi$, $\rho = \rho_{DS}^\pi$ and $\rho = \rho_{AV}^\pi$ have the following properties 1)-4):

1) (Monotonicity) If $r_1 \le r_2$ with $r_1, r_2 \in \mathcal{B}(S \times A \times S)$, then $\rho(r_1) \le \rho(r_2)$.
2) (Translation invariance) For $c \in \mathbb{R}$ and $r \in \mathcal{B}(S \times A \times S)$, $\rho(r + c) = \rho(r) + c$.
3) (Positive homogeneity) For $\lambda > 0$ and $r \in \mathcal{B}(S \times A \times S)$, $\rho(\lambda r) = \lambda\, \rho(r)$.
4) (Convexity) For $r_1, r_2 \in \mathcal{B}(S \times A \times S)$ and $0 \le \lambda \le 1$, $\rho(\lambda r_1 + (1-\lambda) r_2) \le \lambda\, \rho(r_1) + (1-\lambda)\, \rho(r_2)$.

Proof. Notice that, for each $t$, $\rho_t(r) := CVaR_\alpha(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1})$ satisfies properties 1)-4) for $r \in \mathcal{B}(S \times A \times S)$. For 4) with $\rho = \rho_{AV}^\pi$, suppose that $r_1, r_2 \in \mathcal{B}(S \times A \times S)$ and $0 \le \lambda \le 1$. Then we have

$$\begin{aligned}
\rho_{AV}^\pi(\lambda r_1 + (1-\lambda) r_2)
&= \limsup_{T\to\infty}\frac{1}{T}\, E^\pi\Big[\sum_{t=1}^T \rho_t(\lambda r_1 + (1-\lambda) r_2)\Big] \\
&\le \limsup_{T\to\infty}\frac{1}{T}\, E^\pi\Big[\sum_{t=1}^T \big(\lambda\,\rho_t(r_1) + (1-\lambda)\,\rho_t(r_2)\big)\Big] \\
&\le \lambda \limsup_{T\to\infty}\frac{1}{T}\, E^\pi\Big[\sum_{t=1}^T \rho_t(r_1)\Big] + (1-\lambda)\limsup_{T\to\infty}\frac{1}{T}\, E^\pi\Big[\sum_{t=1}^T \rho_t(r_2)\Big] \\
&= \lambda\,\rho_{AV}^\pi(r_1) + (1-\lambda)\,\rho_{AV}^\pi(r_2),
\end{aligned}$$

where the first inequality follows from the convexity of $CVaR_\alpha$. This implies 4) with $\rho = \rho_{AV}^\pi$; the discounted case is analogous. The other assertions of Proposition 1.1 are easily proved. This completes the proof. □

For $r \in \mathcal{B}(S \times A \times S)$ and $(x,a) \in K$, the conditional distribution function $D_r(\cdot \mid x,a)$ is defined by

$$D_r(y \mid x,a) := Q\big(\Sigma(y \mid x,a,r) \mid x,a\big), \qquad (6)$$

where $\Sigma(y \mid x,a,r) := \{z \in S : r(x,a,z) \le y\}$, and $D_r^{-1}(\alpha \mid x,a) := \inf\{y : D_r(y \mid x,a) \ge \alpha\}$.

Lemma 1.2. For any $\pi \in \Pi$ and $t \ge 1$, it holds that

$$\begin{aligned}
&E^\pi\big[CVaR_\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1}\big)\big] \\
&\quad= E^\pi\big[CVaR_\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid X_{t-1}, \Delta_{t-1}\big)\big] \\
&\quad= E^\pi\Big[D_r^{-1}(\alpha \mid X_{t-1}, \Delta_{t-1}) + \frac{1}{1-\alpha}\int \big(r(X_{t-1}, \Delta_{t-1}, y) - D_r^{-1}(\alpha \mid X_{t-1}, \Delta_{t-1})\big)^+\, Q(dy \mid X_{t-1}, \Delta_{t-1})\Big],
\end{aligned} \qquad (7)$$

where $x^+ := \max(x, 0)$.

Proof. From the Markov property (3), it follows that

$$P^\pi\big(r(X_{t-1}, \Delta_{t-1}, X_t) \le y \mid H_{t-1}\big) = P^\pi\big(r(X_{t-1}, \Delta_{t-1}, X_t) \le y \mid X_{t-1}, \Delta_{t-1}\big).$$

Thus the first equality of (7) holds. By the representation formula for $CVaR_\alpha$ (cf. [2,3]), the second equality of (7) holds, which completes the proof. □

The value functions of the discounted and average cases are defined respectively by

$$\rho_{DS}(r) := \inf_{\pi \in \Pi} \rho_{DS}^\pi(r) \quad \text{and} \quad \rho_{AV}(r) := \inf_{\pi \in \Pi} \rho_{AV}^\pi(r). \qquad (8)$$

A policy $\pi^*$ is called discounted (resp. average) risk-optimal if $\rho_{DS}(r) = \rho_{DS}^{\pi^*}(r)$ (resp. $\rho_{AV}(r) = \rho_{AV}^{\pi^*}(r)$).
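The representation formula for $CVaR_\alpha$ invoked in the proof of Lemma 1.2, namely $CVaR_\alpha(Z) = F_Z^{-1}(\alpha) + \frac{1}{1-\alpha}\, E\big[(Z - F_Z^{-1}(\alpha))^+\big]$, which is also the shape of the integrand in (7), can be checked numerically against the quantile-integral definition (1). The sketch below is ours and treats only finite empirical distributions:

```python
import math

def cvar_quantile(samples, alpha):
    """Definition (1): (1/(1-alpha)) * integral of the empirical quantile over (alpha, 1]."""
    xs = sorted(samples)
    n, acc = len(xs), 0.0
    for k, x in enumerate(xs):
        lo, hi = max(alpha, k / n), (k + 1) / n
        if hi > lo:
            acc += x * (hi - lo)
    return acc / (1 - alpha)

def cvar_representation(samples, alpha):
    """Representation formula: VaR + E[(Z - VaR)^+]/(1-alpha), with VaR = F^{-1}(alpha)."""
    xs = sorted(samples)
    n = len(xs)
    var = xs[max(0, math.ceil(n * alpha) - 1)]  # empirical F^{-1}(alpha)
    return var + sum(max(x - var, 0.0) for x in xs) / (n * (1 - alpha))

data = [0.5, 2.0, 2.0, 3.5, 7.0]
print(cvar_quantile(data, 0.6), cvar_representation(data, 0.6))  # the two values agree
```

Both functions return the same value for every $\alpha \in (0,1)$, which is the content of the second equality in (7) specialized to a discrete distribution.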
2. Risk-Optimization

In this section, using $CVaR_\alpha$ of (1) for the random reward variable, we define a new immediate reward function by which the standard theory of MDPs becomes directly applicable. Moreover, sufficient conditions are given for the existence of discounted or average risk-optimal policies.

2.1. Another Representation of the Risk Measures

In this subsection, another representation of $\rho_{DS}$ and $\rho_{AV}$ is given. For any $r \in \mathcal{B}(S \times A \times S)$, the corresponding immediate reward function $\tilde r \in \mathcal{B}(S \times A)$ is defined by

$$\tilde r(x,a) = D_r^{-1}(\alpha \mid x,a) + \frac{1}{1-\alpha}\int \big(r(x,a,y) - D_r^{-1}(\alpha \mid x,a)\big)^+\, Q(dy \mid x,a) \qquad (9)$$

for each $x \in S$ and $a \in A(x)$. Then we have the following, which shows that the original problem with $r$ is equivalent to the new problem with $\tilde r$.

Theorem 2.1. It holds that, for any $\pi \in \Pi$,

1) $\rho_{DS}^\pi(r) = (1-\beta)\, E^\pi\Big[\sum_{t=0}^\infty \beta^t\, \tilde r(X_t, \Delta_t)\Big]$,

2) $\rho_{AV}^\pi(r) = \limsup_{T\to\infty} \frac{1}{T}\, E^\pi\Big[\sum_{t=0}^{T-1} \tilde r(X_t, \Delta_t)\Big]$.

Proof. By Lemma 1.2, it holds for any $\pi \in \Pi$ that $E^\pi[CVaR_\alpha(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1})] = E^\pi[\tilde r(X_{t-1}, \Delta_{t-1})]$ $(t \ge 1)$. So, observing the definitions of $\rho_{DS}$ and $\rho_{AV}$ in (4)-(5), 1) and 2) follow, as required. □

2.2. The Discounted Case

Here we derive the optimality equation for the discounted case, which characterizes a discounted risk-optimal policy. To this end, we need the following Assumption A.

Assumption A. The following 1)-4) hold:

1) $A$ is compact and $A(x)$ is closed for each $x \in S$.
2) $r \in \mathcal{B}(S \times A \times S)$ is continuous in $(x,a,y) \in S \times A \times S$.
3) $Q(\tilde\Sigma(y \mid x,a,r) \mid x,a) = 0$ for each $(x,a) \in K$ and $y \in \mathbb{R}$, where $\tilde\Sigma(y \mid x,a,r) := \{z \in S : r(x,a,z) = y\}$.
4) $Q(\cdot \mid x,a)$ is strongly continuous in $(x,a) \in K$, i.e., for any $v \in \mathcal{B}(S)$, $\int v(y)\, Q(dy \mid x,a)$ is continuous in $(x,a) \in K$.

Lemma 2.2. Suppose that Assumption A holds. Then $D_r^{-1}(\alpha \mid x,a)$, defined through (6), is continuous in $(x,a) \in K$ for $\alpha \in (0,1)$.

Proof. Let $\Sigma_0(y \mid x,a,r) := \{z \in S : r(x,a,z) < y\}$. First, we prove that $D_r^{-1}(\alpha \mid x,a)$ is lower semi-continuous in $(x,a) \in K$. To this end, it suffices to show that $D := \{(x,a) \in S \times A : D_r^{-1}(\alpha \mid x,a) \le d\}$ is closed for any $d$. We observe that $(x,a) \in D$ iff for any $\varepsilon > 0$ there exists $y$ such that $Q(\Sigma(y \mid x,a,r) \mid x,a) \ge \alpha$ and $y \le d + \varepsilon$. Now, let $\{(x_n, a_n) : n = 1, 2, \ldots\}$ be a sequence such that $(x_n, a_n) \in D$ and $(x_n, a_n) \to (x,a)$ with $(x,a) \in K$. Then, for any $\varepsilon > 0$, there exists a sequence $\{y_n\}$ with
$$Q\big(\Sigma(y_n \mid x_n, a_n, r) \mid x_n, a_n\big) \ge \alpha \quad \text{and} \quad y_n \le d + \varepsilon. \qquad (10)$$

Since $r \in \mathcal{B}(S \times A \times S)$, there is no loss of generality in assuming that $y_n \to y$ as $n \to \infty$ for some $y$. Obviously, it holds from Assumption A 2) that

$$\limsup_{n\to\infty} \Sigma(y_n \mid x_n, a_n, r) \subset \Sigma(y \mid x,a,r). \qquad (11)$$

We show that

$$\liminf_{n\to\infty} \Sigma(y_n \mid x_n, a_n, r) \supset \Sigma_0(y \mid x,a,r). \qquad (12)$$

For any $z \in \Sigma_0(y \mid x,a,r)$, $r(x,a,z) < y$, so that there exist $\varepsilon_1, \varepsilon_2 > 0$ with $r(x,a,z) < y - \varepsilon_1 - \varepsilon_2$. Therefore, from Assumption A 2) and the convergence assumptions, there exists $N$ for which $z \in \Sigma(y_n \mid x_n, a_n, r)$ for $n \ge N$, which implies (12). Thus, by the general convergence theorem (cf. [17]) together with (11) and (12), we have

$$\limsup_{n\to\infty} Q\big(\Sigma(y_n \mid x_n, a_n, r) \mid x_n, a_n\big) \le Q\big(\Sigma(y \mid x,a,r) \mid x,a\big)$$

and

$$\liminf_{n\to\infty} Q\big(\Sigma(y_n \mid x_n, a_n, r) \mid x_n, a_n\big) \ge Q\big(\Sigma_0(y \mid x,a,r) \mid x,a\big).$$

By Assumption A 3), $Q(\Sigma_0(y \mid x,a,r) \mid x,a) = Q(\Sigma(y \mid x,a,r) \mid x,a)$, so that $\lim_{n} Q(\Sigma(y_n \mid x_n, a_n, r) \mid x_n, a_n) = Q(\Sigma(y \mid x,a,r) \mid x,a)$. Thus, together with (10), we get $Q(\Sigma(y \mid x,a,r) \mid x,a) \ge \alpha$ and $y \le d + \varepsilon$, which shows that $D$ is closed. Similarly, we can prove that $D_1 := \{(x,a) \in S \times A : D_r^{-1}(\alpha \mid x,a) \ge d\}$ is closed for any $d$, i.e., $D_r^{-1}(\alpha \mid x,a)$ is upper semi-continuous in $(x,a) \in K$. This shows that $D_r^{-1}(\alpha \mid x,a)$ is continuous in $(x,a) \in K$, as required. □

We are now in a position to state the main theorem for the discounted case.

Theorem 2.3. Suppose that Assumption A holds. Then:

1) The value function $\rho_{DS}$ is given by

$$\rho_{DS}(r) = (1-\beta)\int h_{DS}(r)(x)\, \nu(dx), \qquad (13)$$

where $h_{DS}(r) \in \mathcal{B}(S)$ is the unique solution to the optimality equation of the discounted case,

$$h_{DS}(r)(x) = \min_{a \in A(x)}\Big\{\tilde r(x,a) + \beta\int h_{DS}(r)(y)\, Q(dy \mid x,a)\Big\} \quad \text{for } x \in S. \qquad (14)$$

2) There exists a measurable function $f : S \to A$ with $f(x) \in A(x)$ for each $x \in S$ such that $f(x)$ attains the minimum in (14), and the stationary policy $f$ is discounted risk-optimal.

Proof. By Lemma 2.2, $D_r^{-1}(\alpha \mid x,a)$ is continuous in $(x,a) \in K$. Thus, from the definition (9) of $\tilde r(x,a)$ and Assumption A 4), we observe that $\tilde r(x,a)$ is continuous in $(x,a) \in K$. Thus, applying the theory of discounted MDPs (cf. Theorem 4.2.3 in [18]), the assertions of Theorem 2.3 follow. This completes the proof. □
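When the state and action spaces are finite, the optimality equation (14) can be solved by ordinary value iteration once the transformed reward $\tilde r$ of (9) has been computed; for a finite next-state distribution, $\tilde r(x,a)$ is simply the $CVaR_\alpha$ of $r(x,a,\cdot)$ under $Q(\cdot \mid x,a)$. The following sketch is ours, and the two-state, two-action model with all its numbers is an illustrative assumption, not an example from the paper:

```python
def discrete_cvar(values, probs, alpha):
    """CVaR_alpha of a finite distribution: (1/(1-alpha)) * integral of its quantile over (alpha, 1]."""
    acc, cum = 0.0, 0.0
    for v, p in sorted(zip(values, probs)):
        lo, hi = max(alpha, cum), min(1.0, cum + p)
        if hi > lo:
            acc += v * (hi - lo)
        cum += p
    return acc / (1 - alpha)

def solve_discounted(r, Q, alpha, beta, iters=2000):
    """Value-iterate h(x) = min_a { r~(x,a) + beta * sum_y Q(y|x,a) h(y) }, i.e. equation (14)."""
    nS, nA = len(Q), len(Q[0])
    # Transformed reward (9): CVaR_alpha of r(x,a,.) under Q(.|x,a).
    rt = [[discrete_cvar(r[x][a], Q[x][a], alpha) for a in range(nA)] for x in range(nS)]
    h = [0.0] * nS
    for _ in range(iters):
        h = [min(rt[x][a] + beta * sum(q * hy for q, hy in zip(Q[x][a], h))
                 for a in range(nA)) for x in range(nS)]
    return h

# Illustrative 2-state, 2-action model (all numbers assumed, not from the paper).
Q = [[[0.5, 0.5], [1.0, 0.0]], [[0.2, 0.8], [0.6, 0.4]]]   # Q[x][a][y]
r = [[[1.0, 3.0], [2.0, 2.0]], [[0.0, 4.0], [1.0, 2.0]]]   # r[x][a][y]
h = solve_discounted(r, Q, alpha=0.5, beta=0.9)            # h satisfies (14) up to iteration tolerance
```

The stationary risk-optimal policy of Theorem 2.3 2) is then read off as the minimizing action at each state.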
2.3. The Average Case

In order to obtain the optimality equation for the average case, we assume that Assumption B below holds, which guarantees the ergodicity of the process.

Assumption B. There exists a number $\gamma \in (0,1)$ such that

$$\sup_{x, x' \in S,\ a, a' \in A} \big\| Q(\cdot \mid x,a) - Q(\cdot \mid x', a') \big\| \le 2\gamma, \qquad (15)$$

where $\|\cdot\|$ denotes the variation norm for signed measures.

One sufficient condition for Assumption B, easily checked in applications, is the following (cf. [19,20]).

Assumption B'. There exists a measure $\mu$ on $\mathcal{B}_S$ with $\mu(S) > 0$ such that

$$Q(D \mid x,a) \ge \mu(D) \quad \text{for all } D \in \mathcal{B}_S \text{ and } (x,a) \in K. \qquad (16)$$

Theorem 2.4. Suppose that Assumptions A and B hold. Then there exists $v \in \mathcal{B}(S)$ such that

$$\rho_{AV}(r) + v(x) = \min_{a \in A(x)}\Big\{\tilde r(x,a) + \int v(y)\, Q(dy \mid x,a)\Big\}. \qquad (17)$$

Moreover, there is an average risk-optimal stationary policy $f$ such that $f(x) \in A(x)$ minimizes the right-hand side of (17).

Proof. We have already shown that $\tilde r(x,a)$ is continuous in $(x,a) \in K$. So, applying the theory of average-reward MDPs (cf. Corollary 3.6 in [19]), Theorem 2.4 follows, as required. □

3. An Application to an Inventory Model

We consider a single-item model with finite capacity $C < \infty$, in which the demands $\{\xi_t\}_{t \ge 0}$ in successive periods are i.i.d. with distribution function $\Phi$ on $[0, \infty)$, which has a continuous density $\varphi(x)$ w.r.t. the Lebesgue measure.
Let >0 be a random variable with a given demand distribution and for ,Yx =min .x 0,1 Lemma 3.1. For , is given as @CVR Y @if1 @= , @if1>, 1 p CV Ypp 1 CVR CVR R (20) where =px . Proof. Recall that 11 0d Y CV RFpp 1 @= 1 1 .Y Since 1 = Y F pF p if <pp, = x if ,pp (20) follows obviously. □ In order to the equivalent MDPs, we specify the im- mediate reward rS ASB by , =@ =,= =@ n, =@min, t rxa CV R rXxa CV Rpx acahx a pCV Rx aca hxa Lx a , mi =, x a ca (21) where min, ,Lu pCuhuu @.CV R =@V R and the third equality follows from the monotonicity and homo- geneous property of The function L defined above is proved to be a convex function. Lemma 3.2 The following 1) - 2) hold. 1) min,min,min ,.abcac bd d 2) The function Lu is convex. Proof. The proof of 1) is easy, so omitted. Noting from 1) that 12 1 min ,1min ,1uu u 212 , .uuumin , 0,1 , For any we have that 12 12 12 12 @min,1 =@min1 ,1 @min, 1min, @min, 1@min, CV Ruu CV Ruu CV Ruu CVR uCVR u . (22) The second and the third inequalities follow from the monotonicity and the convexity of , respectively. This means that @CV R Lu is convex. □ To applying Theorems 2.3 and 2.4 to inventory prob- lems, the following is needed. Assumption C. It holds that :=d>0. cyy We can state the main theorem. Theorem 3.3. Suppose that Assumption C holds. Then, for each of discounted or average case, there exists a constant level stationary policy f which is optimal, that is, the ordered amount f x is if < =0if x xxx fx x x (23) for some ,x where the critical level x for each case is given from the corresponding optimality Equa- tions (14) and (17). Proof. First we verify that 1) - 4) of Assumption A are satisfied. A 1) – A 4) are clearly true by definitions. For any 0, ,vCB from (19) it holds that d,= dvyyxavyx ayy Q, which is continuous in ,=,0,0, x aK xaaCxxC , applying the dominated convergence theorem. We set {0} = 1D.D Then, assertion (16) in Assumption B' holds. Thus, Theorems 2.3 and 2.4 are applicable. 
Since $\tilde r(x,a)$ is convex in $a$, using the result of Iglehart [21] (cf. [22]), it follows that the right-hand sides of the corresponding optimality Equations (14) and (17) are convex in $a \in [0, C - x]$. So it is easily shown that there exists a risk-optimal policy $f$ of the constant-level type (23) for each case. The proof is complete. □

4. Acknowledgements

This study was partly supported by "Development of methodologies for risk trade-off analysis toward optimizing management of chemicals," funded by the New Energy and Industrial Technology Development Organization (NEDO).

5. References

[1] H. M. Markowitz, "Portfolio Selection: Efficient Diversification of Investments," Wiley, New York, 1959.
[2] R. T. Rockafellar and S. Uryasev, "Optimization of Conditional Value-at-Risk," Journal of Risk, Vol. 2, No. 3, 2000, pp. 21-42.
[3] R. T. Rockafellar and S. Uryasev, "Conditional Value-at-Risk for General Loss Distributions," Journal of Banking & Finance, Vol. 26, No. 7, 2002, pp. 1443-1471. doi:10.1016/S0378-4266(02)00271-6
[4] P. Artzner, F. Delbaen, J. M. Eber and D. Heath, "Coherent Measures of Risk," Mathematical Finance, Vol. 9, 1999, pp. 203-227. doi:10.1111/1467-9965.00068
[5] A. Inoue, "On the Worst Conditional Expectation," Journal of Mathematical Analysis and Applications, Vol. 286, No. 1, 2003, pp. 237-247.
[6] S. Kusuoka, "On Law Invariant Coherent Risk Measures," Advances in Mathematical Economics, Vol. 3, Springer, Tokyo, 2001, pp. 83-95.
[7] H. Föllmer and A. Schied, "Convex Measures of Risk and Trading Constraints," Finance and Stochastics, Vol. 6, No. 4, 2002, pp. 429-447. doi:10.1007/s007800200072
[8] H. Föllmer and I. Penner, "Convex Risk Measures and the Dynamics of Their Penalty Functions," Statistics & Decisions, Vol. 24, 2006, pp. 61-96.
[9] J. Gotoh and Y. Takano, "Newsvendor Solutions via Conditional Value-at-Risk Minimization," European Journal of Operational Research, Vol. 179, No.
1, 2007, pp. 80-96. doi:10.1016/j.ejor.2006.03.022
[10] A. Takeda, "Generalization Performance of ν-Support Vector Classifier Based on Conditional Value-at-Risk Minimization," Neurocomputing, Vol. 72, 2009, pp. 2351-2358.
[11] B. Kang and J. A. Filar, "Time Consistent Dynamic Risk Measures," Mathematical Methods of Operations Research, Special Issue in Honor of Arie Hordijk, 2005, pp. 1-19.
[12] Y. Ohtsubo and K. Toyonaga, "Optimal Policy for Minimizing Risk Models in Markov Decision Processes," Journal of Mathematical Analysis and Applications, Vol. 271, No. 1, 2002, pp. 66-81. doi:10.1016/S0022-247X(02)00097-5
[13] Y. Ohtsubo, "Optimal Threshold Probability in Discounted Markov Decision Processes with a Target Set," Applied Mathematics and Computation, Vol. 149, No. 2, 2004, pp. 519-532. doi:10.1016/S0096-3003(03)00158-9
[14] D. J. White, "Minimising a Threshold Probability in Discounted Markov Decision Processes," Journal of Mathematical Analysis and Applications, Vol. 173, No. 2, 1993, pp. 634-646. doi:10.1006/jmaa.1993.1093
[15] C. Wu and Y. Lin, "Minimizing Risk Models in Markov Decision Processes with Policies Depending on Target Values," Journal of Mathematical Analysis and Applications, Vol. 231, No. 1, 1999, pp. 47-67. doi:10.1006/jmaa.1998.6203
[16] A. P. Mundt, "Dynamic Risk Management with Markov Decision Processes," Universitätsverlag Karlsruhe, Karlsruhe, 2007.
[17] H. L. Royden, "Real Analysis," 2nd Edition, The Macmillan Company, New York, 1968.
[18] O. Hernández-Lerma and J. B. Lasserre, "Discrete-Time Markov Control Processes: Basic Optimality Criteria," Springer-Verlag, New York, 1995.
[19] O. Hernández-Lerma, "Adaptive Markov Control Processes," Springer-Verlag, New York, 1989.
[20] M. Kurano, "Markov Decision Processes with a Borel Measurable Cost Function: The Average Case," Mathematics of Operations Research, Vol. 11, No. 2, 1986, pp. 309-320.
[21] D. L.
Iglehart, "Optimality of (s, S) Policies in the Infinite Horizon Dynamic Inventory Problem," Management Science, Vol. 9, No. 2, 1963, pp. 259-267. doi:10.1287/mnsc.9.2.259
[22] S. M. Ross, "Applied Probability Models with Optimization Applications," Holden-Day, San Francisco, 1970.