American Journal of Computational Mathematics, 2011, 1, 183-188
doi:10.4236/ajcm.2011.13021 Published Online September 2011 (http://www.SciRP.org/journal/ajcm)
Copyright © 2011 SciRes. AJCM

Conditional Value-at-Risk for Random Immediate Reward Variables in Markov Decision Processes

Masayuki Kageyama, Takayuki Fujii, Koji Kanefuji, Hiroe Tsubaki
The Institute of Statistical Mathematics, Tokyo, Japan
E-mail: kageyama@ism.ac.jp
Received May 17, 2011; revised August 10, 2011; accepted August 22, 2011

Abstract

We consider risk minimization problems for Markov decision processes. With the aim of making the risk of the random reward variable at each time as small as possible, a risk measure is introduced using conditional value-at-risk for random immediate reward variables in Markov decision processes; under this risk measure criterion, risk-optimal policies are characterized by optimality equations for the discounted and average cases. As an application, an inventory model is considered.

Keywords: Markov Decision Processes, Conditional Value-at-Risk, Risk Optimal Policy, Inventory Model

1. Introduction

As a measure of risk for income or loss random variables, the variance has been commonly considered since Markowitz's work [1]. The variance has the shortcoming that it does not adequately account for the phenomenon of "fat tails" in distribution functions. In recent years, many risk measures have been generated and analyzed through economically motivated optimization problems, for example, value-at-risk (V@R), conditional value-at-risk (CV@R) [2,3], coherent risk measures [4-6], convex risk measures [7,8] and their applications [9,10]. On the other hand, a great deal of research on risk has been carried out by many authors [11-15] in the framework of Markov decision processes (MDPs, for short). In [11,16], risk control for the random total reward in MDPs is discussed.
In sequential decision making under uncertainty, it may be desirable to minimize the total risk over the infinite horizon while controlling the risk at each time. For example, in a multiperiod inventory and production problem, we often want to order according to a policy that minimizes the total risk through all the periods while also making the risk at each time as small as possible. In this paper, with the above motivation in mind, we introduce a new risk measure for each policy using conditional value-at-risk for random immediate reward variables; under this risk measure criterion the optimization is carried out, respectively, in the discounted and average cases. As an application, an inventory model is considered. In the remainder of this section, we establish the notation used throughout the paper and define the problem with the new risk measure.

A Borel set is a Borel subset of a complete separable metric space. For a Borel set $X$, $\mathcal{B}(X)$ denotes the $\sigma$-algebra of Borel subsets of $X$. For Borel sets $X$ and $Y$, let $\mathcal{P}(X)$ and $\mathcal{P}(X|Y)$ be the sets of all probability measures on $X$ and all conditional probability measures on $X$ given $Y$, respectively. The product of $X$ and $Y$ is denoted by $X \times Y$. Let $\mathbb{R}$ be the set of real numbers. Let $I$ be a random income (or reward) variable on some probability space $(\Omega, \mathcal{B}, P)$, and let $F_I(x)$ be the distribution function of $I$, i.e., $F_I(x) = P(I \le x)$. We define the inverse function $F_I^{-1}(p)$ $(0 \le p \le 1)$ by

$$F_I^{-1}(p) = \inf\{x \mid F_I(x) \ge p\}.$$

Then the conditional value-at-risk at level $\alpha \in (0,1)$ of $I$, $CV@R_\alpha(I)$, is defined (cf. [2,3]) by

$$CV@R_\alpha(I) = \frac{1}{1-\alpha} \int_\alpha^1 F_I^{-1}(p)\,\mathrm{d}p. \qquad (1)$$

We note that $CV@R_\alpha(I)$ depends only on the law of the random variable $I$.
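Definition (1), together with one standard Rockafellar-Uryasev representation $CV@R_\alpha(I) = \min_c \{\, c + E[(I-c)^+]/(1-\alpha) \,\}$ (cf. [2,3]), can be checked numerically on an empirical sample. The following is a minimal Python sketch; the function names are ours, not from the paper:

```python
def cvar(samples, alpha):
    """Empirical CV@R_alpha: (1/(1-alpha)) * integral over (alpha, 1] of F^{-1}(p) dp,
    with F^{-1} the empirical quantile function of the samples."""
    xs = sorted(samples)
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs):
        # F^{-1}(p) = xs[i] for p in (i/n, (i+1)/n]; keep the part of that interval above alpha
        hi = (i + 1) / n
        if hi > alpha:
            total += (hi - max(alpha, i / n)) * x
    return total / (1.0 - alpha)


def cvar_ru(samples, alpha):
    """Rockafellar-Uryasev form: min over c of c + E[(I - c)^+] / (1 - alpha).
    The objective is piecewise linear and convex in c with kinks at the sample
    points, so it suffices to minimize over the samples themselves."""
    n = len(samples)
    return min(
        c + sum(max(x - c, 0.0) for x in samples) / (n * (1.0 - alpha))
        for c in samples
    )
```

For the sample [1, 2, 3, 4] at level alpha = 0.5, both routines return 3.5, the average of the worst half of the outcomes, illustrating that the quantile-integral form (1) and the minimization form agree.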
For any Borel set $X$, the set of all bounded Borel measurable functions on $X$ is denoted by $B(X)$. A Markov decision process is a controlled dynamic system defined by a six-tuple

$$\big\{\,S,\ A,\ \{A(x) \mid x \in S\},\ Q,\ r,\ \nu\,\big\},$$

where the Borel sets $S$ and $A$ are the state and action spaces, respectively; $A(x)$ is a non-empty Borel subset of $A$ which denotes the set of feasible actions when the system is in state $x \in S$; $Q \in \mathcal{P}(S \mid S \times A)$ is the law of motion; $r \in B(S \times A \times S)$ is an immediate reward function; and $\nu \in \mathcal{P}(S)$ is an initial state distribution. Throughout this paper, we suppose that the set $K := \{(x,a) \in S \times A \mid a \in A(x)\}$ is in $\mathcal{B}(S \times A)$. The sample space is the product space $\Omega = (S \times A)^\infty$, such that the projections $X_t$ and $\Delta_t$ on the $t$-th factors describe the state and the action at the $t$-th time of the process $(t \ge 0)$.

Let $\Pi$ denote the set of all policies, i.e., for $\pi = (\pi_0, \pi_1, \dots) \in \Pi$, $\pi_t \in \mathcal{P}(A \mid (S \times A)^t \times S)$ with $\pi_t(A(x_t) \mid x_0, a_0, \dots, x_t) = 1$ for all $(x_0, a_0, \dots, x_t) \in (S \times A)^t \times S$ $(t \ge 0)$. If there is a Borel measurable function $f : S \to A$ with $f(x) \in A(x)$ for all $x \in S$ such that $\pi_t(\{f(x_t)\} \mid x_0, a_0, \dots, x_t) = 1$ for all $(x_0, a_0, \dots, x_t)$ and $t \ge 0$, the policy is called stationary. Such a policy is denoted simply by $f$. Let $H_t = (X_0, \Delta_0, \dots, \Delta_{t-1}, X_t)$ for $t \ge 1$ and $H_0 = X_0$. For any $\pi \in \Pi$ we assume that

$$\Pr(\Delta_t \in D_1 \mid H_t) = \pi_t(D_1 \mid H_t) \qquad (2)$$

and

$$\Pr(X_{t+1} \in D_2 \mid H_t, \Delta_t = a) = Q(D_2 \mid X_t, a) \qquad (3)$$

for $D_1 \in \mathcal{B}(A)$, $D_2 \in \mathcal{B}(S)$, $a \in A(X_t)$ and $t \ge 0$. Then, for any $\pi \in \Pi$ and initial state distribution $\nu \in \mathcal{P}(S)$, the probability measure $P^\pi_\nu$ on $\Omega$ is given in an obvious way. If not specified otherwise, $P^\pi_\nu$ is written as $P^\pi$, suppressing $\nu$.

We want to minimize the total reward risk while making the risk at each time as small as possible. So, using $CV@R_\alpha$ for the random reward variable $r(X_{t-1}, \Delta_{t-1}, X_t)$ at time $t$, a risk measure $\rho(\pi \mid r)$ for the random reward stream $\{r(X_{t-1}, \Delta_{t-1}, X_t) : t = 1, 2, \dots\}$ will be defined in the discounted and average cases as follows.
With some abuse of notation, we denote by $CV@R_\alpha(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1})$ the conditional value-at-risk of the conditional distribution of $r(X_{t-1}, \Delta_{t-1}, X_t)$ given $H_{t-1}$ $(t \ge 1)$. Also, $E^\pi$ is the expectation operator with respect to $P^\pi$, and $\beta \in (0,1)$ is a discount factor.

a) The discounted case:

$$\rho_{DS}(\pi \mid r) := (1-\beta) \sum_{t=1}^{\infty} \beta^{t-1} E^\pi\Big[ CV@R_\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1}\big) \Big]. \qquad (4)$$

b) The average case:

$$\rho_{AV}(\pi \mid r) := \limsup_{T \to \infty} \frac{1}{T} E^\pi\Big[ \sum_{t=1}^{T} CV@R_\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1}\big) \Big]. \qquad (5)$$

For the family of random reward streams $\{r(X_{t-1}, \Delta_{t-1}, X_t) : t = 1, 2, \dots\}$, $r \in B(S \times A \times S)$, $\rho_{DS}(\pi \mid \cdot)$ and $\rho_{AV}(\pi \mid \cdot)$ have properties analogous to those of coherent risk measures (cf. [4-6]), as shown in the following proposition.

Proposition 1.1. For any $\pi \in \Pi$, $\rho = \rho_{DS}$ and $\rho = \rho_{AV}$ satisfy the following 1) - 4):

1) (Monotonicity) If $r_1, r_2 \in B(S \times A \times S)$ with $r_1 \le r_2$, then $\rho(\pi \mid r_1) \le \rho(\pi \mid r_2)$.
2) (Translation invariance) For $r \in B(S \times A \times S)$ and $c \in \mathbb{R}$, $\rho(\pi \mid r + c) = \rho(\pi \mid r) + c$.
3) (Homogeneity) For $r \in B(S \times A \times S)$ and $\lambda > 0$, $\rho(\pi \mid \lambda r) = \lambda \rho(\pi \mid r)$.
4) (Convexity) For $r_1, r_2 \in B(S \times A \times S)$ and $0 \le \lambda \le 1$, $\rho(\pi \mid \lambda r_1 + (1-\lambda) r_2) \le \lambda \rho(\pi \mid r_1) + (1-\lambda) \rho(\pi \mid r_2)$.

Proof. Notice that $r \mapsto CV@R_\alpha(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1})$ satisfies the properties 1) - 4) for each $t \ge 1$. For 4) with $\rho = \rho_{AV}$, suppose that $r_1, r_2 \in B(S \times A \times S)$ and $0 \le \lambda \le 1$. Then we have

$$\begin{aligned}
\rho_{AV}\big(\pi \mid \lambda r_1 + (1-\lambda) r_2\big)
&= \limsup_{T \to \infty} \frac{1}{T} E^\pi\Big[ \sum_{t=1}^{T} CV@R_\alpha\big((\lambda r_1 + (1-\lambda) r_2)(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1}\big) \Big] \\
&\le \limsup_{T \to \infty} \frac{1}{T} E^\pi\Big[ \sum_{t=1}^{T} \Big\{ \lambda\, CV@R_\alpha\big(r_1(\cdot) \mid H_{t-1}\big) + (1-\lambda)\, CV@R_\alpha\big(r_2(\cdot) \mid H_{t-1}\big) \Big\} \Big] \quad \text{(convexity of } CV@R_\alpha\text{)} \\
&\le \lambda \limsup_{T \to \infty} \frac{1}{T} E^\pi\Big[ \sum_{t=1}^{T} CV@R_\alpha\big(r_1(\cdot) \mid H_{t-1}\big) \Big] + (1-\lambda) \limsup_{T \to \infty} \frac{1}{T} E^\pi\Big[ \sum_{t=1}^{T} CV@R_\alpha\big(r_2(\cdot) \mid H_{t-1}\big) \Big] \\
&= \lambda\, \rho_{AV}(\pi \mid r_1) + (1-\lambda)\, \rho_{AV}(\pi \mid r_2),
\end{aligned}$$

which implies 4) with $\rho = \rho_{AV}$. The other assertions of Proposition 1.1 are proved similarly. This completes the proof. □

For $r \in B(S \times A \times S)$ and $(x,a) \in K$, the conditional distribution function $D_r(\cdot \mid x,a)$ is defined by

$$D_r(y \mid x,a) := Q\big(\Sigma(y \mid x,a,r) \mid x,a\big), \qquad (6)$$

where $\Sigma(y \mid x,a,r) := \{z \in S \mid r(x,a,z) \le y\}$.

Lemma 1.2.
For any $\pi \in \Pi$, it holds that

$$\begin{aligned}
E^\pi\Big[ CV@R_\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1}\big) \Big]
&= E^\pi\Big[ CV@R_\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid X_{t-1}, \Delta_{t-1}\big) \Big] \\
&= E^\pi\Big[ D_r^{-1}(\alpha \mid X_{t-1}, \Delta_{t-1}) + \frac{1}{1-\alpha} \int_S \big( r(X_{t-1}, \Delta_{t-1}, y) - D_r^{-1}(\alpha \mid X_{t-1}, \Delta_{t-1}) \big)^+ \, Q(\mathrm{d}y \mid X_{t-1}, \Delta_{t-1}) \Big],
\end{aligned} \qquad (7)$$

where $(x)^+ := \max(x, 0)$ and $D_r^{-1}(\alpha \mid x,a) := \inf\{y \mid D_r(y \mid x,a) \ge \alpha\}$.

Proof. From the Markov property (3), it follows that

$$\Pr\big(r(X_{t-1}, \Delta_{t-1}, X_t) \le y \mid H_{t-1}\big) = \Pr\big(r(X_{t-1}, \Delta_{t-1}, X_t) \le y \mid X_{t-1}, \Delta_{t-1}\big).$$

Thus

$$E^\pi\Big[ CV@R_\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1}\big) \Big] = E^\pi\Big[ CV@R_\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid X_{t-1}, \Delta_{t-1}\big) \Big].$$

By the representation formula for $CV@R_\alpha$ (cf. [2,3]), the second equality of (7) holds, which completes the proof. □

The value functions of the discounted and average cases are defined respectively by

$$\rho_{DS}(r) := \inf_{\pi \in \Pi} \rho_{DS}(\pi \mid r) \quad \text{and} \quad \rho_{AV}(r) := \inf_{\pi \in \Pi} \rho_{AV}(\pi \mid r). \qquad (8)$$

A policy $\pi^*$ is called discounted (respectively, average) risk-optimal if $\rho_{DS}(\pi^* \mid r) = \rho_{DS}(r)$ (respectively, $\rho_{AV}(\pi^* \mid r) = \rho_{AV}(r)$).

2. Risk Optimization

In this section, using $CV@R_\alpha$ of the random reward variable as in (1), we define a new immediate reward function by which the standard theory of MDPs becomes readily applicable. Moreover, sufficient conditions are given for the existence of discounted or average risk-optimal policies.

2.1. Another Representation of the Risk Measures

In this subsection, another representation of $\rho_{DS}$ and $\rho_{AV}$ is given. For any $r \in B(S \times A \times S)$, the corresponding immediate reward function $\bar r \in B(S \times A)$ is defined by

$$\bar r(x,a) := D_r^{-1}(\alpha \mid x,a) + \frac{1}{1-\alpha} \int_S \big( r(x,a,y) - D_r^{-1}(\alpha \mid x,a) \big)^+ \, Q(\mathrm{d}y \mid x,a) \qquad (9)$$

for each $x \in S$ and $a \in A(x)$. Then we have the following theorem, which shows that the original problem with $r$ is equivalent to a standard MDP with immediate reward $\bar r$.

Theorem 2.1. For any $\pi \in \Pi$, it holds that:

1) $\rho_{DS}(\pi \mid r) = E^\pi\Big[ (1-\beta) \sum_{t=0}^{\infty} \beta^{t}\, \bar r(X_t, \Delta_t) \Big]$;

2) $\rho_{AV}(\pi \mid r) = \limsup_{T \to \infty} \frac{1}{T} E^\pi\Big[ \sum_{t=0}^{T-1} \bar r(X_t, \Delta_t) \Big]$.

Proof. By Lemma 1.2, it holds for any $\pi \in \Pi$ that

$$E^\pi\Big[ CV@R_\alpha\big(r(X_{t-1}, \Delta_{t-1}, X_t) \mid H_{t-1}\big) \Big] = E^\pi\big[ \bar r(X_{t-1}, \Delta_{t-1}) \big], \quad t \ge 1.$$

Observing the definitions of $\rho_{DS}$ and $\rho_{AV}$ in (4) - (5), 1) and 2) follow, as required. □

2.2. The Discounted Case

Here we derive the optimality equation for the discounted case, which characterizes a discounted risk-optimal policy.
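As a concrete illustration of the reduction in Theorem 2.1: for a chain with finitely many states and actions, one can tabulate $\bar r$ from (9) and then solve the resulting ordinary discounted MDP by value iteration. The following Python sketch does exactly that; the function names, data layout, and the toy chain in the usage note are our illustrative assumptions, not constructions from the paper:

```python
def rbar(r_next, q_next, alpha):
    """Transformed reward (9): CV@R_alpha of r(x,a,.) under Q(.|x,a) on a
    finite next-state set; r_next[y] and q_next[y] are parallel lists."""
    outcomes = sorted(zip(r_next, q_next))      # order outcomes by reward value
    acc, z = 0.0, outcomes[-1][0]
    for val, prob in outcomes:                  # z = D_r^{-1}(alpha | x, a)
        acc += prob
        if acc >= alpha:
            z = val
            break
    tail = sum(p * max(v - z, 0.0) for v, p in outcomes)
    return z + tail / (1.0 - alpha)


def solve_discounted(r, q, alpha, beta, tol=1e-12):
    """Value iteration for v(x) = min_a {(1-beta)*rbar(x,a) + beta*sum_y q[x][a][y]*v(y)},
    the (1-beta)-normalized discounted equation of the transformed problem.
    r[x][a][y] = r(x,a,y), q[x][a][y] = Q(y|x,a)."""
    n = len(r)
    rb = [[rbar(r[x][a], q[x][a], alpha) for a in range(len(r[x]))] for x in range(n)]
    v = [0.0] * n
    while True:
        w = [min((1.0 - beta) * rb[x][a]
                 + beta * sum(q[x][a][y] * v[y] for y in range(n))
                 for a in range(len(rb[x])))
             for x in range(n)]
        if max(abs(a - b) for a, b in zip(v, w)) < tol:
            return w
        v = w
```

For instance, in a state where one action gives reward 0 or 10 with probability 1/2 each, `rbar([0.0, 10.0], [0.5, 0.5], 0.5)` evaluates to 10.0 (the average of the worst half of the outcomes under the level-1/2 CV@R), so a competing deterministic action with reward 4 becomes the minimizer, even though the risky action has the smaller mean.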
To this end, we need the following Assumption A.

Assumption A. The following 1) - 4) hold:

1) $A$ is compact and $A(x)$ is closed for each $x \in S$.
2) $r \in B(S \times A \times S)$ is continuous in $(x,a,y) \in S \times A \times S$.
3) $Q(\{z \in S \mid r(x,a,z) = y\} \mid x,a) = 0$ for each $(x,a) \in K$ and $y \in \mathbb{R}$.
4) $Q(\cdot \mid x,a)$ is strongly continuous in $(x,a) \in K$, i.e., $\int_S v(y)\, Q(\mathrm{d}y \mid x,a)$ is continuous in $(x,a) \in K$ for any $v \in B(S)$.

Lemma 2.2. Suppose that Assumption A holds. Then $D_r^{-1}(\alpha \mid x,a)$, with $D_r$ defined in (6), is continuous in $(x,a) \in K$ for each $\alpha \in (0,1)$.

Proof. Let $\Sigma_0(y \mid x,a,r) := \{z \in S \mid r(x,a,z) < y\}$. First, we prove that $D_r^{-1}(\alpha \mid \cdot)$ is lower semicontinuous on $K$. To this end, it suffices to show that $\mathcal{D}(d) := \{(x,a) \in K \mid D_r^{-1}(\alpha \mid x,a) \le d\}$ is closed for any $d \in \mathbb{R}$. We observe that $(x,a) \in \mathcal{D}(d)$ iff for any $\varepsilon > 0$ there exists $y$ such that $Q(\Sigma(y \mid x,a,r) \mid x,a) \ge \alpha$ and $y \le d + \varepsilon$. Now, let $\{(x_n, a_n) : n = 1, 2, \dots\}$ be a sequence such that $(x_n, a_n) \in \mathcal{D}(d)$ and $(x_n, a_n) \to (x,a) \in K$. Then, for any $\varepsilon > 0$, there exists a sequence $\{y_n\}$ with

$$Q\big(\Sigma(y_n \mid x_n, a_n, r) \mid x_n, a_n\big) \ge \alpha \quad \text{and} \quad y_n \le d + \varepsilon. \qquad (10)$$

Since $r \in B(S \times A \times S)$, there is no loss of generality in assuming that $y_n \to y$ as $n \to \infty$ for some $y \le d + \varepsilon$. It obviously holds from Assumption A 2) that

$$\limsup_{n \to \infty} \Sigma(y_n \mid x_n, a_n, r) \subset \Sigma(y \mid x,a,r). \qquad (11)$$

We show that

$$\liminf_{n \to \infty} \Sigma(y_n \mid x_n, a_n, r) \supset \Sigma_0(y \mid x,a,r). \qquad (12)$$

Indeed, if $r(x,a,z) < y$, the continuity of $r$ and $y_n \to y$ give $r(x_n, a_n, z) < y_n$ for all sufficiently large $n$, i.e., $z \in \Sigma(y_n \mid x_n, a_n, r)$ eventually. By (10) - (12), Assumption A 3) and 4), and a generalized Fatou lemma (cf. [17]), we obtain $Q(\Sigma(y \mid x,a,r) \mid x,a) \ge \alpha$, so that $(x,a) \in \mathcal{D}(d)$ and $\mathcal{D}(d)$ is closed. Upper semicontinuity is proved analogously, which completes the proof. □

By Lemma 2.2 and Assumption A, the transformed reward $\bar r(x,a)$ in (9) is continuous in $(x,a) \in K$. Applying the standard theory of discounted MDPs (cf. [18]) to $\bar r$, we obtain the following.

Theorem 2.3. Suppose that Assumption A holds. Then there exists $v_{DS} \in B(S)$ such that

$$v_{DS}(x) = \min_{a \in A(x)} \Big\{ (1-\beta)\, \bar r(x,a) + \beta \int_S v_{DS}(y)\, Q(\mathrm{d}y \mid x,a) \Big\}. \qquad (14)$$

Moreover, there is a discounted risk-optimal stationary policy $f$ such that $f(x) \in A(x)$ minimizes the right-hand side of (14) for each $x \in S$.

2.3. The Average Case

For the average case, we additionally need the following recurrence condition.

Assumption B. There exist $\delta \in (0,1)$ and $\varphi \in \mathcal{P}(S)$ such that

$$Q(D \mid x,a) \ge \delta\, \varphi(D) \quad \text{for all } (x,a) \in K \text{ and } D \in \mathcal{B}(S). \qquad (16)$$

Theorem 2.4. Suppose that Assumptions A and B hold. Then there exists $v \in B(S)$ such that

$$\rho_{AV}(r) + v(x) = \min_{a \in A(x)} \Big\{ \bar r(x,a) + \int_S v(y)\, Q(\mathrm{d}y \mid x,a) \Big\}. \qquad (17)$$

Moreover, there is an average risk-optimal stationary policy $f$ such that $f(x) \in A(x)$ minimizes the right-hand side of (17).

Proof. We have already obtained that $\bar r(x,a)$ is continuous in $(x,a) \in K$. So, applying the theory of average MDPs (cf. Corollary 3.6 in [20]), Theorem 2.4 follows, as required. □

3.
An Application to an Inventory Model

We consider a single-item inventory model with a finite capacity $C > 0$. The state $x \in S := [0, C]$ is the stock level at the beginning of a period, the action $a \in A(x) := [0, C - x]$ is the amount ordered, and the demands $\xi_1, \xi_2, \dots$ are independent and identically distributed nonnegative random variables with a known distribution $\Phi$. The stock level evolves as $X_{t+1} = (X_t + \Delta_t - \xi_{t+1})^+$, so that the law of motion is

$$Q(B \mid x,a) = P\big( (x + a - \xi)^+ \in B \big), \quad B \in \mathcal{B}(S). \qquad (19)$$

Also, the immediate reward is given as

$$r(x,a,y) = p\,(x + a - y) - c\,a - h\,y,$$

where $p > 0$ is the unit sale price, $c > 0$ the unit production cost and $h > 0$ the unit holding cost; note that $x + a - y = \min(x + a, \xi)$ is the amount sold when the next state is $y = (x + a - \xi)^+$.

Several lemmas are needed for the risk analysis. Let $Y \ge 0$ be a random variable with the given demand distribution $\Phi$ and, for $x \ge 0$, put $Z_x := \min(Y, x)$.

Lemma 3.1. For $\alpha \in (0,1)$, $CV@R_\alpha(Z_x)$ is given as

$$CV@R_\alpha(Z_x) = \begin{cases} x, & \text{if } \bar p \le \alpha, \\[6pt] \dfrac{1}{1-\alpha} \Big[ \displaystyle\int_\alpha^{\bar p} F_Y^{-1}(q)\, \mathrm{d}q + (1 - \bar p)\, x \Big], & \text{if } \bar p > \alpha, \end{cases} \qquad (20)$$

where $\bar p := \Phi(x) = P(Y \le x)$.

Proof. Recall that $CV@R_\alpha(Z_x) = \frac{1}{1-\alpha} \int_\alpha^1 F_{Z_x}^{-1}(q)\, \mathrm{d}q$. Since $F_{Z_x}(z) = F_Y(z)$ for $z < x$ and $F_{Z_x}(z) = 1$ for $z \ge x$, we have $F_{Z_x}^{-1}(q) = \min(F_Y^{-1}(q), x)$. Splitting the integral at $q = \bar p$ yields (20), which completes the proof. □

By Lemma 3.1, the transformed reward $\bar r(x,a)$ in (9) can be computed explicitly from $\Phi$. We also need the following condition on the demand.

Assumption C. $P(\xi \ge C) > 0$.

We can state the main theorem.

Theorem 3.3. Suppose that Assumption C holds. Then, for each of the discounted and average cases, there exists a constant-level stationary policy $f$ which is optimal; that is, the ordered amount $f(x)$ is

$$f(x) = \begin{cases} x^* - x, & \text{if } x < x^*, \\ 0, & \text{if } x \ge x^*, \end{cases} \qquad (23)$$

for some $x^*$, where the critical level $x^*$ for each case is obtained from the corresponding optimality Equations (14) and (17).

Proof. First we verify that 1) - 4) of Assumption A are satisfied. A 1) - A 3) are clearly true by the definitions. For any bounded continuous $v$ on $S$, it holds from (19) that

$$\int_S v(y)\, Q(\mathrm{d}y \mid x,a) = E\Big[ v\big( (x + a - \xi)^+ \big) \Big],$$

which is continuous in $(x,a) \in K = \{(x,a) \mid 0 \le x \le C,\ 0 \le a \le C - x\}$ by the dominated convergence theorem, so A 4) holds. Next, set $\varphi = \delta_{\{0\}}$, the unit mass at $0$, and $\delta = P(\xi \ge C)$, which is positive by Assumption C; since $Q(\{0\} \mid x,a) = P(\xi \ge x + a) \ge P(\xi \ge C)$ for all $(x,a) \in K$, assertion (16) in Assumption B holds. Thus Theorems 2.3 and 2.4 are applicable. Since $\bar r(x,a)$ is convex in $a \in [0, C - x]$, using the result of Iglehart [21] (cf. [22]), it follows that the right-hand sides of the corresponding optimality Equations (14) and (17) are convex in $a$. So it is easily shown that there exists a risk-optimal policy $f$ of the constant-level type (23) for each case. The proof is complete. □

4. Acknowledgements

This study was partly supported by "Development of
methodologies for risk trade-off analysis toward optimizing management of chemicals" funded by the New Energy and Industrial Technology Development Organization (NEDO).

5. References

[1] H. M. Markowitz, "Portfolio Selection: Efficient Diversification of Investments," Wiley, New York, 1959.
[2] R. T. Rockafellar and S. Uryasev, "Optimization of Conditional Value-at-Risk," Journal of Risk, Vol. 2, No. 3, 2000, pp. 21-42.
[3] R. T. Rockafellar and S. Uryasev, "Conditional Value-at-Risk for General Loss Distributions," Journal of Banking & Finance, Vol. 26, No. 7, 2002, pp. 1443-1471. doi:10.1016/S0378-4266(02)00271-6
[4] P. Artzner, F. Delbaen, J. M. Eber and D. Heath, "Coherent Measures of Risk," Mathematical Finance, Vol. 9, No. 3, 1999, pp. 203-228. doi:10.1111/1467-9965.00068
[5] A. Inoue, "On the Worst Conditional Expectation," Journal of Mathematical Analysis and Applications, Vol. 286, No. 1, 2003, pp. 237-247.
[6] S. Kusuoka, "On Law Invariant Coherent Risk Measures," Advances in Mathematical Economics, Vol. 3, Springer, Tokyo, 2001, pp. 83-95.
[7] H. Föllmer and A. Schied, "Convex Measures of Risk and Trading Constraints," Finance and Stochastics, Vol. 6, No. 4, 2002, pp. 429-447. doi:10.1007/s007800200072
[8] H. Föllmer and I. Penner, "Convex Risk Measures and the Dynamics of Their Penalty Functions," Statistics & Decisions, Vol. 24, 2006, pp. 61-96.
[9] J. Gotoh and Y. Takano, "Newsvendor Solutions via Conditional Value-at-Risk Minimization," European Journal of Operational Research, Vol. 179, No. 1, 2007, pp. 80-96. doi:10.1016/j.ejor.2006.03.022
[10] A. Takeda, "Generalization Performance of ν-Support Vector Classifier Based on Conditional Value-at-Risk Minimization," Neurocomputing, Vol. 72, 2009, pp. 2351-2358.
[11] B. Kang and J. A. Filar, "Time Consistent Dynamic Risk Measures," Mathematical Methods of Operations Research, Special Issue in Honor of Arie Hordijk, 2005, pp. 1-19.
[12] Y. Ohtsubo and K.
Toyonaga, "Optimal Policy for Minimizing Risk Models in Markov Decision Processes," Journal of Mathematical Analysis and Applications, Vol. 271, No. 1, 2002, pp. 66-81. doi:10.1016/S0022-247X(02)00097-5
[13] Y. Ohtsubo, "Optimal Threshold Probability in Discounted Markov Decision Processes with a Target Set," Applied Mathematics and Computation, Vol. 149, No. 2, 2004, pp. 519-532. doi:10.1016/S0096-3003(03)00158-9
[14] D. J. White, "Minimising a Threshold Probability in Discounted Markov Decision Processes," Journal of Mathematical Analysis and Applications, Vol. 173, No. 2, 1993, pp. 634-646. doi:10.1006/jmaa.1993.1093
[15] C. Wu and Y. Lin, "Minimizing Risk Models in Markov Decision Processes with Policies Depending on Target Values," Journal of Mathematical Analysis and Applications, Vol. 231, No. 1, 1999, pp. 47-67. doi:10.1006/jmaa.1998.6203
[16] A. P. Mundt, "Dynamic Risk Management with Markov Decision Processes," Universitätsverlag Karlsruhe, Karlsruhe, 2007.
[17] H. L. Royden, "Real Analysis," 2nd Edition, The Macmillan Company, New York, 1968.
[18] O. Hernández-Lerma and J. B. Lasserre, "Discrete-Time Markov Control Processes: Basic Optimality Criteria," Springer-Verlag, New York, 1995.
[19] O. Hernández-Lerma, "Adaptive Markov Control Processes," Springer-Verlag, New York, 1989.
[20] M. Kurano, "Markov Decision Processes with a Borel Measurable Cost Function: The Average Case," Mathematics of Operations Research, Vol. 11, No. 2, 1986, pp. 309-320.
[21] D. L. Iglehart, "Optimality of (s, S) Policies in the Infinite Horizon Dynamic Inventory Problem," Management Science, Vol. 9, No. 2, 1963, pp. 259-267. doi:10.1287/mnsc.9.2.259
[22] S. M. Ross, "Applied Probability Models with Optimization Applications," Holden-Day, San Francisco, 1970.