Structural Properties of Optimal Scheduling Policies for Wireless Data Transmission

doi:10.4236/ijcns.2012.510069


Int. J. Communications, Network and System Sciences, 2012, 5, 671-677

http://dx.doi.org/10.4236/ijcns.2012.510069 Published Online October 2012 (http://www.SciRP.org/journal/ijcns)


Nomesh Bolia1, Vidyadhar Kulkarni2

1Department of Mechanical Engineering, Indian Institute of Technology, Delhi, India

2Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, USA

Email: nomesh@mech.iitd.ac.in, vkulkarn@email.unc.edu

Received August 15, 2012; revised September 12, 2012; accepted October 8, 2012

ABSTRACT

We analyze a cell with a fixed number of users in a time-slotted network. In each time period, the base station schedules at most one user for service, based on information about the available data rates and other parameter(s) for all the users in the cell. We consider infinitely backlogged queues, model the system as a Markov Decision Process (MDP), and prove the monotonicity of the optimal policy with respect to the “starvation age” and the available data rate. We consider both the discounted and the long-run average criteria. The proofs of the monotonicity properties serve as good illustrations of how to analyze the structure of optimal solutions of MDPs.

Keywords: MDP; Scheduling; Structural Properties

1. Introduction

We consider a fixed set of mobile data users in a wireless cell served by a single base station and focus on the downlink channel. The base station maintains a separate queue of data for each user. Time is slotted and in each slot (time period in the standard MDP terminology) the base station can transmit data to exactly one user. Let $R_u^n$ be the channel rate of user $u$ ($u = 1, 2, \ldots, N$) during time period $n$ ($n = 1, 2, \ldots$), i.e., the amount of data that can be transmitted to user $u$ during time period $n$ by the base station. We assume that the base station knows at all time periods $n$ the vector $R^n = (R_1^n, R_2^n, \ldots, R_N^n)$. How this information is gathered depends on the system in use. An example of a resource allocation system widely known and used in practice is the CDMA2000 1xEV-DO system [1]. A good description of how this information is generated is also provided in [1]. A good framework for resource allocation and related issues in this (and more general) setting can be found in [2].

There are two objectives to be fulfilled while scheduling the data transfer. The first is to obtain a high data transfer rate. This can be achieved by serving in period $n$ the user $u$ whose channel rate $R_u^n$ is the highest, i.e., by following a myopic policy. However, if we follow the myopic policy, we run the risk of severely starving users whose channel rate is low for a long time. The second objective is to ensure that none of the users is severely starved. These objectives conflict, and any good algorithm tries to achieve a “good” balance between the two. We have proposed MDP based scheduling policies in [3] to achieve this balance. In this paper, for the sake of completeness we first describe the MDP framework (and our heuristic policies) and then analyze the important monotonicity properties of the (MDP-) optimal and our recommended policies.

Literature Survey

This problem of scheduling users for data transmission in a wireless cell has been considered in the literature mostly in the last decade and a half. One of the most widely used algorithms that takes advantage of multiuser diversity (users having different and time-varying rates at which they can be served data) while at the same time being fair to all users is the Proportional Fair Algorithm (PFA) of Tse [4]. When each user always has data waiting to be served at the base station (infinitely backlogged queues), the PFA performs well and makes good use of the multiuser diversity. However, it has been proven to be unstable when data is not always available to be served to each user and there are instead external data arrivals [5]. Most of the algorithms in this setting are not outcomes of any optimization framework. In our earlier publication [3], we take a different approach to this problem and develop the scheduling algorithm as the outcome of a systematic optimization framework, based on MDPs and policy improvement. The resulting policies are easy to implement and are shown in [3] to perform better than existing policies.

However, while our recommended policies, despite being sub-optimal, exhibit better results than existing policies [3], our past work lacks any results about structural properties of either the recommended or the optimal policies. We believe it is important to establish some such properties to gain further insight into the problem. Therefore, in this correspondence we prove some monotonicity properties of the optimal policies and the policies proposed in [3]. We first define and describe these monotonicity properties. Our contributions are twofold: a rigorous analysis of these properties, along with the observation that our recommended policy in [3] is also monotone (thus being in line with the optimal policy at least with respect to some basic properties), and an illustration of the analysis of optimal policies in the MDP framework. These ideas can serve as a good starting point and provide broad guidelines for analyzing structural properties of MDPs.

The rest of the article is organized as follows: Section 2 describes the model, and Section 3 proves monotonicity of the optimal policy in this setting. Section 3.3 extends these results to the long-run average criterion, and Section 3.4 establishes the monotonicity of the index policy of [3]. We conclude the paper with remarks on possible extensions in Section 4.

2. The Model

In this section, for the sake of completeness, we start with a description of the stochastic model [3] for the multidimensional stochastic process $\{R^n, n \geq 1\}$ that represents the channel rates of all users in the cell. Let $X_u^n$ be the channel state of user $u$ at time $n$. This represents various factors such as the position of the user in the cell, the propagation conditions, etc. and determines the channel rate $R_u^n$ of user $u$ as described below. We assume that $\{X_u^n, n \geq 1\}$ is an irreducible Discrete Time Markov chain (DTMC) on state space $\Omega = \{1, 2, \ldots, M\}$ with Transition Probability Matrix (TPM) $P^u = [p^u_{ij}]$. Note that the TPM can in general be different for different users. Further, as indicated in [1], a set of fixed data rates is what is available to users in an actual system. For each $k \in \Omega$, let $r_k$ be the fixed data rate (or channel rate) associated with state $k$ of the DTMC $\{X_u^n, n \geq 1\}$. Thus, when $X_u^n = k$, the user $u$ can receive data from the base station at rate $R_u^n = r_k$ if it is chosen to be served. Thus $\{R_u^n, n \geq 1\}$ is a Markov Chain with state space $\{r_1, r_2, \ldots, r_M\}$, i.e., the components of the vector $r = (r_1, r_2, \ldots, r_M)$ of all fixed data rates. We assume, without loss of generality, that $r_1 < r_2 < \cdots < r_M$. Let $X^n = (X_1^n, \ldots, X_N^n)$ be the state vector of all the users. We assume the users behave independently of each other and that each user has ample data to be served. This setting where each user always has ample data to be served is referred to as the “infinitely backlogged queues setting”. Since each component of $\{X^n, n \geq 1\}$ is an independent DTMC on $\Omega$, it is clear that $\{X^n, n \geq 1\}$ itself is a DTMC on $\Omega^N$.
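As a small illustration of the last statement, the sketch below builds the TPM of the joint chain from the per-user TPMs, assuming (as in the model) that the users evolve independently, so that each joint transition probability is the product of the per-user ones. The function name `joint_tpm`, the 0-indexed channel states and the two example matrices are hypothetical choices made for this illustration only.

```python
from itertools import product

def joint_tpm(tpms):
    """Joint TPM of independent per-user channel chains.

    tpms[u][a][b] is user u's probability of moving from channel state a to b.
    Returns (states, P): states lists the joint states as tuples, and P[x][y]
    is the product of the per-user transition probabilities.
    """
    sizes = [len(p) for p in tpms]
    states = list(product(*[range(m) for m in sizes]))
    P = []
    for i in states:
        row = []
        for j in states:
            prob = 1.0
            for u, p in enumerate(tpms):
                prob *= p[i[u]][j[u]]
            row.append(prob)
        P.append(row)
    return states, P

# Hypothetical example: two users, two channel states each (0-indexed).
P1 = [[0.9, 0.1], [0.2, 0.8]]
P2 = [[0.5, 0.5], [0.3, 0.7]]
states, P = joint_tpm([P1, P2])
```

Each row of `P` sums to one, which is a quick sanity check that the product construction yields a valid TPM on the joint state space.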

Let $Y_u^n$ be the “starvation age” (or simply “age”) of the user $u$ at time $n$, defined as the time elapsed (in number of periods) since the user was served most recently. Thus, the age of the user is zero at time $n+1$ if it is served in the $n$th time period. Furthermore, for $m \geq 1$, if the user was served in time period $n$ and it is not served for the next $m$ time periods, its age at time $n+m+1$ is $m$. Let $Y^n = (Y_1^n, \ldots, Y_N^n)$ be the age vector (vector of ages of all users) at time $n$. The base station serves exactly one user in each time period. Let $v(n)$ be the user served in the $n$th time period. The age process evolves according to

$$ Y_u^{n+1} = \begin{cases} Y_u^n + 1 & \text{if } u \neq v(n), \\ 0 & \text{if } u = v(n). \end{cases} \qquad (1) $$
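The age recursion in (1) can be traced with a few lines of code. The following is a minimal sketch in which users are 0-indexed and the service sequence is a hypothetical example; `update_ages` is a name introduced here for illustration.

```python
def update_ages(ages, served):
    """Apply Equation (1): reset the served user's age, increment the rest."""
    return tuple(0 if u == served else y + 1 for u, y in enumerate(ages))

ages = (0, 0, 0)
history = []
for served in [0, 0, 2, 1]:   # hypothetical service sequence (0-indexed users)
    ages = update_ages(ages, served)
    history.append(ages)
```

After these four periods the age vector passes through (0, 1, 1), (0, 2, 2), (1, 3, 0) and (2, 0, 1): a user that is repeatedly passed over accumulates age until it is served.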

The “state of the system” at time $n$ is given by $(X^n, Y^n) \in \Omega^N \times Z^N$, where $Z = \{0, 1, 2, \ldots\}$. The “state” is thus a vector of $2N$ components and we assume that it is known at the base station in each time period. After observing $(X^n, Y^n)$ the base station decides to serve one of the $N$ users in the time period $n$. We need a reward structure in order to make this decision optimally. We describe such a structure below. If we serve user $u$ in the $n$th time period, we earn a reward equal to $R_u^n$ for this user and none for the others. In addition, there is a cost of $D_l(y)$ if user $l$ of age $y$ is not served in period $n$. This cost corresponds to the penalty incurred due to “starvation” of the user(s) not served in a given time period. Clearly, we can assume $D_l(0) = 0$ since there is no starvation at age zero. Thus the net reward of serving user $u$ at time $n$ is

$$ R_u^n - \sum_{l \neq u} D_l(Y_l^n). \qquad (2) $$

We assume that there is no cost in switching from one user to another from period to period. This is not entirely true in practice, but including switching costs in the model will make the analysis intractable. For convenience we use the notation $W_u(Y^n) = \sum_{l \neq u} D_l(Y_l^n)$. The problem of scheduling a user in a given time period can now be formulated as a Markov Decision Process (MDP). The decision epochs are $n = 1, 2, \ldots$. The state at time $n$ is $(X^n, Y^n)$ with Markovian evolution as described above. The action space in every state is $A = \{1, 2, \ldots, N\}$, where action $u$ corresponds to serving the user $u$. The reward in state $(X^n, Y^n)$ corresponding to action $u$ is $R_u^n - W_u(Y^n)$. For the sake of notational convenience, for an age vector $t = (t_1, \ldots, t_N)$ let $t^u$ and $W_u(t)$ be defined as follows:

$$ t^u = (t_1 + 1, \ldots, t_{u-1} + 1, 0, t_{u+1} + 1, \ldots, t_N + 1) \qquad (3) $$

and,

$$ W_u(t) = \sum_{l \neq u} D_l(t_l). \qquad (4) $$
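To make the reward structure concrete, the sketch below evaluates $W_u(t)$ of Equation (4) and the resulting one-period net reward. The linear penalties $D_l(y) = c_l y$, the rates and the age vector are hypothetical values chosen only for illustration; the paper leaves $D_l$ general apart from $D_l(0) = 0$. Users are 0-indexed in the code.

```python
def starvation_cost(t, u, D):
    """W_u(t) of Equation (4): total penalty of all users other than u."""
    return sum(D[l](t[l]) for l in range(len(t)) if l != u)

def net_reward(rates, t, u, D):
    """One-period net reward of serving user u: its rate minus W_u(t)."""
    return rates[u] - starvation_cost(t, u, D)

# Hypothetical linear penalties D_l(y) = c_l * y, so that D_l(0) = 0.
D = [lambda y, c=c: c * y for c in (1.0, 2.0, 0.5)]
t = (3, 0, 4)              # current ages of the three users
rates = (5.0, 2.0, 1.0)    # current channel rates of the three users
rewards = [net_reward(rates, t, u, D) for u in range(3)]
```

With these numbers the net rewards of serving users 0, 1 and 2 are 3.0, -3.0 and -2.0 respectively, which shows how the penalty term trades off against the raw rate.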

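Let $\alpha$ be the discounting rate for the MDP [6]. Then, the standard Bellman equation for the discounted reward model is

$$ V(i,t) = \max_{u = 1, 2, \ldots, N} \left\{ r_{i_u} - W_u(t) + \alpha h_u(i,t) \right\}, \qquad (5) $$

where

$$ h_u(i,t) = \sum_{j \in \Omega^N} p_{ij} V(j, t^u). \qquad (6) $$

Here $i = (i_1, \ldots, i_N)$ is the vector of channel states, $r_{i_u}$ is the data rate of user $u$ in channel state $i_u$, and $p_{ij}$ is the transition probability of the joint chain $\{X^n, n \geq 1\}$ from $i$ to $j$.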

Let $\mathrm{dec}(i,t)$ be the optimal decision made (i.e., the user served) in state $(i,t)$. Then,

$$ \mathrm{dec}(i,t) = \arg\max_{u = 1, 2, \ldots, N} \left\{ r_{i_u} - W_u(t) + \alpha h_u(i,t) \right\}. $$

Further, let

$$ \mathrm{dec}_k(i,t) = \arg\max_{u = 1, 2, \ldots, N} \left\{ r_{i_u} - W_u(t) + \alpha h_u^k(i,t) \right\} \qquad (7) $$

be the optimal decision at the $k$th step of the value iteration scheme given by (8).

We use the following notation: For any real valued function $f(i,t)$ defined on $\Omega^N \times Z^N$, $f(i,t) \downarrow t$ denotes that $f$ decreases in every component of $t$.

3. Monotonicity of Optimal Policy

Although solving Equation (5) to optimality is infeasible, we can derive some important characteristics of the optimal policy. In this section, we consider two monotonicity properties of the optimal policy. We first consider monotonicity in age.

3.1. Monotonicity in Age

The intuition behind monotonicity is as follows. The penalty accrued for each user in a given time period is an increasing function of its current age. Hence we expect the propensity of the optimal policy serving any given user to increase with its age, i.e., if the optimal policy serves a user $u$ in the state $(i,t)$, it will serve user $u$ in state $(i, t + e_u)$ as well, where $e_u$ denotes an $N$-dimensional vector with the $u$th component 1 and all other components 0.

Theorem 3.2 states and proves this monotonicity property of the optimal policy for discounted reward. Then we show that standard MDP theory [6] implies the result holds in the case of long-run average reward as well.

We will need the following result to prove Theorem 3.2.

Theorem 3.1 $V(i,t) \downarrow t$.

Proof. The standard value iteration equations of (5) are given by

$$ V_{k+1}(i,t) = \max_{u = 1, 2, \ldots, N} \left\{ r_{i_u} - W_u(t) + \alpha h_u^k(i,t) \right\}, \qquad (8) $$

where $\alpha$ is the discount factor and

$$ h_u^k(i,t) = \sum_{j \in \Omega^N} p_{ij} V_k(j, t^u). \qquad (9) $$

We have $V_0(i,t) = 0$ and

$$ V(i,t) = \lim_{k \to \infty} V_k(i,t), \quad (i,t) \in \Omega^N \times Z^N. \qquad (10) $$

We will prove $V_k(i,t) \downarrow t$ using induction on $k$. Then the theorem follows from the above equation. Note that $V_k(i,t) \downarrow t$ holds at $k = 0$ since $V_0(i,t) = 0$. Assume $V_k(i,t) \downarrow t$ for some $k \geq 0$. We prove $V_{k+1}(i,t) \downarrow t$. It is enough to prove that

$$ V_{k+1}(i,t) - V_{k+1}(i,t+e_1) \geq 0, \qquad (11) $$

since the proof for all components other than 1 follows similarly. Note that $V_k \downarrow t \Rightarrow h_u^k \downarrow t$. We consider four cases:



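Case 1: $\mathrm{dec}_k(i,t) = 1$ and $\mathrm{dec}_k(i,t+e_1) = 1$. From (8),

$$ V_{k+1}(i,t) - V_{k+1}(i,t+e_1) = r_{i_1} - W_1(t) + \alpha h_1^k(i,t) - \left[ r_{i_1} - W_1(t+e_1) + \alpha h_1^k(i,t+e_1) \right] = 0, \qquad (12) $$

since $W_v(t+e_v) = W_v(t)$ and $(t+e_v)^v = t^v$ for every user $v$, using Equations (3) and (4).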



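Case 2: $\mathrm{dec}_k(i,t) = 1$ and $\mathrm{dec}_k(i,t+e_1) = u \neq 1$. From (8), and using $W_u(t+e_1) \geq W_u(t)$ and $(t+e_1)^u = t^u + e_1$, we have

$$ V_{k+1}(i,t) - V_{k+1}(i,t+e_1) = r_{i_1} - W_1(t) + \alpha h_1^k(i,t) - \left[ r_{i_u} - W_u(t+e_1) + \alpha h_u^k(i,t+e_1) \right] $$

$$ \geq r_{i_1} - r_{i_u} - W_1(t) + W_u(t) + \alpha \left[ h_1^k(i,t) - h_u^k(i,t+e_1) \right] $$

$$ \geq r_{i_1} - r_{i_u} - W_1(t) + W_u(t) + \alpha \left[ h_1^k(i,t) - h_u^k(i,t) \right] \geq 0. \qquad (13) $$

The second inequality holds because $h_u^k \downarrow t$ and the last inequality holds because $\mathrm{dec}_k(i,t) = 1$.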

Case 3: $\mathrm{dec}_k(i,t) = u \neq 1$ and $\mathrm{dec}_k(i,t+e_1) = 1$. From (8),

$$ V_{k+1}(i,t) - V_{k+1}(i,t+e_1) = r_{i_u} - W_u(t) + \alpha h_u^k(i,t) - \left[ r_{i_1} - W_1(t+e_1) + \alpha h_1^k(i,t+e_1) \right] $$

$$ = r_{i_u} - r_{i_1} - W_u(t) + W_1(t) + \alpha \left[ h_u^k(i,t) - h_1^k(i,t) \right] \geq 0, \qquad (14) $$

using $W_1(t+e_1) = W_1(t)$ and $(t+e_1)^1 = t^1$ for the equality and $\mathrm{dec}_k(i,t) = u$ for the inequality, by arguments similar to those in Cases 1 and 2.

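Case 4: $\mathrm{dec}_k(i,t) = u \neq 1$ and $\mathrm{dec}_k(i,t+e_1) = v \neq 1$. From (8), and using $W_v(t+e_1) \geq W_v(t)$ and $(t+e_1)^v = t^v + e_1$, we have

$$ V_{k+1}(i,t) - V_{k+1}(i,t+e_1) = r_{i_u} - W_u(t) + \alpha h_u^k(i,t) - \left[ r_{i_v} - W_v(t+e_1) + \alpha h_v^k(i,t+e_1) \right] $$

$$ \geq r_{i_u} - r_{i_v} - W_u(t) + W_v(t) + \alpha \left[ h_u^k(i,t) - h_v^k(i,t+e_1) \right] $$

$$ \geq r_{i_u} - r_{i_v} - W_u(t) + W_v(t) + \alpha \left[ h_u^k(i,t) - h_v^k(i,t) \right] \geq 0. \qquad (15) $$

The last inequality holds because $\mathrm{dec}_k(i,t) = u$.

Clearly, Cases 1 - 4 are exhaustive and thus Equations (12) through (15) prove that $V_{k+1}(i,t) \downarrow t$, thus completing our induction argument. Hence $V_k(i,t) \downarrow t$ for all $k$. This completes the proof.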

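Now we move on to the main theorem of this section that says that the decision to serve a user in any time period is monotone in age.

Theorem 3.2 $\mathrm{dec}(i,t) = v \Rightarrow \mathrm{dec}(i,t+e_v) = v$.

Proof. Since $\mathrm{dec}(i,t) = v$ we have,

$$ r_{i_v} - W_v(t) + \alpha h_v(i,t) \geq r_{i_u} - W_u(t) + \alpha h_u(i,t), \quad \forall u \in A. \qquad (16) $$

To prove $\mathrm{dec}(i,t+e_v) = v$, we need to prove

$$ r_{i_v} - W_v(t+e_v) + \alpha h_v(i,t+e_v) \geq r_{i_u} - W_u(t+e_v) + \alpha h_u(i,t+e_v), \quad \forall u \in A, \qquad (17) $$

which follows from (16) (and $(t+e_v)^v = t^v$), and the results that $W_v(t+e_v) = W_v(t)$ and $W_u(t+e_v) \geq W_u(t)$ [using Equation (4)] and $h_u(i,t+e_v) \leq h_u(i,t)$ [using Theorem 3.1 and Equation (6)].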

This theorem implies that if it is optimal to serve a given user, say $v$, in a given state $(i,t)$ of the system, it is optimal to serve the same user when everything is identical except that the age of the same user ($v$) increases by one. Thus, everything else remaining constant, the optimal policy is monotone in the age of the users, a result we expect intuitively for any reasonable scheduling policy, but now proved rigorously for the optimal policy. Similarly, the rest of the theorems in the paper provide rigor to intuitively expected monotonicity in different settings.
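Theorems 3.1 and 3.2 can be checked numerically on a toy instance. The sketch below runs the value iteration (8) for two users with two channel states each and then tests both monotonicity properties on the resulting values. It is only an illustration under several assumptions that are not part of the paper: ages are capped at `T_CAP` so that the state grid is finite, the penalties are linear ($D_l(y) = c_l y$), and all numerical values are hypothetical. Users and channel states are 0-indexed.

```python
from itertools import product

N, T_CAP, ALPHA = 2, 4, 0.9           # users, age cap (assumption), discount
RATES = [1.0, 3.0]                    # r_1 < r_2 for the two channel states
TPMS = [[[0.7, 0.3], [0.4, 0.6]],     # per-user channel TPMs (hypothetical)
        [[0.5, 0.5], [0.2, 0.8]]]
COST = [0.5, 1.0]                     # D_l(y) = COST[l] * y (hypothetical)

CH = list(product(range(2), repeat=N))            # joint channel states i
AGES = list(product(range(T_CAP + 1), repeat=N))  # age vectors t

def p_joint(i, j):
    """Joint transition probability p_ij of independent users."""
    prob = 1.0
    for u in range(N):
        prob *= TPMS[u][i[u]][j[u]]
    return prob

def t_next(t, u):
    """t^u of Equation (3), with ages capped at T_CAP to keep the grid finite."""
    return tuple(0 if l == u else min(t[l] + 1, T_CAP) for l in range(N))

def W(t, u):
    """W_u(t) of Equation (4) for the linear penalties above."""
    return sum(COST[l] * t[l] for l in range(N) if l != u)

def action_values(V, i, t):
    """r_{i_u} - W_u(t) + alpha * h_u(i, t) for every action u."""
    return [RATES[i[u]] - W(t, u)
            + ALPHA * sum(p_joint(i, j) * V[(j, t_next(t, u))] for j in CH)
            for u in range(N)]

V = {(i, t): 0.0 for i in CH for t in AGES}
for _ in range(300):                  # value iteration, Equation (8)
    V = {(i, t): max(action_values(V, i, t)) for i in CH for t in AGES}

EPS = 1e-9                            # tolerance for floating-point comparisons

def bump(t, u):
    """Return t + e_u."""
    return tuple(y + 1 if l == u else y for l, y in enumerate(t))

# Theorem 3.1: V(i, t) does not increase when any single age increases.
value_monotone = all(
    V[(i, t)] >= V[(i, bump(t, u))] - EPS
    for i in CH for t in AGES for u in range(N) if t[u] < T_CAP)

# Theorem 3.2: if v is served in (i, t), v is still optimal in (i, t + e_v).
policy_monotone = True
for i in CH:
    for t in AGES:
        q = action_values(V, i, t)
        v = q.index(max(q))
        if t[v] < T_CAP:
            q2 = action_values(V, i, bump(t, v))
            policy_monotone &= q2[v] >= max(q2) - EPS
```

Both flags should come out `True` on this instance. The age cap preserves the argument of the proofs because the capped map $t \mapsto t^u$ is still nondecreasing in every component of $t$.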

3.2. Monotonicity in Rate

The MDP model has been formulated to maximize the infinite horizon expected total discounted net reward. The net reward over one time period in a given state $(i,t)$ equals the data rate of the user that is chosen to be served minus the penalty accrued by all other users. We expect the optimal policy to be monotone in the rate that can be potentially available to the users. In particular, we expect that if the optimal policy serves user $v$ in state $(i,t)$, then it will serve $v$ in state $(i + e_v, t)$ as well. We prove this in Theorem 3.3. The proof of Theorem 3.3 is similar to (but more tedious than) the proof of Theorem 3.2. However, it needs the additional condition of stochastic monotonicity of DTMCs, see [7].
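For a chain on a totally ordered state space, stochastic monotonicity is commonly stated as follows: for every threshold $k$, the tail probability $\sum_{j \geq k} p_{ij}$ is nondecreasing in the current state $i$. The sketch below checks this condition for a single TPM. It is meant only as an illustration of the condition for one user's channel chain (0-indexed states); the function name and both example matrices are hypothetical, and the theorem itself is stated for the joint chain.

```python
def is_stochastically_monotone(P, tol=1e-12):
    """True if every tail sum sum_{j >= k} P[i][j] is nondecreasing in i."""
    m = len(P)
    for k in range(m):
        tails = [sum(row[k:]) for row in P]
        if any(tails[i] > tails[i + 1] + tol for i in range(m - 1)):
            return False
    return True

# Hypothetical examples: higher states make high next states more likely.
P_mono = [[0.7, 0.2, 0.1],
          [0.3, 0.4, 0.3],
          [0.1, 0.3, 0.6]]
# Here the low state jumps up and the high state jumps down.
P_not = [[0.1, 0.9],
         [0.8, 0.2]]
```

Intuitively, the first matrix describes a channel in which a good current state makes good future states more likely, which is the kind of positive dependence the rate monotonicity result relies on.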

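Theorem 3.3 If the Markov chain $\{X^n, n \geq 1\}$ is stochastically monotone, then

$$ \mathrm{dec}(i,t) = v \Rightarrow \mathrm{dec}(i + e_v, t) = v. \qquad (18) $$

Proof. Since $\mathrm{dec}(i,t) = v$ we have,

$$ r_{i_v} - W_v(t) + \alpha h_v(i,t) \geq r_{i_u} - W_u(t) + \alpha h_u(i,t), \quad \forall u \in A. \qquad (19) $$

To prove $\mathrm{dec}(i + e_v, t) = v$, we need to prove

$$ r_{(i+e_v)_v} - r_{(i+e_v)_u} - W_v(t) + W_u(t) + \alpha \left[ h_v(i+e_v,t) - h_u(i+e_v,t) \right] \geq 0, \quad \forall u \in A. \qquad (20) $$

To establish this, we can first prove

$$ V(i+e_v, t^v) - V(i+e_v, t^u) \geq V(i, t^v) - V(i, t^u) \qquad (21) $$

by considering the set of exhaustive cases similar to the proof of Theorem 3.1. Stochastic monotonicity then implies

$$ h_v(i+e_v,t) - h_u(i+e_v,t) \geq h_v(i,t) - h_u(i,t), \qquad (22) $$

which, together with (19) and $r_{(i+e_v)_v} > r_{i_v}$, yields (20), as required.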

3.3. Long-Run Average Reward Criterion

In this section we extend the results of Sections 3.1 and 3.2 to the long-run average reward criterion. As is well known, the objective in long-run average reward models is to maximize the long-run average reward, instead of the expected total discounted reward as considered in (5). If $NR_\pi(n)$ denotes the net reward at time $n$ under policy $\pi$, then the objective of discounted reward models is to find a $\pi$ that maximizes

$$ E\left[ \sum_{n=1}^{\infty} \alpha^{n-1} NR_\pi(n) \right] $$

for a given $0 < \alpha < 1$ and initial state $(i,t) \in \Omega^N \times Z^N$. The objective of long-run average reward models for the same dynamics, on the other hand, is to determine the policy that maximizes

$$ \lim_{K \to \infty} \frac{1}{K} E\left[ \sum_{n=1}^{K} NR_\pi(n) \right]. $$

It is well known [8] that a long-run average reward optimal policy $\{u(i,t) : (i,t) \in \Omega^N \times Z^N\}$ exists if there is a constant $g$ (also called the gain) and a bias function $w(i,t)$ satisfying

$$ g + w(i,t) = \max_{u = 1, 2, \ldots, N} \left\{ r_{i_u} - W_u(t) + \sum_{j \in \Omega^N} p_{ij} w(j, t^u) \right\}. \qquad (23) $$

The intuitive explanation of $g$ and the bias function can be found in [3]. Here we end with the result that any $u$ that maximizes $r_{i_u} - W_u(t) + \sum_{j} p_{ij} w(j, t^u)$ over all $u \in \{1, \ldots, N\}$ is an optimal action $u(i,t)$ in state $(i,t)$.

Define a subset $S$ of the state space $\Omega^N \times Z^N$ by

$$ S = \left\{ (i,t) \in \Omega^N \times Z^N : t_u = 0 \text{ for exactly one } u, \text{ and } t_u \neq t_v \text{ for } u \neq v \right\}, \qquad (24) $$

i.e., a collection of states $(i,t)$ such that no two users have the same starvation age and exactly one user has a starvation age of 0. Consider any stationary policy $\pi : \Omega^N \times Z^N \to A$ of the original MDP introduced in the beginning of Section 2. Let $\{(X^n, Y^n), n \geq 1\}$ be the DTMC induced by $\pi$. Then we have the following lemma.

Lemma 3.4 $S$ is a closed communicating class of $\{(X^n, Y^n), n \geq 1\}$.

Proof. Let $(X^n, Y^n) \in S$ for some $n \geq 1$. Since $\{Y^n, n \geq 1\}$ evolves according to (1) and we serve exactly one user in every time period, $(X^{n+1}, Y^{n+1}) \in S$. It is straightforward to show that the states in $S$ communicate. Further, since $\{X^n, n \geq 1\}$ is a finite and irreducible DTMC, $S$ is closed and communicating, as required.
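The closure part of Lemma 3.4 depends only on the age vector, so it can be verified exhaustively for small cases. The sketch below tests membership in the age part of (24) and confirms that one step of the age evolution (1) never leaves the set, whichever user is served. The helper names, the 0-indexed users and the sizes `N = 3` and `MAX_AGE = 5` are hypothetical choices for this check.

```python
from itertools import product

def in_S(t):
    """Age part of (24): exactly one zero age and all ages distinct."""
    return t.count(0) == 1 and len(set(t)) == len(t)

def update_ages(t, served):
    """Age evolution of Equation (1)."""
    return tuple(0 if u == served else y + 1 for u, y in enumerate(t))

N, MAX_AGE = 3, 5   # hypothetical sizes for an exhaustive check
members = [t for t in product(range(MAX_AGE + 1), repeat=N) if in_S(t)]
closed = all(in_S(update_ages(t, u)) for t in members for u in range(N))
```

The reason the check succeeds is simple: after one step the served user has age 0, every other user has age at least 1, and adding one to distinct ages keeps them distinct.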

We note that as a result of Lemma 3.4 and the evolution of the age vector $\{Y^n, n \geq 1\}$, any state $(i,t) \notin S$ is transient. Therefore, we restrict ourselves to proving monotonicity of the optimal policy on $S$. Let $\{w(i,t) : (i,t) \in \Omega^N \times Z^N\}$ be the bias vector satisfying (23). To prove that the monotonicity in age is valid (over $S$) for the long-run average reward criterion, we need to prove that for $(i,t) \in S$,

$$ r_{i_v} - W_v(t) + \sum_j p_{ij} w(j, t^v) \geq r_{i_u} - W_u(t) + \sum_j p_{ij} w(j, t^u) $$

$$ \Rightarrow r_{i_v} - W_v(t+e_v) + \sum_j p_{ij} w(j, (t+e_v)^v) \geq r_{i_u} - W_u(t+e_v) + \sum_j p_{ij} w(j, (t+e_v)^u). \qquad (25) $$

To do this, we choose a fixed integer $T$ and for each $u \in A$ set

$$ D_u(t) = \infty, \quad t > T, \ u \in A. \qquad (26) $$



Now, consider two systems:
• System $A$: The MDP model described in the beginning of Section 2 with state space restricted to $S$ and with the extra condition (26).
• System $A'$: Identical to System $A$ except that any user with age $T$ has to be served. Therefore, the state space of this system is finite and is given by

$$ S' = \left\{ (i,t) \in S : t_u \leq T, \ \forall u \in A \right\}, \qquad (27) $$

and the transition probabilities and reward structure are the same as those of System $A$. Clearly, as $T \to \infty$, $S' \to S$.

Our goal is to prove that for the long-run average reward criterion, the optimal policy is monotone in age in System $A$. We will show in Theorem 3.5 that the monotonicity in age for the long-run average reward criterion holds for all fixed $T$ in System $A'$. Further, since Systems $A$ and $A'$ are equivalent in the total optimal discounted reward sense of (33), we will conclude that monotonicity in age for the long-run average reward criterion holds for System $A$. Note that $\mathrm{dec}(i,t)$ refers to the decision in state $(i,t)$. For the long-run average reward criterion, it is obtained using Equation (23) in a way similar to the discounted reward criterion, i.e., for the long-run average reward criterion,

$$ \mathrm{dec}(i,t) = \arg\max_{u} \left\{ r_{i_u} - W_u(t) + \sum_j p_{ij} w(j, t^u) \right\}. $$

Theorem 3.5 The optimal policy for the long-run average reward criterion is monotone in age in System $A'$, i.e., for $(i,t) \in S'$,

$$ \mathrm{dec}(i,t) = v \Rightarrow \mathrm{dec}(i,t+e_v) = v. \qquad (28) $$

Proof. Consider System $A'$. The state space $S'$ is finite and using (2), the one step reward is bounded below by $C_L = r_1 - N D(T)$ and above by $C_U = \max_k r_k$. Thus the absolute value of the one step reward is bounded by $C = \max(|C_L|, |C_U|)$. Let $V_\alpha(i,t)$ be the optimal expected total discounted reward of System $A'$ starting in state $(i,t) \in S'$. Then $V_\alpha(i,t)$ satisfies the standard Bellman equation given by (5). Using results in chapter 3 of [6], for a fixed $(k,m) \in S'$,

$$ \left| V_\alpha(i,t) - V_\alpha(k,m) \right| < C' < \infty \quad \text{for } (i,t) \in S', \ 0 < \alpha < 1, \qquad (29) $$

where $C'$ is a positive constant. Then from Ross [9], there exists a constant $g$ and bias function $w(i,t)$ satisfying (23) and given by

$$ g = \lim_{\alpha \to 1} (1 - \alpha) V_\alpha(k,m), \qquad w(i,t) = \lim_{\alpha \to 1} \left[ V_\alpha(i,t) - V_\alpha(k,m) \right]. \qquad (30) $$

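Theorem 3.2 implies that

$$ r_{i_v} - W_v(t) + \alpha \sum_j p_{ij} V_\alpha(j, t^v) \geq r_{i_u} - W_u(t) + \alpha \sum_j p_{ij} V_\alpha(j, t^u) $$

$$ \Rightarrow r_{i_v} - W_v(t+e_v) + \alpha \sum_j p_{ij} V_\alpha(j, (t+e_v)^v) \geq r_{i_u} - W_u(t+e_v) + \alpha \sum_j p_{ij} V_\alpha(j, (t+e_v)^u). \qquad (31) $$

Subtracting $\alpha V_\alpha(k,m) = \alpha \sum_j p_{ij} V_\alpha(k,m)$ on both sides of both the inequalities in (31) and taking the limit as $\alpha \to 1$, we get

$$ r_{i_v} - W_v(t) + \sum_j p_{ij} w(j, t^v) \geq r_{i_u} - W_u(t) + \sum_j p_{ij} w(j, t^u) $$

$$ \Rightarrow r_{i_v} - W_v(t+e_v) + \sum_j p_{ij} w(j, (t+e_v)^v) \geq r_{i_u} - W_u(t+e_v) + \sum_j p_{ij} w(j, (t+e_v)^u), \qquad (32) $$

using (30). Equation (32) implies (28), as required.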

Thus the optimal policy of System $A'$ is monotone in age for every $T$. Let $\tilde{V}_\alpha(i,t)$ be the optimal expected total discounted reward of System $A$ starting in state $(i,t) \in S'$. From the definition of Systems $A$ and $A'$, it is clear [8] that

$$ \tilde{V}_\alpha(i,t) = V_\alpha(i,t) \quad \text{for } (i,t) \in S'. \qquad (33) $$

From Equations (33) and (29) through (32) it is clear that System $A$ is monotone in age over $S'$ constructed using any fixed $T$. Since $S' \to S$ as $T \to \infty$, we can conclude that the optimal policy of the MDP introduced in the beginning of Section 2 is monotone in age over $S$ for the long-run average reward criterion.

Theorem 3.3 can be shown to hold in the long-run average reward case similarly and we omit the details for the sake of brevity expected in a correspondence.

3.4. Index Policy and Its Monotonicity

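Now we consider the index policy proposed by Bolia and Kulkarni in [3]. It is summarized here for completeness. The decision $v(i,t)$ in state $(X^n, Y^n) = (i,t)$ according to the index policy is given as follows: each user $u$ is assigned an index $I_u(i,t)$, computed from its rate $r_{i_u}$, its age $t_u$ and the parameters $K_u$ and $q_u$ (see [3] for the explicit expression), and the user with the largest index is served:

$$ v(i,t) = \arg\max_{u = 1, 2, \ldots, N} I_u(i,t). \qquad (34) $$

Here $K_u$ and $q_u$ are user dependent parameters that do not change with the state of the system (their admissible ranges are as defined in [3]). We prove the monotonicity of the index policy in age and rate below.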

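Theorem 3.6 The Index Policy is monotone in age and rate, i.e.,

$$ v(i,t) = w \Rightarrow v(i, t+e_w) = w, \qquad (35) $$

$$ v(i,t) = w \Rightarrow v(i+e_w, t) = w. \qquad (36) $$

Proof. The left hand side of (35) implies that, for all $u \in A$,

$$ I_w(i,t) \geq I_u(i,t). \qquad (37) $$

The index $I_u$ depends on the state only through $r_{i_u}$ and $t_u$. Increasing $t_w$ by one therefore leaves $I_u$, $u \neq w$, unchanged and does not decrease $I_w$. Therefore,

$$ I_w(i, t+e_w) \geq I_u(i, t+e_w), \quad \forall u \in A, \qquad (38) $$

yielding $v(i, t+e_w) = w$, which proves (35). Similarly, the left hand side of (36) implies (37), and since $r_{i_w + 1} > r_{i_w}$,

$$ I_w(i+e_w, t) \geq I_u(i+e_w, t), \quad \forall u \in A, $$

yielding $v(i+e_w, t) = w$, which proves (36).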

4. Conclusion

We considered a cellular data network, i.e., a system with a fixed number of buffers having time-slotted Markov modulated departures and arrivals. The scheduling problem was modeled as an MDP and several structural (monotonicity) properties of its optimal policy were proven. Although the entire analysis was carried out in the context of scheduling for wireless cellular data transfer, we emphasize that the structural properties hold true for any system with infinitely backlogged queues.

REFERENCES

[1] P. Bender, P. Black, M. Grob, R. Padovani, N. Sindhushayana and A. Viterbi, “CDMA/HDR: A Bandwidth Efficient High Speed Wireless Data Service for Nomadic Users,” IEEE Communications Magazine, Vol. 38, No. 7, 2000, pp. 70-77. doi:10.1109/35.852034

[2] L. Georgiadis, M. J. Neely and L. Tassiulas, “Resource Allocation and Cross-Layer Control in Wireless Networks,” Foundations and Trends in Networking, Vol. 1, No. 1, 2006, pp. 1-144. doi:10.1561/1300000001

[3] N. Bolia and V. Kulkarni, “Index Policies for Resource Allocation in Wireless Networks,” IEEE Transactions on Vehicular Technology, Vol. 58, No. 4, 2009, pp. 1823-1835. doi:10.1109/TVT.2008.2005101

[4] D. Tse, “Multiuser Diversity in Wireless Networks,” 2011. http://www.eecs.berkeley.edu/~dtse/stanford416.ps

[5] M. Andrews, “Instability of the Proportional Fair Scheduling Algorithm for HDR,” IEEE Transactions on Wireless Communications, Vol. 3, No. 5, 2002, p. 2004.

[6] Q. Hu and W. Yue, “Markov Decision Processes with Their Applications,” Springer, New York, 2008.

[7] I. Kadi, N. Pekergin and J. M. Vincent, “Analytical and Stochastic Modeling Techniques and Applications,” Springer, New York, 2009.

[8] M. Puterman, “Markov Decision Processes: Discrete Stochastic Dynamic Programming,” John Wiley & Sons, Inc, New York, 1994.

[9] S. M. Ross, “Introduction to Stochastic Dynamic Programming,” Academic Press, Inc., New York, 1983.