Int'l J. of Communications, Network and System Sciences
Vol. 7, No. 12 (2014), Article ID: 52032, 10 pages
10.4236/ijcns.2014.712050
Playing against Hedge
Miltiades E. Anagnostou1, Maria A. Lambrou2
1School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece
2Department of Shipping, Trade and Transport, University of the Aegean, Chios, Greece
Email: miltos@central.ntua.gr, mlambrou@aegean.gr
Copyright © 2014 by authors and Scientific Research Publishing Inc.
This work is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0/



Received 10 October 2014; revised 20 November 2014; accepted 1 December 2014
ABSTRACT
Hedge has been proposed as an adaptive scheme, which guides the player’s hand in a multi-armed bandit full information game. Applications of this game exist in network path selection, load distribution, and network interdiction. We perform a worst case analysis of the Hedge algorithm by using an adversary, who will consistently select penalties so as to maximize the player’s loss, assuming that the adversary’s penalty budget is limited. We further explore the performance of binary penalties, and we prove that the optimum binary strategy for the adversary is to make greedy decisions.
Keywords:
Hedge algorithm, adversary, online algorithm, greedy algorithm, periodic performance, binary penalties, path selection, network interdiction

1. Introduction
The problems of adaptive network path selection and load distribution have often been considered as games that are played simultaneously and independently by agents controlling flows in a network. A possible abstraction of these and other related problems is the bandit game. In the multi-armed bandit game [1] a player chooses one out of $N$ strategies (or “machines” or “options” or “arms”). A loss or penalty $\ell_i$ (or a reward, which can be modeled as a negative loss) is assigned to each strategy $i$ after each round of the game.
An agent facing repeated selections will possibly try to exploit the experience accumulated so far. A popular algorithm that can guide the agent in each selection round is the multiplicative updates algorithm, or Hedge. In this paper we calculate the worst possible performance of Hedge by using the adversarial technique, i.e. we investigate the behavior of an intelligent adversary, who tries to maximize the player’s cumulative loss. In the remainder of Section 1 we describe Hedge; in Section 2 we give a rigorous formulation of the adversary’s problem; in Section 3 we give a recursive solution; and in Section 4 we present sample numerical results. Finally, in Section 5 we explore binary adversarial strategies. Our main result is that the greedy adversarial strategy is optimal among binary strategies.
1.1. The bandit game
In a generalized bandit game the player is allowed to play mixed strategies, i.e. to assign a fraction $p_i$ (such that $\sum_{i=1}^{N} p_i = 1$) of the total bet to option $i$, thereby getting a loss equal to $\sum_{i=1}^{N} p_i \ell_i$. Alternatively, $p_i$ can be interpreted as the probability that the player assigns the whole bet to option $i$. In the “bandit” version only the total loss $\sum_{i=1}^{N} p_i \ell_i$ is announced to the player, while in the “full information” version the whole penalty vector $(\ell_1, \ldots, \ell_N)$ is announced.
A game consists of $T$ rounds; a superscript $(t)$ marks the $t$th round. Apparently the player will try to minimize the total cumulative loss

$L = \sum_{t=1}^{T} \sum_{i=1}^{N} p_i^{(t)} \ell_i^{(t)}$

by controlling the bet distribution, i.e. by properly selecting the variables $p_i^{(t)}$. We assume that the loss budget is limited in each round by setting the constraint $\sum_{i=1}^{N} \ell_i^{(t)} \le 1$. The player, who has no prior knowledge of the penalties, will try to minimize his or her total cumulative loss. An extremely lucky player, or a player with “inside information”, would select the minimum penalty option in each round and would put all his or her bet on this option, thereby achieving a total loss equal to $\sum_{t=1}^{T} \min_i \ell_i^{(t)}$.
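The loss bookkeeping above can be sketched in a few lines of Python. This is a hypothetical illustration: the penalties are drawn at random and normalized so that each round’s budget is 1, and a uniform mixed strategy stands in for an arbitrary player; none of the specific values come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

T, N = 100, 3                        # rounds and options (illustrative values)
penalties = rng.uniform(size=(T, N))
penalties /= penalties.sum(axis=1, keepdims=True)  # per-round budget: each row sums to 1

# A player betting the uniform mixed strategy p = (1/N, ..., 1/N) in every round.
p = np.full(N, 1.0 / N)
player_loss = float((penalties @ p).sum())

# The extremely lucky player picks the minimum-penalty option in each round.
lucky_loss = float(penalties.min(axis=1).sum())

# The best single arm in hindsight (the usual benchmark of regret bounds).
best_arm_loss = float(penalties.sum(axis=0).min())

# Sanity check: clairvoyance beats the best fixed arm, which beats uniform betting.
assert lucky_loss <= best_arm_loss <= player_loss
```

The ordering in the final assertion always holds: the per-round minimum can only undercut any fixed arm, and the uniform player’s loss is the average of the arm totals, which is at least the smallest of them.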
1.2. The Hedge algorithm
Quite a few algorithmic solutions, which guide the player’s hand in the full information game, have appeared in the literature. Freund and Schapire have proposed the Hedge algorithm [2] for the full information game. Auer, Cesa-Bianchi, Freund and Schapire have proposed the Exp3 algorithm in [3]. Allenberg-Neeman and Neeman proposed a Hedge variant, the GL (Gain-Loss) algorithm, for the full information game with gains and losses [4]. Dani, Hayes and Kakade have proposed the GeometricHedge algorithm in [5], and a modification was proposed by Bartlett, Dani et al. in [6]. Recently Cesa-Bianchi and Lugosi have proposed the ComBand algorithm for the bandit version [7]. A comparison can be found in [8].
Hedge maintains a vector of weights $\mathbf{w}^{(t)} = (w_1^{(t)}, \ldots, w_N^{(t)})$ and bets the fractions $p_i^{(t)} = w_i^{(t)} / \sum_{j=1}^{N} w_j^{(t)}$. After the penalties of round $t$ are announced, the weights are determined so as to reflect the loss results, i.e. $w_i^{(t+1)} = w_i^{(t)} \beta^{\ell_i^{(t)}}$, where $\beta \in (0, 1)$ is a constant parameter of the algorithm. Heavily penalized options thus see their weights, and consequently their share of the bet, reduced in subsequent rounds.
In [9] Auer, Cesa-Bianchi, Freund and Schapire have proved that the expected Hedge performance and the expected performance of the best arm differ at most by a term of order $\sqrt{T \ln N}$; they have also derived a loss upper bound, which relates the total cumulative loss with the total loss of the best arm.
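The algorithm is short enough to state as code. The sketch below follows the multiplicative update described above, with uniform initial weights; the default $\beta = 0.5$ is an illustrative choice, not a value prescribed by the paper.

```python
import numpy as np

def hedge(penalties, beta=0.5):
    """Run Hedge on a (T, N) penalty matrix; return the total cumulative loss."""
    T, N = penalties.shape
    w = np.ones(N)                       # uniform initial weights
    total = 0.0
    for t in range(T):
        p = w / w.sum()                  # bet fractions derived from the weights
        total += p @ penalties[t]        # full-information round loss
        w = w * beta ** penalties[t]     # multiplicative weight update
    return total
```

For instance, on a penalty matrix whose first column is always 1 and whose second column is always 0, the weight of option 1 decays geometrically and the bet migrates to option 2, so the total loss stays well below the $T$ units a player stuck on option 1 would pay.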
1.3. Competitive analysis
The competitive analysis of an algorithm compares its performance on the worst possible input with the performance of an optimal algorithm that knows the input in advance. According to S. Irani and A. Karlin (in Section 13.3.1 of [10]) a technique in finding bounds is to use an “adversary”, who plays against the online algorithm and selects the input so as to degrade its performance as much as possible. We adopt this technique in the analysis of Hedge below.
1.4. Interpretations and applications
In this section we offer some interpretations from the areas of 1) communication networks and 2) transportation. The general setting of course involves a number of options or arms, which must be selected by a player without any knowledge of the future.
Bandit models have been used in quite diverse decision making situations. In [11] He, Chen, Wang and Liu have used a bandit model for the maximization of the revenue of a search engine provider, who charges for advertisements on a per-click basis. They have subsequently defined the “armed bandit problem with shared information”; arms are partitioned into groups and loss information is shared only among players using arms of the same group. In [12] Park and Lee have used a multi-armed bandit model for lane selection in automated highways and autonomous vehicle traffic control.
1.4.1. Traffic load distribution
This first application example can take multiple interpretations, which always involve a selection in a competitive environment, in which competition is limited. It can be seen as 1) a path selection problem in networking, 2) a transport means (mode) choice or path selection problem, or 3) a computational load distribution problem, which we mention at the end of this section. Firstly, we describe the problem in the context of networking.
Consider a population of agents, each of which routes a traffic flow from a source to a destination node, selecting in each round one of $N$ available parallel paths. The delay experienced on a path in a given round plays the role of the penalty, and it is assumed proportional to the total load of the path. This linearity assumption is a simplification of the standard delay models [13] and can be realistic if a network resource still operates in the linear region of the delay vs. load curve, e.g. when delay is calculated in a link, which operates not very close to capacity. Agent decisions are taken independently and simultaneously, so that each agent perceives the aggregate behavior of the others as an unpredictable sequence of penalties, which is exactly the setting of the full information bandit game. An extremely lucky agent would in each round route its traffic over the least loaded path; in the extreme case this would be an empty path, thereby achieving a zero delay.
The above problem can also be formulated as a more general problem of distributing workload over a collection of parallel resources (e.g. distributing jobs to parallel processors). A. Blum and C. Burch have used the following motivating scenario in [14]: A process runs on some machine in an environment with $N$ machines, whose costs vary with time; in each time step the process may stay on its current machine or migrate to another one, and it aims at minimizing its total cost.
1.4.2. Interdiction
Although an adversary is usually a “technical” (fictional) concept, which serves the worst case analysis of online algorithms, in some environments a real adversary, who intentionally tries to oppose a player, does exist. An example is the interdiction problem.
We present a version of the interdiction problem in a network security context. An attacker directs streams of harmful packets to a set of $N$ network resources, distributing a limited attack budget among them in each round, while the defender assigns a defense mechanism of some intensity to each resource, again subject to a budget constraint. The damage inflicted on a resource grows with the attack intensity and decreases with the defense intensity, so that each side repeatedly reallocates its budget in response to the observed behavior of the other [15].
Similar interpretations exist in transportation network environments, as in border and customs control, including illegal immigration control. An interdiction problem formulation can be used in a maritime transport security context: pirates attack the vessels traversing a maritime route. In [16] Vaněk et al. assign the role of the player to the pirate. The pirate operates in rounds, starting and finishing in his home port. In each round he selects a sea area (arm) to sail to and search for possible victim vessels. A patrol force distributes the available escort resources to sea areas (arms), and the pirate’s gains are inversely proportional to the strength of the defender’s forces in each area. Naval forces periodically reallocate their own resources to sea areas.
2. Problem formulation
In this paper we aim at finding the worst case performance of Hedge. Effectively, we try to solve the following problem:
Problem 1. Given a number of options $N$, a number of rounds $T$, a Hedge parameter $\beta \in (0, 1)$ and initial weights $w_i^{(1)}$, find the penalties $\ell_i^{(t)}$ that maximize the total cumulative loss

$L = \sum_{t=1}^{T} \sum_{i=1}^{N} p_i^{(t)} \ell_i^{(t)}, \qquad (2)$

where the bet fractions $p_i^{(t)}$ are produced by Hedge, subject to $\ell_i^{(t)} \ge 0$ and to the per round penalty budget constraint $\sum_{i=1}^{N} \ell_i^{(t)} \le 1$ for $t = 1, \ldots, T$.

Clearly the objective function (2) is a function of a) the penalties and b) the initial weights, i.e. it contains $NT$ independent penalty variables in total. In the following we use normalized weights, since the decisions of Hedge depend on the weights only through their ratios.
3. Recursion
Assuming that a given round starts with weights $w_1, \ldots, w_N$, the loss of this round is $\sum_{i=1}^{N} \ell_i w_i / \sum_{j=1}^{N} w_j$, and the round ends with the updated weights $w_i \beta^{\ell_i}$. Then, the total loss of a $T$ round game can be expressed recursively: if $V_T(w_1, \ldots, w_N)$ denotes the maximum total loss of a $T$ round game that starts with the given weights, then

$V_T(w_1, \ldots, w_N) = \max_{\ell_1, \ldots, \ell_N} \left[ \frac{\sum_i \ell_i w_i}{\sum_j w_j} + V_{T-1}\left(w_1 \beta^{\ell_1}, \ldots, w_N \beta^{\ell_N}\right) \right],$

where the maximum is taken over the admissible penalties and $V_0 = 0$. Note that the term $V_{T-1}$ refers to a shorter game of the same structure, so that the solution can be built inductively on the number of rounds. Assuming that the solution to Problem 1 is available for all games of up to $T-1$ rounds, the above formula yields the solution for $T$ rounds.

The optimal penalties can be computed also recursively. Let $\ell_i^{*}$ be the penalties that achieve the maximum in the first round; the game then continues as a $(T-1)$ round game with weights $w_i \beta^{\ell_i^{*}}$. In all other rounds the adversary again selects the penalties so that the remaining total loss is maximized, i.e. such that the maximum of the corresponding shorter game is achieved, using penalties computed in the same manner.
4. Two option games and numerical results
In this section we exploit the recursive methodology, which has been presented in the previous section, in order to provide some numerical results for two option games. We compare these results with available bounds in the literature. We consider $N = 2$ options; since only the weight ratio matters, we drop the extended notation and use the more compact version in which $w \in [0, 1]$ denotes the normalized weight of the first option. The maximum loss of a single round game is given by the maximization of the round loss $w \ell_1 + (1 - w) \ell_2$ over the admissible penalties. Also, since the initial weights are normalized, every game is characterized by the single variable $w$, and the recursion is simplified to a one dimensional iteration on the functions $V_T(w)$, with the updated weight after a round equal to $w \beta^{\ell_1} / (w \beta^{\ell_1} + (1 - w) \beta^{\ell_2})$.

The iteration starts from the single round game, which involves a single penalty variable. Apparently the adversary will choose binary values, i.e. will assign the whole unit budget to one of the two options, so that the single round maximum loss equals $\max(w, 1 - w)$. The graph of $V_T(w)$ turns out to be a piecewise linear function of $w$, with break points in $(0, 1)$.
Figure 1. Plot of the maximum total loss as a function of the initial weight.

If we attempt to find analytical expressions for the maximum total loss and the associated penalties, the analysis becomes quite complicated even for small values of the number of rounds $T$. Instead we have implemented a numerical computation based on the recursion. We represent each function $V_T$ by its values on a dense set of samples in the interval $[0, 1]$; in each iteration we use the result as input to the recursion and create the sampled version of the function for one more round, which allows us to calculate the maximum total loss for any number of rounds and any value of the initial weight $w$. The numerical results confirm that the shape of $V_T$ remains piecewise linear as $T$ grows.

The optimal penalties can be determined from the recursion as the values achieving the maximum in each round: in most rounds the adversary assigns the whole penalty budget to the option with the greatest weight factor. However, the penalties of the first rounds of a game may be non-binary.
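As a rough illustration of the sampled recursion, the following sketch tabulates the maximum total loss of a two option game on a grid of initial weights. It assumes that the adversary spends the whole unit budget in each round, so that $\ell_2 = 1 - \ell_1$; the grid sizes and the value of $\beta$ are illustrative choices, not values from the paper.

```python
import numpy as np

def max_total_loss(T, beta=0.5, n_w=201, n_l=101):
    """Tabulate V_t(w), the maximum total loss of a t-round game,
    on a grid of initial (normalized) weights w of option 1."""
    ws = np.linspace(0.0, 1.0, n_w)          # sampled initial weights
    ls = np.linspace(0.0, 1.0, n_l)          # candidate penalties for option 1
    V = np.zeros(n_w)                        # V_0(w) = 0: an empty game has no loss
    for _ in range(T):
        V_new = np.empty(n_w)
        for i, w in enumerate(ws):
            loss = w * ls + (1.0 - w) * (1.0 - ls)   # round loss per candidate penalty
            u1 = w * beta ** ls                       # unnormalized updated weights
            u2 = (1.0 - w) * beta ** (1.0 - ls)
            w_new = u1 / (u1 + u2)                    # normalized weight of option 1
            # Bellman step: round loss plus the (interpolated) value of the shorter game.
            V_new[i] = np.max(loss + np.interp(w_new, ws, V))
        V = V_new
    return ws, V
```

For $T = 1$ the tabulated function reproduces the single round result $\max(w, 1 - w)$, and for larger $T$ the computed curves stay piecewise linear, in line with the discussion above.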
5. Binary and greedy schemes
The penalty values in the first two rounds in the example of Figure 2 prove that the adversary’s optimal penalties are not necessarily binary. However, in this example the total loss of the best binary policy turns out to be very close to the optimal total loss.

On the other hand the optimal adversarial policy with binary penalties can be found exhaustively, by evaluating the total loss of every binary penalty sequence, where in each round the whole unit budget is assigned to a single option. Apparently, the complexity of this calculation grows exponentially with the number of rounds $T$.
Figure 2. Plot of the optimal penalties.

A “greedy adversary” is eager to punish the maximum weight option as much as possible in each round. Thus the adversary will assign exactly one unit of penalty to the maximum current weight option, and zero penalties to all other options. Given a sufficient number of rounds, the weights in a two option game are “equalized”, so that any two normalized weights become almost equal. Once equalization is achieved, a periodic phenomenon starts and the greedy penalties form a rotation scheme.
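A short simulation makes the rotation scheme visible. This is a sketch under the same assumptions as above; starting from equal weights (an illustrative choice) skips the equalization phase, so the rotation begins immediately.

```python
import numpy as np

def greedy_adversary(T, beta=0.5, N=2):
    """Greedy adversary vs. Hedge: the whole unit penalty goes to the
    current maximum-weight option in every round."""
    w = np.ones(N)                  # equalized initial weights
    total = 0.0
    hits = []                       # which option was punished in each round
    for _ in range(T):
        p = w / w.sum()
        k = int(np.argmax(w))       # the maximum current weight option
        total += p[k]               # round loss: only option k carries a penalty
        w[k] *= beta                # Hedge update for a unit penalty
        hits.append(k)
    return total, hits
```

Starting from equal weights in a two option game, the punished option alternates 0, 1, 0, 1, … from the very first round, i.e. the greedy penalties form the rotation scheme described above.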
5.1. Greedy behavior
We explore the greedy pattern in a two option game; the argument can easily be generalized to $N$ options. Assume initial weights $w_1 > w_2$. The greedy adversary repeatedly assigns the unit penalty to the first option, multiplying its weight by $\beta$, until the weight of the second option becomes for the first time greater than the weight of the first option, and a loss equal to 1 is assigned to the second option. Therefore, in the next step the roles of the two options are interchanged, and from this point on the adversary punishes the two options alternately, so that the two weights keep oscillating within a factor of $\beta$ of each other. The value of the per round loss then also oscillates around a constant value, and the total loss of the periodic phase grows linearly with the number of rounds; for an even positive integer number of rounds it equals the number of rounds times the average per round loss.

In a game with more than two options it is straightforward to show that in the “steady” (periodic) state weights tend to become equal, i.e. almost equal to $1/N$ after normalization, while the greedy penalties rotate over the options; the per round loss is given by the normalized weight of the currently punished option.
5.2. Optimality of the greedy behavior
The following proposition provides a simple polynomial solution to the problem of finding the optimal binary adversary.
Proposition 1. The greedy strategy is optimal for the adversary among all strategies with binary penalties. □
Proof: Due to normalization of weights and penalties, in the proof we mention only option 1 weights and penalties. Assuming an initial weight $w$, the weight that emerges before the $(n + 1)$th round is determined by the number of unit penalties assigned to option 1 during the first $n$ rounds. Two moves are available to the adversary in each step: either i) to assign a penalty equal to 1 to option 1, which will produce an incremental loss equal to the current normalized weight of option 1 and will decrease this weight, or ii) to assign a zero penalty to option 1 (i.e. a unit penalty to option 2), which will produce a loss equal to the complementary weight and will cause the weight of option 1 to increase.

This looks like a new game, in which the adversary is the player. The player’s status is determined by a real number, namely the current weight of option 1. Each of the two possible moves collects a reward (the incremental loss inflicted on Hedge) and brings the player to a new status, one step to the left or to the right of the current one. The reward, viewed as a function of the status, is concave; it increases up to a maximum and decreases beyond it. The greedy strategy keeps the status oscillating around the point of maximum reward, while any other strategy spends rounds further away from it, in a region of lower rewards.

The main idea behind this sketch of proof is that a retreat (with consequent low profits) is never a good investment for the future. Assume that strategy (a) plays greedily for the rest of the game, while strategy (b) deviates for a number of rounds and then returns, so that the status of (a) is eventually reached again. In the rest of the game, i.e. until the $T$th round, the two strategies collect equal rewards. If (b) has spent its deviating rounds further from the maximum than (a), then every deviating move of (b) collects a smaller reward than the corresponding move of (a), and the difference between the cumulative reward on curve (b) and curve (a) is non-positive.
Figure 3. Sample paths of player behavior, which are used in the proof of Proposition 1.
Effectively we need to show that the cumulative reward collected by the greedy sequence of moves is at least as large as the cumulative reward of any other binary sequence with the same starting and ending status. Let the two sequences be aligned move by move. In the worst case, which has just been mentioned, the deviating sequence spends all of its intermediate moves in the region of lower rewards. However, even then the greedy sequence is not outperformed. This is a consequence of the following lemma:
Lemma 1. For any concave function $f$, the increment of $f$ over an interval of given length does not increase as the interval is shifted to the right, i.e. if $a \le a'$, then $f(a' + h) - f(a') \le f(a + h) - f(a)$ for any $h > 0$. (15)

Inequality (15) holds because of the mean value theorem, which states that there is a point inside each interval at which the derivative of $f$ equals the slope of the corresponding chord; the derivative of a concave function is non-increasing, therefore the chord of the interval lying further to the right has the smaller slope. The argument is easily generalized to any same length intervals, even overlapping ones.

Due to (15), each successive equal length step to the right collects a smaller reward increment than the previous one; hence, of all intervals of the same length, which begin to the left of the maximum of the concave reward function, the one ending closest to the maximum collects the greatest reward. If a sequence of moves contains a “fold”, i.e. a retreat followed by a return over the same statuses, then, by the concavity of the reward function, the moves of the fold collect less reward than the corresponding greedy moves, which remain closer to the maximum.

Therefore we have seen so far that a sequence of penalties, which begins at some status and contains a fold, collects a total reward that does not exceed the total reward of the respective greedy sequence, which starts at the same status and contains no folds.
Figure 4. Reduction of a sequence of penalties, which contains a fold, to a sequence without folds and with improved total reward.
Suppose that the initial position of the game is at an arbitrary status. By repeatedly applying the above reduction, any binary sequence of penalties can be transformed into a sequence without folds, i.e. into the greedy sequence, without decreasing the total loss inflicted on the player. Hence no binary strategy outperforms the greedy one, which completes the proof. ∎
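The slope comparison behind Lemma 1 is easy to check numerically for a sample concave function (here $f(x) = \sqrt{x}$, purely an illustration): the increment of $f$ over an interval of fixed length shrinks as the interval slides to the right.

```python
import math

f = math.sqrt                      # a sample concave function on [0, inf)
h = 0.5                            # common interval length

# Increments of f over the intervals [a, a + h] for increasing left endpoints a.
increments = [f(a + h) - f(a) for a in [0.0, 0.25, 1.0, 2.0, 5.0]]

# Concavity implies the increments are non-increasing in a, as in inequality (15).
assert all(x >= y for x, y in zip(increments, increments[1:]))
```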
6. Conclusion
We summarize the main results of this paper: A worst case (adversarial) performance analysis of the Hedge algorithm has been presented, under the assumption of limited penalties per round. A recursive expression has been given for the evaluation of the maximum total cumulative loss; this expression can be exploited both numerically and analytically. The optimal penalties are not binary in general; however, binary penalty schemes provide an excellent approximation to the optimal scheme, and, remarkably, the greedy binary strategy has been proved optimal among binary schemes for the adversary.
References
- Robbins, H. (1952) Some Aspects of the Sequential Design of Experiments. Bulletin of the American Mathematical Society, 58, 527-535. http://dx.doi.org/10.1090/S0002-9904-1952-09620-8
- Freund, Y. and Schapire, R.E. (1997) A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55, 119-139. http://dx.doi.org/10.1006/jcss.1997.1504
- Auer, P., Cesa-Bianchi, N., Freund, Y. and Schapire, R.E. (2002) The Non-Stochastic Multi-Armed Bandit Problem. SIAM Journal on Computing, 32, 48-77. http://dx.doi.org/10.1137/S0097539701398375
- Allenberg-Neeman, C. and Neeman, B. (2004) Full Information Game with Gains and Losses. Algorithmic Learning Theory: 15th International Conference, 3244, 264-278.
- Dani, V., Hayes, T.P. and Kakade, S.M. (2008) The Price of Bandit Information for Online Optimization. In: Platt, J.C., Koller, D., Singer, Y. and Roweis, S., Eds., Advances in Neural Information Processing Systems, MIT Press, Cambridge, 345-352.
- Bartlett, P., Dani, V., Hayes, T., Kakade, S., Rakhlin, A. and Tewari, A. (2008) High-Probability Regret Bounds for Bandit Online Linear Optimization. Proceedings of 22nd Annual Conference on Learning Theory (COLT), Helsinki.
- Cesa-Bianchi, N. and Lugosi, G. (2012) Combinatorial Bandits. Journal of Computer and System Sciences, 78, 1404- 1422. http://dx.doi.org/10.1016/j.jcss.2012.01.001
- Uchiya, T., Nakamura, A. and Kudo, M. (2010) Algorithms for Adversarial Bandit Problems with Multiple Plays. In: Hutter, M., Stephan, F., Vovk, V. and Zeugmann, T., Eds., Algorithmic Learning Theory, Lecture Notes in Artificial Intelligence No. 6331, Springer, 375-389.
- Auer, P., Cesa-Bianchi, N., Freund, Y. and Schapire, R.E. (1995) Gambling in a Rigged Casino: The Adversarial Multi-Armed Bandit Problem. Proceedings of 36th Annual Symposium on Foundations of Computer Science, Milwaukee, 322-331.
- Hochbaum, D.S. (1995) Approximation Algorithms for NP-Hard Problems. PWS Publishing Company, Boston.
- He, D., Chen, W., Wang, L. and Liu, T.-Y. (2013) Online Learning for Auction Mechanism in Bandit Setting. Decision Support Systems, 56, 379-386. http://dx.doi.org/10.1016/j.dss.2013.07.004
- Park, C. and Lee, J. (2012) Intelligent Traffic Control Based on Multi-Armed Bandit and Wireless Scheduling Techniques. International Conference on Advances in Vehicular System, Technologies and Applications, Venice, 23-27.
- Bertsekas, D.P. (1998) Network Optimization. Athena Scientific, Belmont.
- Blum, A. and Burch, C. (2000) On-Line Learning and the Metrical Task System Problem. Machine Learning, 39, 35- 88. http://dx.doi.org/10.1023/A:1007621832648
- Cole, S.J. and Lim, C. (2008) Algorithms for Network Interdiction and Fortification Games. Springer Optimization and Its Applications, 17, 609-644. http://dx.doi.org/10.1007/978-0-387-77247-9_24
- Vaněk, O., Jakob, M. and Pěchouček, M. (2011) Using Agents to Improve International Maritime Transport Security. IEEE Intelligent Systems, 26, 90-95. http://dx.doi.org/10.1109/MIS.2011.23