nips nips2006 nips2006-146 knowledge-graph by maker-knowledge-mining

146 nips-2006-No-regret Algorithms for Online Convex Programs


Source: pdf

Author: Geoffrey J. Gordon

Abstract: Online convex programming has recently emerged as a powerful primitive for designing machine learning algorithms. For example, OCP can be used for learning a linear classifier, dynamically rebalancing a binary search tree, finding the shortest path in a graph with unknown edge lengths, solving a structured classification problem, or finding a good strategy in an extensive-form game. Several researchers have designed no-regret algorithms for OCP. But, compared to algorithms for special cases of OCP such as learning from expert advice, these algorithms are not very numerous or flexible. In learning from expert advice, one tool which has proved particularly valuable is the correspondence between no-regret algorithms and convex potential functions: by reasoning about these potential functions, researchers have designed algorithms with a wide variety of useful guarantees such as good performance when the target hypothesis is sparse. Until now, there has been no such recipe for the more general OCP problem, and therefore no ability to tune OCP algorithms to take advantage of properties of the problem or data. In this paper we derive a new class of no-regret learning algorithms for OCP. These Lagrangian Hedging algorithms are based on a general class of potential functions, and are a direct generalization of known learning rules like weighted majority and external-regret matching. In addition to proving regret bounds, we demonstrate our algorithms learning to play one-card poker. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 These Lagrangian Hedging algorithms are based on a general class of potential functions, and are a direct generalization of known learning rules like weighted majority and external-regret matching. [sent-11, score-0.282]

2 In addition to proving regret bounds, we demonstrate our algorithms learning to play one-card poker. [sent-12, score-0.385]

3 After we choose yt , the correct answer is revealed as a convex loss function ℓt (yt ). [sent-17, score-0.326]

4 Our total regret at time t is the difference between these two losses, with positive regret meaning that we would have preferred y to our actual plays: ρt(y) = Lt − Σ_{i=1}^{t−1} ℓi(y), and ρt = sup_{y∈Y} ρt(y). We assume that Y is a compact convex subset of Rd that has at least two elements. [sent-20, score-0.701]
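
For concreteness, here is a small Python sketch (ours, not the paper's) that computes this regret when the losses are linear, ℓi(y) = ci · y; the arrays of plays and cost vectors are assumed to be given.

import numpy as np

def regret_vs(plays, costs, y):
    """rho_t(y) for linear losses l_i(y) = c_i . y: our cumulative loss minus y's."""
    plays = np.asarray(plays)               # rows y_1, ..., y_{t-1}
    costs = np.asarray(costs)               # rows c_1, ..., c_{t-1}
    our_loss = np.sum(plays * costs)        # L_t = sum_i c_i . y_i
    return our_loss - costs.sum(axis=0) @ np.asarray(y)

def total_regret_simplex(plays, costs):
    """sup_y rho_t(y) when Y is the probability simplex: compare to the best single expert."""
    costs = np.asarray(costs)
    our_loss = np.sum(np.asarray(plays) * costs)
    return our_loss - costs.sum(axis=0).min()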

5 In a more general OCP, Y may have … (Footnote 1: many problems use loss functions of the form ℓt(yt) = ℓ(yt, yt^true), where ℓ is a fixed function such as squared error and yt^true is a target output.) [sent-22, score-0.44]

6 If ℓt is nonlinear but convex, we can substitute the derivative at the current prediction, ∂ℓt (yt ), for ct , and our regret bounds will still hold (see [1, p. [sent-33, score-0.456]

7 We will write C for the set of possible gradient vectors ct . [sent-35, score-0.148]
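
A hedged illustration of that substitution (squared error is only an example of a target-based loss; the names below are ours): when ℓt is convex but nonlinear, one plays against its gradient at the current prediction.

import numpy as np

def linearized_cost(grad_loss, y_t):
    """Use c_t = grad l_t(y_t) in place of the nonlinear convex loss l_t."""
    return grad_loss(y_t)

# Example: squared error against a (hypothetical) target vector y_true.
y_true = np.array([0.2, 0.5, 0.3])
y_t = np.array([0.4, 0.4, 0.2])
c_t = linearized_cost(lambda y: 2.0 * (y - y_true), y_t)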

8 2 Related Work: A large number of researchers have studied online prediction in general and OCP in particular. [sent-36, score-0.177]

9 The name “online convex programming” is due to Zinkevich [3], who gave a clever gradient-descent algorithm. [sent-38, score-0.146]

10 A similar algorithm and a weaker bound were presented somewhat earlier in [1]: that paper’s GGD algorithm, using potential function ℓ0(w) = k‖w‖², is equivalent to Zinkevich’s “lazy projection” with a fixed learning rate. [sent-39, score-0.183]

11 Compared to the above papers, the most important contribution of the current paper is its generality: no previous family of OCP algorithms can use as flexible a class of potential functions. [sent-41, score-0.197]

12 Well-known regret bounds for this problem are logarithmic in the number of experts (e. [sent-43, score-0.382]

13 , [5]); no previous bounds for general OCP algorithms are sublinear in the number of experts, but logarithmic bounds follow directly as a special case of our results [6, sec. [sent-45, score-0.266]

14 From the online prediction literature, the closest related work is that of Cesa-Bianchi and Lugosi [7], which follows in the tradition of an algorithm and proof by Blackwell [8]. [sent-51, score-0.174]

15 Given a potential function G, they present algorithms which keep G(Rt ) from growing too quickly. [sent-54, score-0.197]

16 4 can be thought of as the first generalization of well-known online learning results such as Cesa-Bianchi and Lugosi’s to online convex programming. [sent-59, score-0.345]

17 The main differences between the Cesa-Bianchi–Lugosi results and ours are the restrictions on their potential functions. [sent-60, score-0.158]

18 They write their potential function as G(u) = f(Φ(u)); they require Φ to be additive (that is, Φ(u) = Σi φi(ui) for one-dimensional functions φi), nonnegative, and twice differentiable, and they require f : R+ → R+ to be increasing, concave, and twice differentiable. [sent-61, score-0.225]

19 These restrictions rule out many of the potential functions used here, and in fact they rule out most online convex programming problems. [sent-62, score-0.456]

20 This work generalizes some of the theorems in [6] and provides a very simple and elegant proof technique for algorithms based on convex potential functions. [sent-69, score-0.333]

21 However, it does not consider the problem of defining appropriate potential functions for the feasible regions of OCPs (as discussed in Sec. [sent-70, score-0.198]

22 ȳt ← f(st); if ȳt · u > 0 then yt ← ȳt/(ȳt · u) else yt ← arbitrary element of Y fi; observe ct and compute st+1 from (1); end. [sent-76, score-0.999]

23 Regret Vectors: Lagrangian Hedging algorithms maintain their state in a regret vector st, defined by the recursion st+1 = st + (yt · ct)u − ct (1), with the base case s1 = 0. [sent-94, score-1.036]
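
Putting the pseudocode above and recursion (1) together, the following Python sketch shows one plausible reading of the LH loop for the simplex case, with u the all-ones vector; the subgradient f and the cost-observation callback are assumptions supplied by the caller, not code from the paper.

import numpy as np

def lh_loop(f, d, rounds, observe_cost):
    """Lagrangian Hedging on the probability simplex (a sketch, not the paper's code).
    f: subgradient of the potential F; observe_cost(y) returns the cost vector c_t."""
    u = np.ones(d)
    s = np.zeros(d)                        # base case s_1 = 0
    for _ in range(rounds):
        y_bar = f(s)
        if y_bar @ u > 0:
            y = y_bar / (y_bar @ u)        # normalize the subgradient onto the simplex
        else:
            y = u / d                      # any element of Y works; uniform is convenient
        c = observe_cost(y)                # environment reveals c_t
        s = s + (y @ c) * u - c            # recursion (1)
    return s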

24 ) The regret vector contains information about our actual losses and the gradients of our loss functions: from st we can find our regret versus any y as follows. [sent-97, score-0.935]

25 At each step it chooses its play based on the current regret vector st (Eq. [sent-106, score-0.623]

26 (1)) and a closed convex potential function F (s) : Rd → R with subgradient f (s) : Rd → Rd . [sent-107, score-0.269]

27 This potential function is what distinguishes one instance of the LH algorithm from another. [sent-108, score-0.158]

28 F (s) should be small when s is in the safe set, and large when s is far from the safe set. [sent-109, score-0.192]

29 (This choice of Y would be appropriate for playing a matrix game or predicting from expert advice. [sent-111, score-0.209]

30 ) For this Y, two possible potential functions are F1(s) = ln(Σi e^{η si}) − ln d and F2(s) = Σi [si]+²/2, where η > 0 is a learning rate and [s]+ = max(s, 0). [sent-112, score-0.262]

31 The potential F1 leads to the Hedge [5] and weighted majority [13] algorithms, while the potential F2 results in external-regret matching [14, Theorem B]. [sent-113, score-0.401]
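
A sketch of the two subgradients (our own code, under the formulas above): plugged into the loop sketched earlier, the first reproduces the exponentially weighted Hedge/weighted-majority play and the second reproduces external-regret matching.

import numpy as np

def grad_F1(s, eta):
    """Gradient of F1(s) = ln(sum_i exp(eta*s_i)) - ln d: eta times a softmax of eta*s."""
    z = np.exp(eta * (s - s.max()))        # shift for numerical stability
    return eta * z / z.sum()

def subgrad_F2(s):
    """Subgradient of F2(s) = sum_i [s_i]_+^2 / 2: the positive part [s]_+."""
    return np.maximum(s, 0.0)

# After normalizing by y_bar . u, grad_F1 gives weights proportional to exp(eta*s_i),
# and subgrad_F2 gives weights proportional to [s_i]_+.
s = np.array([1.0, -0.5, 0.2])
hedge_play = grad_F1(s, 0.5) / grad_F1(s, 0.5).sum()
regret_matching_play = subgrad_F2(s) / subgrad_F2(s).sum()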

32 For more examples of useful potential functions, see [6]. [sent-114, score-0.158]

33 To ensure the LH algorithm chooses legal hypotheses yt ∈ Y, we require the following (note the constant 0 is arbitrary; any other k would work as well): F(s) ≤ 0 for all s ∈ S. (4) Theorem 1: The LH algorithm is well-defined: define S as in (2) and fix a finite convex potential function F. [sent-115, score-0.512]

34 If F (s) ≤ 0 for all s ∈ S, then the LH algorithm picks hypotheses yt ∈ Y for all t. [sent-116, score-0.243]

35 ) We can also define a version of the LH algorithm with an adjustable learning rate: replacing F (s) with F (ηs) is equivalent to updating st with learning rate η. [sent-118, score-0.309]
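
In code this is just a wrapper around the subgradient; a one-line sketch using the names introduced above:

def with_learning_rate(f, eta):
    """Play as if the potential were G(s) = F(eta*s); the subgradient of G is
    eta*f(eta*s), and the scalar eta cancels in the normalization step."""
    return lambda s: f(eta * s)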

36 Adjustable learning rates will help us obtain regret bounds for some classes of potentials. [sent-119, score-0.382]

37 5 The Optimization Form: Even if we have a convenient representation of our hypothesis space Y, it may not be easy to work directly with the safe set S. [sent-120, score-0.209]

38 In particular, it may be difficult to define, evaluate, and differentiate a potential function F which has the necessary properties. [sent-121, score-0.158]

39 This form, called the optimization form, defines F in terms of a simpler function W which we will call the hedging function. [sent-123, score-0.309]

40 So, these hedging functions result in the weighted majority and external-regret matching algorithms. [sent-128, score-0.408]

41 For an example where the hedging function is easy to write analytically but the potential function is much more complicated, see Sec. [sent-129, score-0.468]

42 The optimization form of the LH algorithm using hedging function W is defined to be equivalent to the gradient form using F(s) = sup_{ȳ ∈ Ȳ} (s · ȳ − W(ȳ)). (7) Here Ȳ is defined as in (3). [sent-131, score-0.356]

43 Substituting this value for ȳ back into (7) and using the fact that s · [s]+ = [s]+ · [s]+, the resulting potential function is F2(s) = s · [s]+ − Σi [si]+²/2 = Σi [si]+²/2, as claimed above. [sent-144, score-0.158]
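
As a quick numerical sanity check (our own, for the simplex case where Ȳ is the nonnegative orthant), the supremum in (7) with W2(ȳ) = Σi ȳi²/2 is indeed attained at ȳ = [s]+ and equals Σi [si]+²/2:

import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(size=5)

def objective(y_bar):
    return s @ y_bar - 0.5 * np.sum(y_bar ** 2)     # s . y_bar - W2(y_bar)

y_star = np.maximum(s, 0.0)                          # claimed maximizer [s]_+
F2_value = 0.5 * np.sum(y_star ** 2)                 # claimed potential value

# No random point in the nonnegative orthant should beat the claimed maximizer.
random_points = rng.random((10000, 5)) * 3.0
assert all(objective(y) <= objective(y_star) + 1e-9 for y in random_points)
assert abs(objective(y_star) - F2_value) < 1e-12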

44 This potential function is the standard one for analyzing external-regret matching. [sent-145, score-0.158]

45 And, the optimization form of the LH algorithm using the hedging function W is equivalent to the gradient form of the LH algorithm with potential function F . [sent-150, score-0.514]

46 6 Theoretical Results: Our main theoretical results are regret bounds for the LH algorithm. [sent-155, score-0.382]

47 The bounds depend on the curvature of our potential F , the size of the hypothesis set Y, and the possible slopes C of our loss functions. [sent-156, score-0.388]

48 Intuitively, F must be neither too curved nor too flat on the scale of the updates to st from Eq. [sent-157, score-0.309]

49 (1): if F is too curved then ∂F will change too quickly and our hypothesis yt will jump around a lot, while if F is too flat then we will not react quickly enough to changes in regret. [sent-158, score-0.33]

50 For the optimization form, essentially the same results hold, but the constants are defined in terms of the hedging function instead. [sent-160, score-0.309]

51 Therefore, we never need to work with (or even be able to write down) the corresponding potential function. [sent-161, score-0.185]

52 We will assume F(s + Δ) ≤ F(s) + Δ · f(s) + C‖Δ‖² (9) for all regret vectors s and increments Δ, and [F(s) + A]+ ≥ inf_{s′ ∈ S} B‖s − s′‖^p (10) for all s. [sent-166, score-0.295]

53 (9), together with the convexity of F , implies that F is differentiable and f is its gradient; the LH algorithm is applicable if F is not differentiable, but its regret bounds are weaker. [sent-169, score-0.41]

54 We have already bounded Y; rather than bounding C and u separately, we will assume that there is a constant D so that E(‖st+1 − st‖² | st) ≤ D. (12) Here the expectation is taken with respect to our choice of hypothesis, so the inequality must hold for all possible values of ct. [sent-175, score-0.628]

55 (The expectation is only necessary if we randomize our choice of hypothesis, as would happen if Y is the convex hull of some non-convex set. [sent-176, score-0.146]

56 ) Our theorem then bounds our regret in terms of the above constants. [sent-178, score-0.414]

57 Since the bounds are sublinear in t, they show that Lagrangian Hedging is a no-regret algorithm when we choose an appropriate potential F . [sent-179, score-0.298]

58 Theorem 3 Suppose the potential function F is convex and satisfies Eqs. [sent-180, score-0.269]

59 2) achieves expected regret E(ρt+1(y)) ≤ M((tCD + A)/B)^{1/p} = O(t^{1/p}) versus any hypothesis y ∈ Y. [sent-184, score-0.408]

60 Define G(s) = F(ηs), where η = √(A/(tCD)). Then the LH algorithm with potential G achieves regret E(ρt+1(y)) ≤ (2M/B)√(tACD) = O(√t) for any hypothesis y ∈ Y. [sent-187, score-0.566]
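
A trivial numeric sketch of the tuned rate and bound; the constants below are hypothetical placeholders, not values from the paper:

import math

A, B, C, D, M = 1.0, 1.0, 0.5, 2.0, 1.0     # hypothetical problem constants
t = 10_000                                   # number of rounds

eta = math.sqrt(A / (t * C * D))             # tuned learning rate
bound = (2 * M / B) * math.sqrt(t * A * C * D)
print(f"eta = {eta:.3g}, regret bound = {bound:.3g}, per-round = {bound / t:.3g}")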

61 4 shows that, if we can guarantee E(st+1 − st) · ∂F(st) ≤ 0, then F(st) cannot grow too quickly. [sent-191, score-0.277]

62 This result is analogous to Blackwell’s approachability theorem: since the level sets of F are related to S, we will be able to show st /t → S, implying no regret. [sent-192, score-0.277]

63 Write st = Σi xi, and let D be a constant so that E(‖xt‖² | st) ≤ D. [sent-198, score-0.587]

64 Suppose that, for all t, E(xt · f (st ) | st ) ≤ 0. [sent-199, score-0.277]

65 ) Then: F(st+1) = F(st + xt) ≤ F(st) + xt · f(st) + C‖xt‖², so E(F(st+1) | st) ≤ F(st) + CD, E(F(st+1) | s1) ≤ E(F(st) | s1) + CD, and E(F(st+1) | s1) ≤ F(s1) + (t − 1)CD + CD, which is the desired result. [sent-202, score-0.31]

66 7 Examples: The classical applications of no-regret algorithms are learning from expert advice and learning to play a repeated matrix game. [sent-203, score-0.268]

67 These two tasks are essentially equivalent, since they both use the probability simplex Y = {y | y ≥ 0, Σi yi = 1} for their hypothesis set. [sent-204, score-0.215]

68 A large variety of other online prediction problems can also be cast in our framework. [sent-206, score-0.149]

69 These problems include path planning when costs are chosen by an adversary [11], planning in a Markov decision process when costs are chosen by an adversary [15], online pruning of a decision tree [16], and online balancing of a binary search tree [4]. [sent-207, score-0.539]

70 More uses of online convex programming are given in [1, 3, 4]. [sent-208, score-0.258]

71 In each case the bounds for the LH algorithm will be polynomial or better in the dimensionality of the appropriate hypothesis set and sublinear in the number of trials. [sent-209, score-0.253]

72 8 Experiments: To demonstrate that our theoretical bounds translate to good practical performance, we implemented the LH algorithm with the potential function W2 from (6) and used it to learn policies for the game of one-card poker. [sent-210, score-0.345]

73 (The hypothesis space for this learning problem is the set of sequence weight vectors, which is convex because one-card poker is an extensive-form game [17]. [sent-211, score-0.445]

74 ) In one-card poker, two players (called the gambler and the dealer) each ante $1 and receive one card from a 13-card deck. [sent-212, score-0.351]

75 The gambler bets first, adding either $0 or $1 to the pot. [sent-213, score-0.283]

76 Then the dealer gets a chance to bet, again either $0 or $1. [sent-214, score-0.311]

77 Finally, if the gambler bet $0 and the dealer bet $1, the gambler gets a second chance to bring her bet up to $1. [sent-215, score-1.16]

78 If either player bets $0 when the other has already bet $1, that player folds and loses her ante. [sent-216, score-0.339]

79 If neither player folds, the higher card wins the pot, resulting in a net gain of either $1 or $2 (equal to the other player’s ante plus the bet of $0 or $1). [sent-217, score-0.267]
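
To make the betting rules concrete, here is a short Python sketch of the payoff of a single hand from the gambler's point of view; the encoding of actions is our assumption, not the paper's.

def one_card_poker_payoff(gambler_card, dealer_card,
                          gambler_bet, dealer_bet, gambler_calls=False):
    """Net dollars won by the gambler (the dealer wins the negative of this).
    Cards are distinct integers 1..13; bets are 0 or 1; gambler_calls is only
    consulted when the gambler bet 0 and the dealer then bet 1."""
    ante = 1
    if gambler_bet == 1 and dealer_bet == 0:
        return ante                              # dealer folds, loses her ante
    if gambler_bet == 0 and dealer_bet == 1 and not gambler_calls:
        return -ante                             # gambler folds, loses her ante
    stake = ante + max(gambler_bet, dealer_bet)  # $1 or $2 changes hands at showdown
    return stake if gambler_card > dealer_card else -stake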

80 In contrast to the usual practice in poker we assume that the payoff vector ct is observable after each hand; the partially-observable extension is beyond the scope of this paper. [sent-218, score-0.28]

81 One-card poker is a simple game; nonetheless it has many of the elements of more complicated games, including incomplete information, chance events, and multiple stages. [sent-219, score-0.149]

82 Fig. 3 shows the results of two typical runs: in both panels the dealer is using our no-regret algorithm. [sent-244, score-0.283]

83 In the left panel the gambler is also using our no-regret algorithm, while in the right panel the gambler is playing a fixed policy. [sent-245, score-0.584]

84 The x-axis shows number of hands played; the y-axis shows the average payoff per hand from the dealer to the gambler. [sent-246, score-0.4]

85 The middle solid curve shows the actual performance of the dealer (who is trying to minimize the payoff). [sent-249, score-0.31]

86 The upper curve measures the progress of the dealer’s learning: after every fifth hand we extracted a strategy yt^avg by taking the average of our algorithm’s predictions so far. [sent-250, score-0.318]

87 We then plotted the worst-case value of yt^avg. [sent-251, score-0.397]

88 That is, we plotted the payoff for playing yt^avg against an opponent which knows yt^avg and is optimized to maximize the dealer’s losses. [sent-252, score-0.625]
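
The averaged strategy can be maintained incrementally; a minimal sketch (class and variable names are ours):

import numpy as np

class AverageStrategy:
    """Running average of the plays y_1, ..., y_t, used to extract y_t^avg."""
    def __init__(self, dim):
        self.total = np.zeros(dim)
        self.count = 0

    def update(self, y_t):
        self.total += y_t
        self.count += 1

    def value(self):
        return self.total / max(self.count, 1)   # y_t^avg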

89 In the right panel, the dealer quickly learns to win against the non-adaptive gambler. [sent-254, score-0.283]

90 The dealer never plays a minimax strategy, as shown by the fact that the upper curve does not approach the value of the game. [sent-255, score-0.433]

91 In the left panel, the gambler adapts and forces the dealer to play more conservatively; in this case, the limiting strategies for both players are minimax. [sent-257, score-0.613]

92 The curves in Fig. 3 show an interesting effect: the small, damped oscillations result from the dealer and the gambler “chasing” each other around a minimax strategy. [sent-259, score-0.616]

93 One player will learn to exploit a weakness in the other, but in doing so will open up a weakness in her own play; then the second player will adapt to try to take advantage of the first, and the cycle will repeat. [sent-260, score-0.222]

94 Many learning algorithms will cycle so strongly that they fail to achieve the value of the game, but our regret bounds eliminate this possibility. [sent-263, score-0.421]

95 9 Discussion: We have presented the Lagrangian Hedging algorithms, a family of no-regret algorithms for OCP based on general potential functions. [sent-264, score-0.197]

96 We have proved regret bounds for LH algorithms and demonstrated experimentally that these bounds lead to good predictive performance in practice. [sent-265, score-0.508]

97 The regret bounds for LH algorithms have low-order dependences on d, the number of dimensions in the hypothesis set Y. [sent-266, score-0.534]

98 This low-order dependence means that the LH algorithms can learn well in prediction problems with complicated hypothesis sets; these problems would otherwise require an impractical amount of training data and computation time. [sent-267, score-0.184]

99 Our work builds on previous work in online learning and online convex programming. [sent-268, score-0.345]

100 Potential-based algorithms in on-line prediction and game theory. [sent-306, score-0.171]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lh', 0.423), ('regret', 0.295), ('dealer', 0.283), ('hedging', 0.283), ('st', 0.277), ('gambler', 0.243), ('ocp', 0.223), ('yt', 0.185), ('potential', 0.158), ('bet', 0.121), ('poker', 0.121), ('online', 0.117), ('hypothesis', 0.113), ('convex', 0.111), ('avg', 0.106), ('game', 0.1), ('safe', 0.096), ('minimax', 0.09), ('rd', 0.089), ('bounds', 0.087), ('payoff', 0.085), ('lugosi', 0.075), ('player', 0.074), ('ct', 0.074), ('simplex', 0.067), ('geoffrey', 0.064), ('lagrangian', 0.063), ('advice', 0.061), ('ocps', 0.061), ('tcd', 0.061), ('games', 0.059), ('majority', 0.059), ('hypotheses', 0.058), ('cd', 0.056), ('sublinear', 0.053), ('play', 0.051), ('expert', 0.051), ('gradient', 0.047), ('zinkevich', 0.045), ('pruning', 0.043), ('ante', 0.04), ('bets', 0.04), ('hedge', 0.04), ('helmbold', 0.04), ('orthant', 0.04), ('functions', 0.04), ('algorithms', 0.039), ('decision', 0.039), ('losses', 0.038), ('weakness', 0.037), ('players', 0.036), ('kalai', 0.035), ('randomize', 0.035), ('clever', 0.035), ('manfred', 0.035), ('yi', 0.035), ('tree', 0.035), ('panel', 0.033), ('xt', 0.033), ('plays', 0.033), ('carnegie', 0.032), ('mellon', 0.032), ('lt', 0.032), ('playing', 0.032), ('adjustable', 0.032), ('opponent', 0.032), ('hands', 0.032), ('curved', 0.032), ('card', 0.032), ('blackwell', 0.032), ('twentieth', 0.032), ('ln', 0.032), ('prediction', 0.032), ('theorem', 0.032), ('suppose', 0.031), ('programming', 0.03), ('pseudocode', 0.03), ('yoav', 0.03), ('littlestone', 0.03), ('folds', 0.03), ('loss', 0.03), ('planning', 0.029), ('differentiable', 0.028), ('adversary', 0.028), ('researchers', 0.028), ('chance', 0.028), ('cone', 0.027), ('curve', 0.027), ('write', 0.027), ('robert', 0.026), ('potentials', 0.026), ('colt', 0.026), ('predicting', 0.026), ('optimization', 0.026), ('gordon', 0.026), ('si', 0.026), ('weighted', 0.026), ('bound', 0.025), ('proof', 0.025), ('generality', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 146 nips-2006-No-regret Algorithms for Online Convex Programs

Author: Geoffrey J. Gordon

2 0.23666297 10 nips-2006-A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems

Author: David Barber, Bertrand Mesot

Abstract: We introduce a method for approximate smoothed inference in a class of switching linear dynamical systems, based on a novel form of Gaussian Sum smoother. This class includes the switching Kalman Filter and the more general case of switch transitions dependent on the continuous latent state. The method improves on the standard Kim smoothing approach by dispensing with one of the key approximations, thus making fuller use of the available future information. Whilst the only central assumption required is projection to a mixture of Gaussians, we show that an additional conditional independence assumption results in a simpler but stable and accurate alternative. Unlike the alternative unstable Expectation Propagation procedure, our method consists only of a single forward and backward pass and is reminiscent of the standard smoothing ‘correction’ recursions in the simpler linear dynamical system. The algorithm performs well on both toy experiments and in a large scale application to noise robust speech recognition.

3 0.20621702 26 nips-2006-An Approach to Bounded Rationality

Author: Eli Ben-sasson, Ehud Kalai, Adam Kalai

Abstract: A central question in game theory and artificial intelligence is how a rational agent should behave in a complex environment, given that it cannot perform unbounded computations. We study strategic aspects of this question by formulating a simple model of a game with additional costs (computational or otherwise) for each strategy. First we connect this to zero-sum games, proving a counter-intuitive generalization of the classic min-max theorem to zero-sum games with the addition of strategy costs. We then show that potential games with strategy costs remain potential games. Both zero-sum and potential games with strategy costs maintain a very appealing property: simple learning dynamics converge to equilibrium. 1 The Approach and Basic Model How should an intelligent agent play a complicated game like chess, given that it does not have unlimited time to think? This question reflects one fundamental aspect of “bounded rationality,” a term coined by Herbert Simon [1]. However, bounded rationality has proven to be a slippery concept to formalize (prior work has focused largely on finite automata playing simple repeated games such as prisoner’s dilemma, e.g. [2, 3, 4, 5]). This paper focuses on the strategic aspects of decisionmaking in complex multi-agent environments, i.e., on how a player should choose among strategies of varying complexity, given that its opponents are making similar decisions. Our model applies to general strategic games and allows for a variety of complexities that arise in real-world applications. For this reason, it is applicable to one-shot games, to extensive games, and to repeated games, and it generalizes existing models such as repeated games played by finite automata. To easily see that bounded rationality can drastically affect the outcome of a game, consider the following factoring game. Player 1 chooses an n-bit number and sends it to Player 2, who attempts to find its prime factorization. If Player 2 is correct, he is paid 1 by Player 1, otherwise he pays 1 to Player 1. Ignoring complexity costs, the game is a trivial win for Player 2. However, for large n, the game should is essentially a win for Player 1, who can easily output a large random number that Player 2 cannot factor (under appropriate complexity assumptions). In general, the outcome of a game (even a zero-sum game like chess) with bounded rationality is not so clear. To concretely model such games, we consider a set of available strategies along with strategy costs. Consider an example of two players preparing to play a computerized chess game for $100K prize. Suppose the players simultaneously choose among two available options: to use a $10K program A or an advanced program B, which costs $50K. We refer to the row chooser as white and to the column chooser as black, with the corresponding advantages reflected by the win probabilities of white described in Table 1a. For example, when both players use program A, white wins 55% of the time and black wins 45% of the time (we ignore draws). The players naturally want to choose strategies to maximize their expected net payoffs, i.e., their expected payoff minus their cost. Each cell in Table 1b contains a pair of payoffs in units of thousands of dollars; the first is white’s net expected payoff and the second is black’s. a) A B A 55% 93% B 13% 51% b) A (-10) B (-50) A (-10) 45, 35 43,-3 B (-50) 3, 37 1,-1 Figure 1: a) Table of first-player winning probabilities based on program choices. 
b) Table of expected net earnings in thousands of dollars. The unique equilibrium is (A,B) which strongly favors the second player. A surprising property is evident in the above game. Everything about the game seems to favor white. Yet due to the (symmetric) costs, at the unique Nash equilibrium (A,B) of Table 1b, black wins 87% of the time and nets $34K more than white. In fact, it is a dominant strategy for white to play A and for black to play B. To see this, note that playing B increases white’s probability of winning by 38%, independent of what black chooses. Since the pot is $100K, this is worth $38K in expectation, but B costs $40K more than A. On the other hand, black enjoys a 42% increase in probability of winning due to B, independent of what white does, and hence is willing to pay the extra $40K. Before formulating the general model, we comment on some important aspects of the chess example. First, traditional game theory states that chess can be solved in “only” two rounds of elimination of dominated strategies [10], and the outcome with optimal play should always be the same: either a win for white or a win for black. This theoretical prediction fails in practice: in top play, the outcome is very nondeterministic with white winning roughly twice as often as black. The game is too large and complex to be solved by brute force. Second, we have been able to analyze the above chess program selection example exactly because we formulated as a game with a small number of available strategies per player. Another formulation that would fit into our model would be to include all strategies of chess, with some reasonable computational costs. However, it is beyond our means to analyze such a large game. Third, in the example above we used monetary software cost to illustrate a type of strategy cost. But the same analysis could accommodate many other types of costs that can be measured numerically and subtracted from the payoffs, such as time or effort involved in the development or execution of a strategy, and other resource costs. Additional examples in this paper include the number of states in a finite automaton, the number of gates in a circuit, and the number of turns on a commuter’s route. Our analysis is limited, however, to cost functions that depend only on the strategy of the player and not the strategy chosen by its opponent. For example, if our players above were renting computers A or B and paying for the time of actual usage, then the cost of using A would depend on the choice of computer made by the opponent. Generalizing the example above, we consider a normal form game with the addition of strategy costs, a player-dependent cost for playing each available strategy. Our main results regard two important classes of games: constant-sum and potential games. Potential games with strategy costs remain potential games. While two-person constant-sum games are no longer constant, we give a basic structural description of optimal play in these games. Lastly, we show that known learning dynamics converge in both classes of games. 2 Definition of strategy costs We first define an N -person normal-form game G = (N, S, p) consisting of finite sets of (available) pure strategies S = (S1 , . . . , SN ) for the N players, and a payoff function p : S1 × . . . × SN → RN . Players simultaneously choose strategies si ∈ Si after which player i is rewarded with pi (s1 , . . . , sN ). 
A randomized or mixed strategy σi for player i is a probability distribution over its pure strategies Si, σi ∈ ∆i = { x ∈ R^{|Si|} : Σj xj = 1, xj ≥ 0 }. We extend p to ∆1 × . . . × ∆N in the natural way, i.e., pi(σ1, . . . , σN) = E[pi(s1, . . . , sN)] where each si is drawn from σi, independently. Denote s−i = (s1, s2, . . . , si−1, si+1, . . . , sN), and similarly for σ−i. A best response by player i to σ−i is σi ∈ ∆i such that pi(σi, σ−i) = max_{σi′ ∈ ∆i} pi(σi′, σ−i). A (mixed strategy) Nash equilibrium of G is a vector of strategies (σ1, . . . , σN) ∈ ∆1 × . . . × ∆N such that each σi is a best response to σ−i. We now define G−c, the game G with strategy costs c = (c1, . . . , cN), where ci : Si → R. It is simply an N-person normal-form game G−c = (N, S, p−c) with the same sets of pure strategies as G, but with a new payoff function p−c : S1 × . . . × SN → RN where p^{-c}_i(s1, . . . , sN) = pi(s1, . . . , sN) − ci(si), for i = 1, . . . , N. We similarly extend ci to ∆i in the natural way. 3 Two-person constant-sum games with strategy costs Recall that a game is constant-sum (k-sum for short) if at every combination of individual strategies, the players' payoffs sum to some constant k. Two-person k-sum games have some important properties, not shared by general-sum games, which result in more effective game-theoretic analysis. In particular, every k-sum game has a unique value v ∈ R. A mixed strategy for player 1 is called optimal if it guarantees payoff ≥ v against any strategy of player 2. A mixed strategy for player 2 is optimal if it guarantees ≥ k − v against any strategy of player 1. The term optimal is used because optimal strategies guarantee as much as possible (v + k − v = k) and playing anything that is not optimal can result in a lesser payoff, if the opponent responds appropriately. (This fact is easily illustrated in the game rock-paper-scissors: randomizing uniformly among the strategies guarantees each player 50% of the pot, while playing anything other than uniformly random enables the opponent to win strictly more often.) The existence of optimal strategies for both players follows from the min-max theorem. An easy corollary is that the Nash equilibria of a k-sum game are exchangeable: they are simply the cross-product of the sets of optimal mixed strategies for both players. Lastly, it is well known that equilibria in two-person k-sum games can be learned in repeated play by simple dynamics that are guaranteed to converge [17]. With the addition of strategy costs, a k-sum game is no longer k-sum and hence it is not clear, at first, what optimal strategies there are, if any. (Many examples of general-sum games do not have optimal strategies.) We show the following generalization of the above properties for zero-sum games with strategy costs. Theorem 1. Let G be a finite two-person k-sum game and G−c be the game with strategy costs c = (c1, c2). 1. There is a value v ∈ R for G−c and nonempty sets OPT1 and OPT2 of optimal mixed strategies for the two players. OPT1 is the set of strategies that guarantee player 1 payoff ≥ v − c2(σ2) against any strategy σ2 chosen by player 2. Similarly, OPT2 is the set of strategies that guarantee player 2 payoff ≥ k − v − c1(σ1) against any σ1. 2. The Nash equilibria of G−c are exchangeable: the set of Nash equilibria is OPT1 × OPT2. 3. The set of net payoffs possible at equilibrium is an axis-parallel rectangle in R2.
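For two players, the definition of G−c amounts to subtracting each player's own cost from its payoff matrix; a minimal sketch (the helper names are ours, for illustration only):

```python
import numpy as np

def cost_game(P1, P2, c1, c2):
    """Payoff matrices of G^{-c}: subtract each player's own strategy cost.
    P1[i, j], P2[i, j]: payoffs of players 1 and 2 at the pure profile (i, j);
    c1[i]: player 1's cost of strategy i; c2[j]: player 2's cost of strategy j."""
    return P1 - np.asarray(c1)[:, None], P2 - np.asarray(c2)[None, :]

def mixed_payoff(P, x, y):
    """Expected payoff under mixed strategies x (over rows) and y (over columns)."""
    return x @ P @ y

# Costs extend to mixed strategies linearly, e.g. c1(sigma1) = np.dot(x, c1) for player 1.
```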
For zero-sum games, the term optimal strategy was natural: the players could guarantee v and k − v, respectively, and this is all that there was to share. Moreover, it is easy to see that only pairs of optimal strategies can have the Nash equilibria property, being best responses to each other. In the case of zero-sum games with strategy costs, the optimal structure is somewhat counterintuitive. First, it is strange that the amount guaranteed by either player depends on the cost of the other player's action, when in reality each player pays the cost of its own action. Second, it is not even clear why we call these optimal strategies. To get a feel for this latter issue, notice that the sum of the net payoffs to the two players is always k − c1(σ1) − c2(σ2), which is exactly the total of what optimal strategies guarantee, v − c2(σ2) + k − v − c1(σ1). Hence, if both players play what we call optimal strategies, then neither player can improve and they are at Nash equilibrium. On the other hand, suppose player 1 selects a strategy σ1 that does not guarantee him payoff at least v − c2(σ2). This means that there is some response σ2 by player 2 for which player 1's payoff is < v − c2(σ2) and hence player 2's payoff is > k − v − c1(σ1). Thus player 2's best response to σ1 must give player 2 payoff > k − v − c1(σ1) and leave player 1 with < v − c2(σ2). The proof of the theorem (the above reasoning only implies part 2 from part 1) is based on the following simple observation. Consider the k-sum game H = (N, S, q) with the following payoffs: q1(s1, s2) = p1(s1, s2) − c1(s1) + c2(s2) = p^{-c}_1(s1, s2) + c2(s2) and q2(s1, s2) = p2(s1, s2) − c2(s2) + c1(s1) = p^{-c}_2(s1, s2) + c1(s1). That is to say, Player 1 pays its strategy cost to Player 2 and vice versa. It is easy to verify that, for all σ1, σ1′ ∈ ∆1 and σ2 ∈ ∆2, q1(σ1, σ2) − q1(σ1′, σ2) = p^{-c}_1(σ1, σ2) − p^{-c}_1(σ1′, σ2). (1) This means that the relative advantage of switching strategies is the same in the games G−c and H. In particular, σ1 is a best response to σ2 in G−c if and only if it is in H. A similar equality holds for player 2's payoffs. Note that these conditions imply that the games G−c and H are strategically equivalent in the sense defined by Moulin and Vial [16]. Proof of Theorem 1. Let v be the value of the game H. For any strategy σ1 that guarantees player 1 payoff ≥ v in H, σ1 guarantees player 1 ≥ v − c2(σ2) in G−c. This follows from the definition of H. Similarly, any strategy σ2 that guarantees player 2 payoff ≥ k − v in H will guarantee ≥ k − v − c1(σ1) in G−c. Thus the sets OPT1 and OPT2 are non-empty. Since v − c2(σ2) + k − v − c1(σ1) = k − c1(σ1) − c2(σ2) is the sum of the payoffs in G−c, nothing greater can be guaranteed by either player. Since the best responses of G−c and H are the same, the Nash equilibria of the two games are the same. Since H is a k-sum game, its Nash equilibria are exchangeable, and thus we have part 2. (This holds for any game that is strategically equivalent to k-sum.) Finally, the optimal mixed strategies OPT1, OPT2 of any k-sum game are convex sets. If we look at the achievable costs of the mixed strategies in OPTi, by the definition of the cost of a mixed strategy, this will be a convex subset of R, i.e., an interval. By parts 1 and 2, the set of achievable net payoffs at equilibria of G−c is therefore the cross-product of intervals. To illustrate Theorem 1 graphically, Figure 2 gives a 4 × 4 example with costs of 1, 2, 3, and 4, respectively.
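The proof suggests a direct recipe for computing the quantities in Theorem 1: build the k-sum game H and solve the standard linear program for its value and an optimal strategy. A minimal sketch on the chess example of Figure 1 (illustrative Python using scipy; the function name is ours):

```python
import numpy as np
from scipy.optimize import linprog

# Gross payoffs of white in the 100-sum chess game (rows: white's program, columns: black's),
# and the program costs (identical for both players in this example).
p1 = np.array([[55.0, 13.0], [93.0, 51.0]])
c = np.array([10.0, 50.0])

# The auxiliary game H: each player "pays" its strategy cost to the opponent.
q1 = p1 - c[:, None] + c[None, :]          # q1(s1, s2) = p1(s1, s2) - c1(s1) + c2(s2)

def ksum_value(Q):
    """Maximin value and an optimal mixed strategy for the row player of a k-sum game."""
    m, n = Q.shape
    obj = np.concatenate([np.zeros(m), [-1.0]])          # maximize v
    A_ub = np.hstack([-Q.T, np.ones((n, 1))])            # v <= x^T Q[:, j] for every column j
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
    res = linprog(obj, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[-1], res.x[:m]

v, x_opt = ksum_value(q1)
print(v, x_opt)   # v = 53, and white's optimal strategy puts all its mass on A
# Theorem 1: white can guarantee v - c2(sigma2) in G^{-c}; against black playing B
# this is 53 - 50 = 3, matching white's equilibrium net payoff in Figure 1b.
```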
Figure 2 illustrates a situation with multiple optimal strategies. Notice that player 1 is completely indifferent between its optimal choices A and B, and player 2 is completely indifferent between C and D. Thus the only question is how kind they would like to be to their opponent. The (A,C) equilibrium is perhaps most natural as it yields the highest payoffs for both parties. Note that the proof of the above theorem actually shows that zero-sum games with costs share additional appealing properties of zero-sum games. For example, computing optimal strategies is a polynomial-time computation in an n × n game, as it amounts to computing the equilibria of H. We next show that they also have appealing learning properties, though they do not share all properties of zero-sum games.1 3.1 Learning in repeated two-person k-sum games with strategy costs Another desirable property of k-sum games is that, in repeated play, natural learning dynamics converge to the set of Nash equilibria. Before we state the analogous conditions for k-sum games with costs, we briefly give a few definitions. A repeated game is one in which players choose a sequence of strategy vectors s^1, s^2, . . ., where each s^t = (s^t_1, . . . , s^t_N) is a strategy vector of some fixed stage game G = (N, S, p). Under perfect monitoring, when selecting an action in any period the players know all the previously selected actions. As we shall discuss, it is possible to learn to play without perfect monitoring as well. 1 One property that is violated by the chess example is the "advantage of an advantage" property. Say Player 1 has the advantage over Player 2 in a square game if p1(s1, s2) ≥ p2(s2, s1) for all strategies s1, s2. At equilibrium of a k-sum game, a player with the advantage must have a payoff at least as large as its opponent. This is no longer the case after incorporating strategy costs, as seen in the chess example, where Player 1 has the advantage (even including strategy costs), yet his equilibrium payoff is smaller than Player 2's.

a) (rows: player 1's strategy; columns: player 2's strategy)
         A            B            C            D
A      6, 4         5, 5         3, 7         2, 8
B      7, 3         6, 4         4, 6         3, 7
C      7.5, 2.5     6.5, 3.5     4.5, 5.5     3.5, 6.5
D      8.5, 1.5     7, 3         5.5, 4.5     4.5, 5.5

b)
            A (-1)       B (-2)       C (-3)       D (-4)
A (-1)      5, 3         4, 3         2, 4         1, 4
B (-2)      5, 2         4, 2         2, 3         1, 3
C (-3)      4.5, 1.5     3.5, 1.5     1.5, 2.5     0.5, 2.5
D (-4)      4.5, 0.5     3, 1         1.5, 1.5     0.5, 1.5

c) [Scatter of the net-payoff pairs for the strategy profiles; horizontal axis: player 1 net payoff, vertical axis: player 2 net payoff, with the equilibrium region shaded.]

Figure 2: a) Payoffs in 10-sum game G. b) Expected net earnings in G−c. OPT1 is any mixture of A and B, and OPT2 is any mixture of C and D. Each player's choice of equilibrium strategy affects only the opponent's net payoff. c) A graphical display of the payoff pairs. The shaded region shows the rectangular set of payoffs achievable at mixed strategy Nash equilibria. Perhaps the most intuitive dynamics are best-response: at each stage, each player selects a best response to the opponent's previous stage play. Unfortunately, these naive dynamics fail to converge to equilibrium in very simple examples. The fictitious play dynamics prescribe, at stage t, selecting any strategy that is a best response to the empirical distribution of the opponent's play during the first t − 1 stages. It has been shown that fictitious play converges to equilibrium (of the stage game G) in k-sum games [17]. However, fictitious play requires perfect monitoring. One can learn to play a two-person k-sum game with no knowledge of the payoff table or anything about the other players' actions.
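Fictitious play on the cost-adjusted game of Figure 2b can be simulated in a few lines; as Theorem 2 below asserts, the empirical play of player 1 stays inside {A, B} and that of player 2 inside {C, D}. A minimal sketch (illustrative Python; the uniform prior used to initialize the empirical counts is our choice):

```python
import numpy as np

# Net payoffs of G^{-c} from Figure 2b (rows: player 1's strategy A..D, columns: player 2's).
P1 = np.array([[5. , 4. , 2. , 1. ],
               [5. , 4. , 2. , 1. ],
               [4.5, 3.5, 1.5, 0.5],
               [4.5, 3. , 1.5, 0.5]])
P2 = np.array([[3. , 3. , 4. , 4. ],
               [2. , 2. , 3. , 3. ],
               [1.5, 1.5, 2.5, 2.5],
               [0.5, 1. , 1.5, 1.5]])

counts1 = np.ones(4)   # player 2's counts of player 1's past plays (uniform prior)
counts2 = np.ones(4)   # player 1's counts of player 2's past plays
for t in range(2000):
    belief2 = counts2 / counts2.sum()     # player 1's empirical belief about player 2
    belief1 = counts1 / counts1.sum()     # player 2's empirical belief about player 1
    a1 = np.argmax(P1 @ belief2)          # player 1 best-responds to its belief
    a2 = np.argmax(belief1 @ P2)          # player 2 best-responds to its belief
    counts1[a1] += 1
    counts2[a2] += 1

print(counts1 / counts1.sum())   # mass concentrates on {A, B} (here on A)
print(counts2 / counts2.sum())   # mass concentrates on {C, D} (here on C)
```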
Using experimentation, the only observations required by each player are its own payoffs in each period (in addition to the number of available actions). So-called bandit algorithms [7] must manage the exploration-exploitation tradeoff. The proof of their convergence follows from the fact that they are no-regret algorithms. (No-regret algorithms date back to Hannan in the 1950's [12], but his algorithms required perfect monitoring.) The regret of a player i at stage T is defined to be regret of i at T = (1/T) max_{si ∈ Si} Σ_{t=1}^{T} [ pi(si, s^t_−i) − pi(s^t_i, s^t_−i) ], that is, how much better in hindsight player i could have done on the first T stages had it used one fixed strategy the whole time (and had the opponents not changed their strategies). Note that regret can be positive or negative. A no-regret algorithm is one in which each player's asymptotic regret converges to (−∞, 0], i.e., is guaranteed to approach 0 or less. It is well known that the no-regret condition in two-person k-sum games implies convergence to equilibrium (see, e.g., [13]). In particular, the pair of mixed strategies which are the empirical distributions of play over time approaches the set of Nash equilibria of the stage game. Inverse-polynomial rates of convergence (that are polynomial also in the size of the game) can be given for such algorithms. Hence no-regret algorithms provide arguably reasonable ways to play a k-sum game of moderate size. Note that in general-sum games, no such dynamics are known. Fortunately, the same algorithms that work for learning in k-sum games seem to work for learning in such games with strategy costs. Theorem 2. Fictitious play converges to the set of Nash equilibria of the stage game in a two-person k-sum game with strategy costs, as do no-regret learning dynamics. Proof. The proof again follows from equation (1) regarding the game H. Fictitious play dynamics are defined only in terms of best response play. Since G−c and H share the same best responses, fictitious play dynamics are identical for the two games. Since they share the same equilibria and fictitious play converges to equilibria in H, it must converge in G−c as well. For no-regret algorithms, equation (1) again implies that for any play sequence, the regret of each player i with respect to game G−c is the same as its regret with respect to the game H. Hence, no regret in G−c implies no regret in H. Since no-regret algorithms converge to the set of equilibria in k-sum games, they converge to the set of equilibria in H and therefore in G−c as well. 4 Potential games with strategy costs Let us begin with an example of a potential game, called a routing game [18]. There is a fixed directed graph with n nodes and m edges. Commuters i = 1, 2, . . . , N each decide on a route πi to take from their home si to their work ti, where si and ti are nodes in the graph. For each edge uv, let nuv be the number of commuters whose path πi contains edge uv. Let fuv : Z → R be a nonnegative monotonically increasing congestion function. Player i's payoff is −Σ_{uv ∈ πi} fuv(nuv), i.e., the negative sum of the congestions on the edges in its path. An N-person normal form game G is said to be a potential game [15] if there is some potential function Φ : S1 × . . . × SN → R such that changing a single player's action changes its payoff by the change in the potential function. That is, there exists a single function Φ, such that for all players i and all pure strategy vectors s, s′ ∈ S1 × . . .
× SN that differ only in the ith coordinate, pi (s) − pi (s ) = Φ(s) − Φ(s ). (2) Potential games have appealing learning properties: simple better-reply dynamics converge to purestrategy Nash equilibria, as do the more sophisticated fictitious-play dynamics described earlier [15]. In our example, this means that if players change their individual paths so as to selfishly reduce the sum of congestions on their path, this will eventually lead to an equilibrium where no one can improve. (This is easy to see because Φ keeps increasing.) The absence of similar learning properties for general games presents a frustrating hole in learning and game theory. It is clear that the theoretically clean commuting example above misses some realistic considerations. One issue regarding complexity is that most commuters would not be willing to take a very complicated route just to save a short amount of time. To model this, we consider potential games with strategy costs. In our example, this would be a cost associated with every path. For example, suppose the graph represented streets in a given city. We consider a natural strategy complexity cost associated with a route π, say λ(#turns(π))2 , where there is a parameter λ ∈ R and #turns(π) is defined as the number of times that a commuter has to turn on a route. (To be more precise, say each edge in the graph is annotated with a street name, and a turn is defined to be a pair of consecutive edges in the graph with different street names.) Hence, a best response for player i would minimize: π min (total congestion of π) + λ(#turns(π))2 . from si to ti While adding strategy costs to potential games allows for much more flexibility in model design, one might worry that appealing properties of potential games, such as having pure strategy equilibria and easy learning dynamics, no longer hold. This is not the case. We show that strategic costs fit easily into the potential game framework: Theorem 3. For any potential game G and any cost functions c, G−c is also a potential game. Proof. Let Φ be a potential function for G. It is straightforward to verify that the G−c admits the following potential function Φ : Φ (s1 , . . . , sN ) = Φ(s1 , . . . , sN ) − c1 (s1 ) − . . . − cN (sN ). 5 Additional remarks Part of the reason that the notion of bounded rationality is so difficult to formalize is that understanding enormous games like chess is a daunting proposition. That is why we have narrowed it down to choosing among a small number of available programs. A game theorist might begin by examining the complete payoff table of Figure 1a, which is prohibitively large. Instead of considering only the choices of programs A and B, each player considers all possible chess strategies. In that sense, our payoff table in 1a would be viewed as a reduction of the “real” normal form game. A computer scientist, on the other hand, may consider it reasonable to begin with the existing strategies that one has access to. Regardless of how you view the process, it is clear that for practical purposes players in real life do simplify and analyze “smaller” sets of strategies. Even if the players consider the option of engineering new chess-playing software, this can be viewed as a third strategy in the game, with its own cost and expected payoffs. Again, when considering small number of available strategies, like the two programs above, it may still be difficult to assess the expected payoffs that result when (possibly randomized) strategies play against each other. 
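Theorem 3 is easy to verify numerically: build any two-player potential game (payoffs equal to a potential Φ plus a term that depends only on the opponent's strategy), add strategy costs, and check that Φ minus the summed costs satisfies equation (2) for every unilateral deviation. A minimal sketch with a randomly generated game (the construction and names are ours, for illustration):

```python
import numpy as np
rng = np.random.default_rng(0)

n1, n2 = 3, 4
Phi = rng.normal(size=(n1, n2))          # a potential function on S1 x S2
p1 = Phi + rng.normal(size=n2)[None, :]  # player 1's payoff: Phi plus an opponent-only term
p2 = Phi + rng.normal(size=n1)[:, None]  # player 2's payoff: Phi plus an opponent-only term

c1, c2 = rng.uniform(size=n1), rng.uniform(size=n2)   # strategy costs
p1c, p2c = p1 - c1[:, None], p2 - c2[None, :]         # payoffs of G^{-c}
Phi_c = Phi - c1[:, None] - c2[None, :]               # candidate potential from Theorem 3

ok = True
for i in range(n1):                      # unilateral deviations of player 1
    for i2 in range(n1):
        for j in range(n2):
            ok &= np.isclose(p1c[i, j] - p1c[i2, j], Phi_c[i, j] - Phi_c[i2, j])
for j in range(n2):                      # unilateral deviations of player 2
    for j2 in range(n2):
        for i in range(n1):
            ok &= np.isclose(p2c[i, j] - p2c[i, j2], Phi_c[i, j] - Phi_c[i, j2])
print(ok)   # True: G^{-c} is a potential game with potential Phi_c
```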
An additional assumption made throughout the paper is that the players share the same assessments about these expected payoffs. Like other common-knowledge assumptions made in game theory, it would be desirable to weaken this assumption. In the special families of games studied in this paper, and perhaps in additional cases, learning algorithms may be employed to reach equilibrium without knowledge of payoffs. 5.1 Finite automata playing repeated games There has been a large body of interesting work on repeated games played by finite automata (see [14] for a survey). Much of this work is on achieving cooperation in the classic prisoner’s dilemma game (e.g., [2, 3, 4, 5]). Many of these models can be incorporated into the general model outlined in this paper. For example, to view the Abreu and Rubinstein model [6] as such, consider the normal form of an infinitely repeated game with discounting, but restricted to strategies that can be described by finite automata (the payoffs in every cell of the payoff table are the discounted sums of the infinite streams of payoffs obtained in the repeated game). Let the cost of a strategy be an increasing function of the number of states it employs. For Neyman’s model [3], consider the normal form of a finitely repeated game with a known number of repetitions. You may consider strategies in this normal form to be only ones with a bounded number of states, as required by Neyman, and assign zero cost to all strategies. Alternatively, you may allow all strategies but assign zero cost to ones that employ number of states below Neyman’s bounds, and an infinite cost to strategies that employ a number of states that exceeds Neyman’s bounds. The structure of equilibria proven in Theorem 1 applies to all the above models when dealing with repeated k-sum games, as in [2]. 6 Future work There are very interesting questions to answer about bounded rationality in truly large games that we did not touch upon. For example, consider the factoring game from the introduction. A pure strategy for Player 1 would be outputting a single n-bit number. A pure strategy for Player 2 would be any factoring program, described by a circuit that takes as input an n-bit number and attempts to output a representation of its prime factorization. The complexity of such a strategy would be an increasing function of the number of gates in the circuit. It would be interesting to make connections between asymptotic algorithm complexity and games. Another direction regards an elegant line of work on learning to play correlated equilibria by repeated play [11]. It would be natural to consider how strategy costs affect correlated equilibria. Finally, it would also be interesting to see how strategy costs affect the so-called “price of anarchy” [19] in congestion games. Acknowledgments This work was funded in part by a U.S. NSF grant SES-0527656, a Landau Fellowship supported by the Taub and Shalom Foundations, a European Community International Reintegration Grant, an Alon Fellowship, ISF grant 679/06, and BSF grant 2004092. Part of this work was done while the first and second authors were at the Toyota Technological Institute at Chicago. References [1] H. Simon. The sciences of the artificial. MIT Press, Cambridge, MA, 1969. [2] E. Ben-Porath. Repeated games with finite automata, Journal of Economic Theory 59: 17–32, 1993. [3] A. Neyman. Bounded Complexity Justifies Cooperation in the Finitely Repeated Prisoner’s Dilemma. Economic Letters, 19: 227–229, 1985. [4] A. Rubenstein. 
Finite automata play the repeated prisoner’s dilemma. Journal of Economic Theory, 39:83– 96, 1986. [5] C. Papadimitriou, M. Yannakakis: On complexity as bounded rationality. In Proceedings of the TwentySixth Annual ACM Symposium on Theory of Computing, pp. 726–733, 1994. [6] D. Abreu and A. Rubenstein. The Structure of Nash Equilibrium in Repeated Games with Finite Automata. Econometrica 56:1259-1281, 1988. [7] P. Auer, N. Cesa-Bianchi, Y. Freund, R. Schapire. The Nonstochastic Multiarmed Bandit Problem. SIAM J. Comput. 32(1):48-77, 2002. [8] X. Chen, X. Deng, and S. Teng. Computing Nash Equilibria: Approximation and smoothed complexity. Electronic Colloquium on Computational Complexity Report TR06-023, 2006. [9] K. Daskalakis, P. Goldberg, C. Papadimitriou. The complexity of computing a Nash equilibrium. Electronic Colloquium on Computational Complexity Report TR05-115, 2005. [10] C. Ewerhart. Chess-like Games Are Dominance Solvable in at Most Two Steps. Games and Economic Behavior, 33:41-47, 2000. [11] D. Foster and R. Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 21:40-55, 1997. [12] J. Hannan. Approximation to Bayes risk in repeated play. In M. Dresher, A. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume 3, pp. 97–139. Princeton University Press, 1957. [13] S. Hart and A. Mas-Colell. A General Class of Adaptive Strategies. Journal of Economic Theory 98(1):26– 54, 2001. [14] E. Kalai. Bounded rationality and strategic complexity in repeated games. In T. Ichiishi, A. Neyman, and Y. Tauman, editors, Game Theory and Applications, pp. 131–157. Academic Press, San Diego, 1990. [15] D. Monderer, L. Shapley. Potential games. Games and Economic Behavior, 14:124–143, 1996. [16] H. Moulin and P. Vial. Strategically Zero Sum Games: the Class of Games Whose Completely Mixed Equilibria Cannot Be Improved Upon. International Journal of Game Theory, 7:201–221, 1978. [17] J. Robinson, An iterative method of solving a game, Ann. Math. 54:296–301, 1951. [18] R. Rosenthal. A Class of Games Possessing Pure-Strategy Nash Equilibria. International Journal of Game Theory, 2:65–67, 1973. [19] E. Koutsoupias and C. Papadimitriou. Worstcase equilibria. In Proceedings of the 16th Annual Symposium on Theoretical Aspects of Computer Science, pp. 404–413, 1999.

4 0.19887732 61 nips-2006-Convex Repeated Games and Fenchel Duality

Author: Shai Shalev-shwartz, Yoram Singer

Abstract: We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can be all derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization. 1 Introduction and Problem Setting Several problems arising in machine learning can be modeled as a convex repeated game. Convex repeated games are closely related to online convex programming (see [19, 9] and the discussion in the last section). A convex repeated game is a two players game that is performed in a sequence of consecutive rounds. On round t of the repeated game, the first player chooses a vector wt from a convex set S. Next, the second player responds with a convex function gt : S → R. Finally, the first player suffers an instantaneous loss gt (wt ). We study the game from the viewpoint of the first player. The goal of the first player is to minimize its cumulative loss, t gt (wt ). To motivate this rather abstract setting let us first cast the more familiar setting of online learning as a convex repeated game. Online learning is performed in a sequence of consecutive rounds. On round t, the learner first receives a question, cast as a vector xt , and is required to provide an answer for this question. For example, xt can be an encoding of an email message and the question is whether the email is spam or not. The prediction of the learner is performed based on an hypothesis, ht : X → Y, where X is the set of questions and Y is the set of possible answers. In the aforementioned example, Y would be {+1, −1} where +1 stands for a spam email and −1 stands for a benign one. After predicting an answer, the learner receives the correct answer for the question, denoted yt , and suffers loss according to a loss function (ht , (xt , yt )). In most cases, the hypotheses used for prediction come from a parameterized set of hypotheses, H = {hw : w ∈ S}. For example, the set of linear classifiers, which is used for answering yes/no questions, is defined as H = {hw (x) = sign( w, x ) : w ∈ Rn }. Thus, rather than saying that on round t the learner chooses a hypothesis, we can say that the learner chooses a vector wt and its hypothesis is hwt . Next, we note that once the environment chooses a question-answer pair (xt , yt ), the loss function becomes a function over the hypotheses space or equivalently over the set of parameter vectors S. We can therefore redefine the online learning process as follows. On round t, the learner chooses a vector wt ∈ S, which defines a hypothesis hwt to be used for prediction. Then, the environment chooses a questionanswer pair (xt , yt ), which induces the following loss function over the set of parameter vectors, gt (w) = (hw , (xt , yt )). Finally, the learner suffers the loss gt (wt ) = (hwt , (xt , yt )). We have therefore described the process of online learning as a convex repeated game. In this paper we assess the performance of the first player using the notion of regret. Given a number of rounds T and a fixed vector u ∈ S, we define the regret of the first player as the excess loss for not consistently playing the vector u, 1 T T gt (wt ) − t=1 1 T T gt (u) . 
t=1 Our main result is an algorithmic framework for the first player which guarantees low regret with respect to any vector u ∈ S. Specifically, we derive regret bounds that take the following form ∀u ∈ S, 1 T T gt (wt ) − t=1 1 T T gt (u) ≤ t=1 f (u) + L √ , T (1) where f : S → R and L ∈ R+ . Informally, the function f measures the “complexity” of vectors in S and the scalar L is related to some generalized Lipschitz property of the functions g1 , . . . , gT . We defer the exact requirements we impose on f and L to later sections. Our algorithmic framework emerges from a representation of the regret bound given in Eq. (1) using an optimization problem. Specifically, we rewrite Eq. (1) as follows 1 T T gt (wt ) ≤ inf t=1 u∈S 1 T T gt (u) + t=1 f (u) + L √ . T (2) That is, the average loss of the first player should be bounded above by the minimum value of an optimization problem in which we jointly minimize the average loss of u and the “complexity” of u as measured by the function f . Note that the optimization problem on the right-hand side of Eq. (2) can only be solved in hindsight after observing the entire sequence of loss functions. Nevertheless, writing the regret bound as in Eq. (2) implies that the average loss of the first player forms a lower bound for a minimization problem. The notion of duality, commonly used in convex optimization theory, plays an important role in obtaining lower bounds for the minimal value of a minimization problem (see for example [14]). By generalizing the notion of Fenchel duality, we are able to derive a dual optimization problem, which can be optimized incrementally, as the game progresses. In order to derive explicit quantitative regret bounds we make an immediate use of the fact that dual objective lower bounds the primal objective. We therefore reduce the process of playing convex repeated games to the task of incrementally increasing the dual objective function. The amount by which the dual increases serves as a new and natural notion of progress. By doing so we are able to tie the primal objective value, the average loss of the first player, and the increase in the dual. The rest of this paper is organized as follows. In Sec. 2 we establish our notation and point to a few mathematical tools that we use throughout the paper. Our main tool for deriving algorithms for playing convex repeated games is a generalization of Fenchel duality, described in Sec. 3. Our algorithmic framework is given in Sec. 4 and analyzed in Sec. 5. The generality of our framework allows us to utilize it in different problems arising in machine learning. Specifically, in Sec. 6 we underscore the applicability of our framework for online learning and in Sec. 7 we outline and analyze boosting algorithms based on our framework. We conclude with a discussion and point to related work in Sec. 8. Due to the lack of space, some of the details are omitted from the paper and can be found in [16]. 2 Mathematical Background We denote scalars with lower case letters (e.g. x and w), and vectors with bold face letters (e.g. x and w). The inner product between vectors x and w is denoted by x, w . Sets are designated by upper case letters (e.g. S). The set of non-negative real numbers is denoted by R+ . For any k ≥ 1, the set of integers {1, . . . , k} is denoted by [k]. A norm of a vector x is denoted by x . The dual norm is defined as λ = sup{ x, λ : x ≤ 1}. 
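As a quick numerical sanity check of this definition, and anticipating the examples below, one can verify that the dual of the ℓ1 norm is the ℓ∞ norm (an illustrative sketch; the sampling scheme is ours):

```python
import numpy as np
rng = np.random.default_rng(1)

lam = rng.normal(size=6)
xs = rng.normal(size=(100_000, 6))
xs /= np.abs(xs).sum(axis=1, keepdims=True)     # random points with ||x||_1 = 1

print(np.max(xs @ lam))          # never exceeds ...
print(np.max(np.abs(lam)))       # ... the l-infinity norm of lam
i = np.argmax(np.abs(lam))
x_star = np.zeros(6)
x_star[i] = np.sign(lam[i])      # a signed coordinate vector attains the supremum
print(x_star @ lam)              # equals max_i |lam_i|
```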
For example, the Euclidean norm, x 2 = ( x, x )1/2 is dual to itself and the 1 norm, x 1 = i |xi |, is dual to the ∞ norm, x ∞ = maxi |xi |. We next recall a few definitions from convex analysis. The reader familiar with convex analysis may proceed to Lemma 1 while for a more thorough introduction see for example [1]. A set S is convex if for any two vectors w1 , w2 in S, all the line between w1 and w2 is also within S. That is, for any α ∈ [0, 1] we have that αw1 + (1 − α)w2 ∈ S. A set S is open if every point in S has a neighborhood lying in S. A set S is closed if its complement is an open set. A function f : S → R is closed and convex if for any scalar α ∈ R, the level set {w : f (w) ≤ α} is closed and convex. The Fenchel conjugate of a function f : S → R is defined as f (θ) = supw∈S w, θ − f (w) . If f is closed and convex then the Fenchel conjugate of f is f itself. The Fenchel-Young inequality states that for any w and θ we have that f (w) + f (θ) ≥ w, θ . A vector λ is a sub-gradient of a function f at w if for all w ∈ S we have that f (w ) − f (w) ≥ w − w, λ . The differential set of f at w, denoted ∂f (w), is the set of all sub-gradients of f at w. If f is differentiable at w then ∂f (w) consists of a single vector which amounts to the gradient of f at w and is denoted by f (w). Sub-gradients play an important role in the definition of Fenchel conjugate. In particular, the following lemma states that if λ ∈ ∂f (w) then Fenchel-Young inequality holds with equality. Lemma 1 Let f be a closed and convex function and let ∂f (w ) be its differential set at w . Then, for all λ ∈ ∂f (w ) we have, f (w ) + f (λ ) = λ , w . A continuous function f is σ-strongly convex over a convex set S with respect to a norm · if S is contained in the domain of f and for all v, u ∈ S and α ∈ [0, 1] we have 1 (3) f (α v + (1 − α) u) ≤ α f (v) + (1 − α) f (u) − σ α (1 − α) v − u 2 . 2 Strongly convex functions play an important role in our analysis primarily due to the following lemma. Lemma 2 Let · be a norm over Rn and let · be its dual norm. Let f be a σ-strongly convex function on S and let f be its Fenchel conjugate. Then, f is differentiable with f (θ) = arg maxx∈S θ, x − f (x). Furthermore, for any θ, λ ∈ Rn we have 1 f (θ + λ) − f (θ) ≤ f (θ), λ + λ 2 . 2σ Two notable examples of strongly convex functions which we use are as follows. 1 Example 1 The function f (w) = 2 w norm. Its conjugate function is f (θ) = 2 2 1 2 is 1-strongly convex over S = Rn with respect to the θ 2. 2 2 n 1 Example 2 The function f (w) = i=1 wi log(wi / n ) is 1-strongly convex over the probabilistic n simplex, S = {w ∈ R+ : w 1 = 1}, with respect to the 1 norm. Its conjugate function is n 1 f (θ) = log( n i=1 exp(θi )). 3 Generalized Fenchel Duality In this section we derive our main analysis tool. We start by considering the following optimization problem, T inf c f (w) + t=1 gt (w) , w∈S where c is a non-negative scalar. An equivalent problem is inf w0 ,w1 ,...,wT c f (w0 ) + T t=1 gt (wt ) s.t. w0 ∈ S and ∀t ∈ [T ], wt = w0 . Introducing T vectors λ1 , . . . , λT , each λt ∈ Rn is a vector of Lagrange multipliers for the equality constraint wt = w0 , we obtain the following Lagrangian T T L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) = c f (w0 ) + t=1 gt (wt ) + t=1 λt , w0 − wt . The dual problem is the task of maximizing the following dual objective value, D(λ1 , . . . , λT ) = inf L(w0 , w1 , . . . , wT , λ1 , . . . 
, λT ) w0 ∈S,w1 ,...,wT = − c sup w0 ∈S = −c f −1 c w0 , − 1 c T t=1 T t=1 λt − λt − f (w0 ) − T t=1 gt (λt ) , T t=1 sup ( wt , λt − gt (wt )) wt where, following the exposition of Sec. 2, f , g1 , . . . , gT are the Fenchel conjugate functions of f, g1 , . . . , gT . Therefore, the generalized Fenchel dual problem is sup − cf λ1 ,...,λT −1 c T t=1 λt − T t=1 gt (λt ) . (4) Note that when T = 1 and c = 1, the above duality is the so called Fenchel duality. 4 A Template Learning Algorithm for Convex Repeated Games In this section we describe a template learning algorithm for playing convex repeated games. As mentioned before, we study convex repeated games from the viewpoint of the first player which we shortly denote as P1. Recall that we would like our learning algorithm to achieve a regret bound of the form given in Eq. (2). We start by rewriting Eq. (2) as follows T m gt (wt ) − c L ≤ inf u∈S t=1 c f (u) + gt (u) , (5) t=1 √ where c = T . Thus, up to the sublinear term c L, the cumulative loss of P1 lower bounds the optimum of the minimization problem on the right-hand side of Eq. (5). In the previous section we derived the generalized Fenchel dual of the right-hand side of Eq. (5). Our construction is based on the weak duality theorem stating that any value of the dual problem is smaller than the optimum value of the primal problem. The algorithmic framework we propose is therefore derived by incrementally ascending the dual objective function. Intuitively, by ascending the dual objective we move closer to the optimal primal value and therefore our performance becomes similar to the performance of the best fixed weight vector which minimizes the right-hand side of Eq. (5). Initially, we use the elementary dual solution λ1 = 0 for all t. We assume that inf w f (w) = 0 and t for all t inf w gt (w) = 0 which imply that D(λ1 , . . . , λ1 ) = 0. We assume in addition that f is 1 T σ-strongly convex. Therefore, based on Lemma 2, the function f is differentiable. At trial t, P1 uses for prediction the vector wt = f −1 c T i=1 λt i . (6) After predicting wt , P1 receives the function gt and suffers the loss gt (wt ). Then, P1 updates the dual variables as follows. Denote by ∂t the differential set of gt at wt , that is, ∂t = {λ : ∀w ∈ S, gt (w) − gt (wt ) ≥ λ, w − wt } . (7) The new dual variables (λt+1 , . . . , λt+1 ) are set to be any set of vectors which satisfy the following 1 T two conditions: (i). ∃λ ∈ ∂t s.t. D(λt+1 , . . . , λt+1 ) ≥ D(λt , . . . , λt , λ , λt , . . . , λt ) 1 1 t−1 t+1 T T (ii). ∀i > t, λt+1 = 0 i . (8) In the next section we show that condition (i) ensures that the increase of the dual at trial t is proportional to the loss gt (wt ). The second condition ensures that we can actually calculate the dual at trial t without any knowledge on the yet to be seen loss functions gt+1 , . . . , gT . We conclude this section with two update rules that trivially satisfy the above two conditions. The first update scheme simply finds λ ∈ ∂t and set λt+1 = i λ λt i if i = t if i = t . (9) The second update defines (λt+1 , . . . , λt+1 ) = argmax D(λ1 , . . . , λT ) 1 T λ1 ,...,λT s.t. ∀i = t, λi = λt . i (10) 5 Analysis In this section we analyze the performance of the template algorithm given in the previous section. Our proof technique is based on monitoring the value of the dual objective function. The main result is the following lemma which gives upper and lower bounds for the final value of the dual objective function. 
Lemma 3 Let f be a σ-strongly convex function with respect to a norm · over a set S and assume that minw∈S f (w) = 0. Let g1 , . . . , gT be a sequence of convex and closed functions such that inf w gt (w) = 0 for all t ∈ [T ]. Suppose that a dual-incrementing algorithm which satisfies the conditions of Eq. (8) is run with f as a complexity function on the sequence g1 , . . . , gT . Let w1 , . . . , wT be the sequence of primal vectors that the algorithm generates and λT +1 , . . . , λT +1 1 T be its final sequence of dual variables. Then, there exists a sequence of sub-gradients λ1 , . . . , λT , where λt ∈ ∂t for all t, such that T 1 gt (wt ) − 2σc t=1 T T λt 2 ≤ D(λT +1 , . . . , λT +1 ) 1 T t=1 ≤ inf c f (w) + w∈S gt (w) . t=1 Proof The second inequality follows directly from the weak duality theorem. Turning to the left most inequality, denote ∆t = D(λt+1 , . . . , λt+1 ) − D(λt , . . . , λt ) and note that 1 1 T T T D(λ1 +1 , . . . , λT +1 ) can be rewritten as T T t=1 D(λT +1 , . . . , λT +1 ) = 1 T T t=1 ∆t − D(λ1 , . . . , λ1 ) = 1 T ∆t , (11) where the last equality follows from the fact that f (0) = g1 (0) = . . . = gT (0) = 0. The definition of the update implies that ∆t ≥ D(λt , . . . , λt , λt , 0, . . . , 0) − D(λt , . . . , λt , 0, 0, . . . , 0) for 1 t−1 1 t−1 t−1 some subgradient λt ∈ ∂t . Denoting θ t = − 1 j=1 λj , we now rewrite the lower bound on ∆t as, c ∆t ≥ −c (f (θ t − λt /c) − f (θ t )) − gt (λt ) . Using Lemma 2 and the definition of wt we get that 1 (12) ∆t ≥ wt , λt − gt (λt ) − 2 σ c λt 2 . Since λt ∈ ∂t and since we assume that gt is closed and convex, we can apply Lemma 1 to get that wt , λt − gt (λt ) = gt (wt ). Plugging this equality into Eq. (12) and summing over t we obtain that T T T 1 2 . t=1 ∆t ≥ t=1 gt (wt ) − 2 σ c t=1 λt Combining the above inequality with Eq. (11) concludes our proof. The following regret bound follows as a direct corollary of Lemma 3. T 1 Theorem 1 Under the same conditions of Lemma 3. Denote L = T t=1 λt w ∈ S we have, T T c f (w) 1 1 + 2L c . t=1 gt (wt ) − T t=1 gt (w) ≤ T T σ √ In particular, if c = T , we obtain the bound, 1 T 6 T t=1 gt (wt ) − 1 T T t=1 gt (w) ≤ f (w)+L/(2 σ) √ T 2 . Then, for all . Application to Online learning In Sec. 1 we cast the task of online learning as a convex repeated game. We now demonstrate the applicability of our algorithmic framework for the problem of instance ranking. We analyze this setting since several prediction problems, including binary classification, multiclass prediction, multilabel prediction, and label ranking, can be cast as special cases of the instance ranking problem. Recall that on each online round, the learner receives a question-answer pair. In instance ranking, the question is encoded by a matrix Xt of dimension kt × n and the answer is a vector yt ∈ Rkt . The semantic of yt is as follows. For any pair (i, j), if yt,i > yt,j then we say that yt ranks the i’th row of Xt ahead of the j’th row of Xt . We also interpret yt,i − yt,j as the confidence in which the i’th row should be ranked ahead of the j’th row. For example, each row of Xt encompasses a representation of a movie while yt,i is the movie’s rating, expressed as the number of stars this movie has received by a movie reviewer. The predictions of the learner are determined ˆ based on a weight vector wt ∈ Rn and are defined to be yt = Xt wt . Finally, let us define two loss functions for ranking, both generalize the hinge-loss used in binary classification problems. Denote by Et the set {(i, j) : yt,i > yt,j }. 
For all (i, j) ∈ Et we define a pair-based hinge-loss i,j (w; (Xt , yt )) = [(yt,i − yt,j ) − w, xt,i − xt,j ]+ , where [a]+ = max{a, 0} and xt,i , xt,j are respectively the i’th and j’th rows of Xt . Note that i,j is zero if w ranks xt,i higher than xt,j with a sufficient confidence. Ideally, we would like i,j (wt ; (Xt , yt )) to be zero for all (i, j) ∈ Et . If this is not the case, we are being penalized according to some combination of the pair-based losses i,j . For example, we can set (w; (Xt , yt )) to be the average over the pair losses, 1 avg (w; (Xt , yt )) = |Et | (i,j)∈Et i,j (w; (Xt , yt )) . This loss was suggested by several authors (see for example [18]). Another popular approach (see for example [5]) penalizes according to the maximal loss over the individual pairs, max (w; (Xt , yt )) = max(i,j)∈Et i,j (w; (Xt , yt )) . We can apply our algorithmic framework given in Sec. 4 for ranking, using for gt (w) either avg (w; (Xt , yt )) or max (w; (Xt , yt )). The following theorem provides us with a sufficient condition under which the regret bound from Thm. 1 holds for ranking as well. Theorem 2 Let f be a σ-strongly convex function over S with respect to a norm · . Denote by Lt the maximum over (i, j) ∈ Et of xt,i − xt,j 2 . Then, for both gt (w) = avg (w; (Xt , yt )) and ∗ gt (w) = max (w; (Xt , yt )), the following regret bound holds ∀u ∈ S, 7 1 T T t=1 gt (wt ) − 1 T T t=1 gt (u) ≤ 1 f (u)+ T PT t=1 Lt /(2 σ) √ T . The Boosting Game In this section we describe the applicability of our algorithmic framework to the analysis of boosting algorithms. A boosting algorithm uses a weak learning algorithm that generates weak-hypotheses whose performances are just slightly better than random guessing to build a strong-hypothesis which can attain an arbitrarily low error. The AdaBoost algorithm, proposed by Freund and Schapire [6], receives as input a training set of examples {(x1 , y1 ), . . . , (xm , ym )} where for all i ∈ [m], xi is taken from an instance domain X , and yi is a binary label, yi ∈ {+1, −1}. The boosting process proceeds in a sequence of consecutive trials. At trial t, the booster first defines a distribution, denoted wt , over the set of examples. Then, the booster passes the training set along with the distribution wt to the weak learner. The weak learner is assumed to return a hypothesis ht : X → {+1, −1} whose average error is slightly smaller than 1 . That is, there exists a constant γ > 0 such that, 2 def m 1−yi ht (xi ) = ≤ 1 −γ . (13) i=1 wt,i 2 2 The goal of the boosting algorithm is to invoke the weak learner several times with different distributions, and to combine the hypotheses returned by the weak learner into a final, so called strong, hypothesis whose error is small. The final hypothesis combines linearly the T hypotheses returned by the weak learner with coefficients α1 , . . . , αT , and is defined to be the sign of hf (x) where T hf (x) = t=1 αt ht (x) . The coefficients α1 , . . . , αT are determined by the booster. In Ad1 1 aBoost, the initial distribution is set to be the uniform distribution, w1 = ( m , . . . , m ). At iter1 ation t, the value of αt is set to be 2 log((1 − t )/ t ). The distribution is updated by the rule wt+1,i = wt,i exp(−αt yi ht (xi ))/Zt , where Zt is a normalization factor. Freund and Schapire [6] have shown that under the assumption given in Eq. (13), the error of the final strong hypothesis is at most exp(−2 γ 2 T ). 
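Returning to the instance-ranking losses defined earlier in this section, the pair-based hinge loss and its avg/max aggregations are straightforward to implement; a minimal sketch (illustrative Python; the function names and the toy data are ours):

```python
import numpy as np

def pair_losses(w, X, y):
    """[(y_i - y_j) - <w, x_i - x_j>]_+ over all pairs (i, j) with y_i > y_j."""
    scores = X @ w
    return np.array([max(0.0, (y[i] - y[j]) - (scores[i] - scores[j]))
                     for i in range(len(y)) for j in range(len(y)) if y[i] > y[j]])

def g_avg(w, X, y):   # average over the pair losses
    return pair_losses(w, X, y).mean()

def g_max(w, X, y):   # maximal loss over the individual pairs
    return pair_losses(w, X, y).max()

# Three instances in R^2 rated 3 > 2 > 1; w ranks them correctly but with too small a margin.
X = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
y = np.array([3.0, 2.0, 1.0])
w = np.array([0.5, -0.5])
print(g_avg(w, X, y), g_max(w, X, y))   # 0.666..., 1.0
```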
t Several authors [15, 13, 8, 4] have proposed to view boosting as a coordinate-wise greedy optimization process. To do so, note first that hf errs on an example (x, y) iff y hf (x) ≤ 0. Therefore, the exp-loss function, defined as exp(−y hf (x)), is a smooth upper bound of the zero-one error, which equals to 1 if y hf (x) ≤ 0 and to 0 otherwise. Thus, we can restate the goal of boosting as minimizing the average exp-loss of hf over the training set with respect to the variables α1 , . . . , αT . To simplify our derivation in the sequel, we prefer to say that boosting maximizes the negation of the loss, that is, T m 1 (14) max − m i=1 exp −yi t=1 αt ht (xi ) . α1 ,...,αT In this view, boosting is an optimization procedure which iteratively maximizes Eq. (14) with respect to the variables α1 , . . . , αT . This view of boosting, enables the hypotheses returned by the weak learner to be general functions into the reals, ht : X → R (see for instance [15]). In this paper we view boosting as a convex repeated game between a booster and a weak learner. To motivate our construction, we would like to note that boosting algorithms define weights in two different domains: the vectors wt ∈ Rm which assign weights to examples and the weights {αt : t ∈ [T ]} over weak-hypotheses. In the terminology used throughout this paper, the weights wt ∈ Rm are primal vectors while (as we show in the sequel) each weight αt of the hypothesis ht is related to a dual vector λt . In particular, we show that Eq. (14) is exactly the Fenchel dual of a primal problem for a convex repeated game, thus the algorithmic framework described thus far for playing games naturally fits the problem of iteratively solving Eq. (14). To derive the primal problem whose Fenchel dual is the problem given in Eq. (14) let us first denote by vt the vector in Rm whose ith element is vt,i = yi ht (xi ). For all t, we set gt to be the function gt (w) = [ w, vt ]+ . Intuitively, gt penalizes vectors w which assign large weights to examples which are predicted accurately, that is yi ht (xi ) > 0. In particular, if ht (xi ) ∈ {+1, −1} and wt is a distribution over the m examples (as is the case in AdaBoost), gt (wt ) reduces to 1 − 2 t (see Eq. (13)). In this case, minimizing gt is equivalent to maximizing the error of the individual T hypothesis ht over the examples. Consider the problem of minimizing c f (w) + t=1 gt (w) where f (w) is the relative entropy given in Example 2 and c = 1/(2 γ) (see Eq. (13)). To derive its Fenchel dual, we note that gt (λt ) = 0 if there exists βt ∈ [0, 1] such that λt = βt vt and otherwise gt (λt ) = ∞ (see [16]). In addition, let us define αt = 2 γ βt . Since our goal is to maximize the αt dual, we can restrict λt to take the form λt = βt vt = 2 γ vt , and get that D(λ1 , . . . , λT ) = −c f − 1 c T βt vt t=1 =− 1 log 2γ 1 m m e− PT t=1 αt yi ht (xi ) . (15) i=1 Minimizing the exp-loss of the strong hypothesis is therefore the dual problem of the following primal minimization problem: find a distribution over the examples, whose relative entropy to the uniform distribution is as small as possible while the correlation of the distribution with each vt is as small as possible. Since the correlation of w with vt is inversely proportional to the error of ht with respect to w, we obtain that in the primal problem we are trying to maximize the error of each individual hypothesis, while in the dual problem we minimize the global error of the strong hypothesis. 
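The correspondence between Eq. (14) and Eq. (15) can be checked directly: with λt = βt vt and αt = 2γβt, the dual objective is −1/(2γ) times the log of the average exp-loss of the combined hypothesis, so increasing the dual is the same as decreasing the exp-loss. A minimal sketch with synthetic weak-hypothesis outputs (all data here is made up for illustration):

```python
import numpy as np
rng = np.random.default_rng(2)

m, T, gamma = 200, 5, 0.15
y = rng.choice([-1.0, 1.0], size=m)                 # binary labels
H = np.where(rng.random((T, m)) < 0.65, y, -y)      # T weak hypotheses, each ~65% accurate
alpha = rng.uniform(0.1, 0.5, size=T)               # combination weights alpha_t = 2*gamma*beta_t

margins = (alpha[:, None] * y[None, :] * H).sum(axis=0)   # sum_t alpha_t y_i h_t(x_i)
exp_loss = np.mean(np.exp(-margins))                      # the (negated) objective of Eq. (14)
D = -(1.0 / (2.0 * gamma)) * np.log(exp_loss)             # the dual objective of Eq. (15)
print(exp_loss, D)   # maximizing D over alpha is exactly minimizing the exp-loss
```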
The intuition of finding distributions which in retrospect result in large error rates of individual hypotheses was also alluded in [15, 8]. We can now apply our algorithmic framework from Sec. 4 to boosting. We describe the game αt with the parameters αt , where αt ∈ [0, 2 γ], and underscore that in our case, λt = 2 γ vt . At the beginning of the game the booster sets all dual variables to be zero, ∀t αt = 0. At trial t of the boosting game, the booster first constructs a primal weight vector wt ∈ Rm , which assigns importance weights to the examples in the training set. The primal vector wt is constructed as in Eq. (6), that is, wt = f (θ t ), where θ t = − i αi vi . Then, the weak learner responds by presenting the loss function gt (w) = [ w, vt ]+ . Finally, the booster updates the dual variables so as to increase the dual objective function. It is possible to show that if the range of ht is {+1, −1} 1 then the update given in Eq. (10) is equivalent to the update αt = min{2 γ, 2 log((1 − t )/ t )}. We have thus obtained a variant of AdaBoost in which the weights αt are capped above by 2 γ. A disadvantage of this variant is that we need to know the parameter γ. We would like to note in passing that this limitation can be lifted by a different definition of the functions gt . We omit the details due to the lack of space. To analyze our game of boosting, we note that the conditions given in Lemma 3 holds T and therefore the left-hand side inequality given in Lemma 3 tells us that t=1 gt (wt ) − T T +1 T +1 1 2 , . . . , λT ) . The definition of gt and the weak learnability ast=1 λt ∞ ≤ D(λ1 2c sumption given in Eq. (13) imply that wt , vt ≥ 2 γ for all t. Thus, gt (wt ) = wt , vt ≥ 2 γ which also implies that λt = vt . Recall that vt,i = yi ht (xi ). Assuming that the range of ht is [+1, −1] we get that λt ∞ ≤ 1. Combining all the above with the left-hand side inequality T given in Lemma 3 we get that 2 T γ − 2 c ≤ D(λT +1 , . . . , λT +1 ). Using the definition of D (see 1 T Eq. (15)), the value c = 1/(2 γ), and rearranging terms we recover the original bound for AdaBoost PT 2 m 1 −yi t=1 αt ht (xi ) ≤ e−2 γ T . i=1 e m 8 Related Work and Discussion We presented a new framework for designing and analyzing algorithms for playing convex repeated games. Our framework was used for the analysis of known algorithms for both online learning and boosting settings. The framework also paves the way to new algorithms. In a previous paper [17], we suggested the use of duality for the design of online algorithms in the context of mistake bound analysis. The contribution of this paper over [17] is three fold as we now briefly discuss. First, we generalize the applicability of the framework beyond the specific setting of online learning with the hinge-loss to the general setting of convex repeated games. The setting of convex repeated games was formally termed “online convex programming” by Zinkevich [19] and was first presented by Gordon in [9]. There is voluminous amount of work on unifying approaches for deriving online learning algorithms. We refer the reader to [11, 12, 3] for work closely related to the content of this paper. By generalizing our previously studied algorithmic framework [17] beyond online learning, we can automatically utilize well known online learning algorithms, such as the EG and p-norm algorithms [12, 11], to the setting of online convex programming. 
We would like to note that the algorithms presented in [19] can be derived as special cases of our algorithmic framework 1 by setting f (w) = 2 w 2 . Parallel and independently to this work, Gordon [10] described another algorithmic framework for online convex programming that is closely related to the potential based algorithms described by Cesa-Bianchi and Lugosi [3]. Gordon also considered the problem of defining appropriate potential functions. Our work generalizes some of the theorems in [10] while providing a somewhat simpler analysis. Second, the usage of generalized Fenchel duality rather than the Lagrange duality given in [17] enables us to analyze boosting algorithms based on the framework. Many authors derived unifying frameworks for boosting algorithms [13, 8, 4]. Nonetheless, our general framework and the connection between game playing and Fenchel duality underscores an interesting perspective of both online learning and boosting. We believe that this viewpoint has the potential of yielding new algorithms in both domains. Last, despite the generality of the framework introduced in this paper, the resulting analysis is more distilled than the earlier analysis given in [17] for two reasons. (i) The usage of Lagrange duality in [17] is somehow restricted while the notion of generalized Fenchel duality is more appropriate to the general and broader problems we consider in this paper. (ii) The strongly convex property we employ both simplifies the analysis and enables more intuitive conditions in our theorems. There are various possible extensions of the work that we did not pursue here due to the lack of space. For instanc, our framework can naturally be used for the analysis of other settings such as repeated games (see [7, 19]). The applicability of our framework to online learning can also be extended to other prediction problems such as regression and sequence prediction. Last, we conjecture that our primal-dual view of boosting will lead to new methods for regularizing boosting algorithms, thus improving their generalization capabilities. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. M. Collins, R.E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 2002. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. JMLR, 7, Mar 2006. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995. Y. Freund and R.E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996. J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 2000. G. Gordon. Regret bounds for prediction problems. In COLT, 1999. G. Gordon. No-regret algorithms for online convex programs. In NIPS, 2006. A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 2001. J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Journal of Machine Learning, 45(3),2001. L. Mason, J. Baxter, P. Bartlett, and M. Frean. 
Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999. Y. Nesterov. Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2005. R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1–40, 1999. S. Shalev-Shwartz and Y. Singer. Convex repeated games and fenchel duality. Technical report, The Hebrew University, 2006. S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In COLT, 2006. J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In ESANN, April 1999. M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.

5 0.17519514 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

Author: Peter Auer, Ronald Ortner

Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy. 1 1.1

6 0.14917222 203 nips-2006-implicit Online Learning with Kernels

7 0.11330687 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

8 0.10983674 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

9 0.099396296 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games

10 0.082050554 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

11 0.081076764 79 nips-2006-Fast Iterative Kernel PCA

12 0.079723634 13 nips-2006-A Scalable Machine Learning Approach to Go

13 0.076163411 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

14 0.069757335 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

15 0.068485215 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

16 0.068241477 116 nips-2006-Learning from Multiple Sources

17 0.063117623 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods

18 0.060918026 163 nips-2006-Prediction on a Graph with a Perceptron

19 0.06030973 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes

20 0.059758514 186 nips-2006-Support Vector Machines on a Budget


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.184), (1, 0.018), (2, -0.328), (3, -0.078), (4, -0.044), (5, -0.125), (6, -0.094), (7, -0.014), (8, -0.165), (9, 0.006), (10, -0.009), (11, 0.076), (12, -0.084), (13, 0.052), (14, 0.147), (15, 0.005), (16, 0.033), (17, 0.134), (18, -0.115), (19, 0.065), (20, 0.064), (21, -0.032), (22, 0.06), (23, 0.074), (24, -0.028), (25, -0.054), (26, 0.061), (27, 0.026), (28, 0.041), (29, 0.06), (30, 0.092), (31, -0.005), (32, -0.052), (33, -0.085), (34, 0.002), (35, 0.063), (36, 0.0), (37, 0.071), (38, -0.018), (39, -0.019), (40, 0.057), (41, -0.118), (42, -0.075), (43, -0.052), (44, 0.088), (45, -0.072), (46, 0.014), (47, 0.03), (48, -0.005), (49, -0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95393252 146 nips-2006-No-regret Algorithms for Online Convex Programs

Author: Geoffrey J. Gordon

Abstract: Online convex programming has recently emerged as a powerful primitive for designing machine learning algorithms. For example, OCP can be used for learning a linear classifier, dynamically rebalancing a binary search tree, finding the shortest path in a graph with unknown edge lengths, solving a structured classification problem, or finding a good strategy in an extensive-form game. Several researchers have designed no-regret algorithms for OCP. But, compared to algorithms for special cases of OCP such as learning from expert advice, these algorithms are not very numerous or flexible. In learning from expert advice, one tool which has proved particularly valuable is the correspondence between no-regret algorithms and convex potential functions: by reasoning about these potential functions, researchers have designed algorithms with a wide variety of useful guarantees such as good performance when the target hypothesis is sparse. Until now, there has been no such recipe for the more general OCP problem, and therefore no ability to tune OCP algorithms to take advantage of properties of the problem or data. In this paper we derive a new class of no-regret learning algorithms for OCP. These Lagrangian Hedging algorithms are based on a general class of potential functions, and are a direct generalization of known learning rules like weighted majority and external-regret matching. In addition to proving regret bounds, we demonstrate our algorithms learning to play one-card poker. 1

2 0.70226711 61 nips-2006-Convex Repeated Games and Fenchel Duality

Author: Shai Shalev-shwartz, Yoram Singer

Abstract: We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can be all derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization. 1 Introduction and Problem Setting Several problems arising in machine learning can be modeled as a convex repeated game. Convex repeated games are closely related to online convex programming (see [19, 9] and the discussion in the last section). A convex repeated game is a two players game that is performed in a sequence of consecutive rounds. On round t of the repeated game, the first player chooses a vector wt from a convex set S. Next, the second player responds with a convex function gt : S → R. Finally, the first player suffers an instantaneous loss gt (wt ). We study the game from the viewpoint of the first player. The goal of the first player is to minimize its cumulative loss, t gt (wt ). To motivate this rather abstract setting let us first cast the more familiar setting of online learning as a convex repeated game. Online learning is performed in a sequence of consecutive rounds. On round t, the learner first receives a question, cast as a vector xt , and is required to provide an answer for this question. For example, xt can be an encoding of an email message and the question is whether the email is spam or not. The prediction of the learner is performed based on an hypothesis, ht : X → Y, where X is the set of questions and Y is the set of possible answers. In the aforementioned example, Y would be {+1, −1} where +1 stands for a spam email and −1 stands for a benign one. After predicting an answer, the learner receives the correct answer for the question, denoted yt , and suffers loss according to a loss function (ht , (xt , yt )). In most cases, the hypotheses used for prediction come from a parameterized set of hypotheses, H = {hw : w ∈ S}. For example, the set of linear classifiers, which is used for answering yes/no questions, is defined as H = {hw (x) = sign( w, x ) : w ∈ Rn }. Thus, rather than saying that on round t the learner chooses a hypothesis, we can say that the learner chooses a vector wt and its hypothesis is hwt . Next, we note that once the environment chooses a question-answer pair (xt , yt ), the loss function becomes a function over the hypotheses space or equivalently over the set of parameter vectors S. We can therefore redefine the online learning process as follows. On round t, the learner chooses a vector wt ∈ S, which defines a hypothesis hwt to be used for prediction. Then, the environment chooses a questionanswer pair (xt , yt ), which induces the following loss function over the set of parameter vectors, gt (w) = (hw , (xt , yt )). Finally, the learner suffers the loss gt (wt ) = (hwt , (xt , yt )). We have therefore described the process of online learning as a convex repeated game. In this paper we assess the performance of the first player using the notion of regret. Given a number of rounds T and a fixed vector u ∈ S, we define the regret of the first player as the excess loss for not consistently playing the vector u, 1 T T gt (wt ) − t=1 1 T T gt (u) . 
t=1 Our main result is an algorithmic framework for the first player which guarantees low regret with respect to any vector u ∈ S. Specifically, we derive regret bounds that take the following form ∀u ∈ S, 1 T T gt (wt ) − t=1 1 T T gt (u) ≤ t=1 f (u) + L √ , T (1) where f : S → R and L ∈ R+ . Informally, the function f measures the “complexity” of vectors in S and the scalar L is related to some generalized Lipschitz property of the functions g1 , . . . , gT . We defer the exact requirements we impose on f and L to later sections. Our algorithmic framework emerges from a representation of the regret bound given in Eq. (1) using an optimization problem. Specifically, we rewrite Eq. (1) as follows 1 T T gt (wt ) ≤ inf t=1 u∈S 1 T T gt (u) + t=1 f (u) + L √ . T (2) That is, the average loss of the first player should be bounded above by the minimum value of an optimization problem in which we jointly minimize the average loss of u and the “complexity” of u as measured by the function f . Note that the optimization problem on the right-hand side of Eq. (2) can only be solved in hindsight after observing the entire sequence of loss functions. Nevertheless, writing the regret bound as in Eq. (2) implies that the average loss of the first player forms a lower bound for a minimization problem. The notion of duality, commonly used in convex optimization theory, plays an important role in obtaining lower bounds for the minimal value of a minimization problem (see for example [14]). By generalizing the notion of Fenchel duality, we are able to derive a dual optimization problem, which can be optimized incrementally, as the game progresses. In order to derive explicit quantitative regret bounds we make an immediate use of the fact that dual objective lower bounds the primal objective. We therefore reduce the process of playing convex repeated games to the task of incrementally increasing the dual objective function. The amount by which the dual increases serves as a new and natural notion of progress. By doing so we are able to tie the primal objective value, the average loss of the first player, and the increase in the dual. The rest of this paper is organized as follows. In Sec. 2 we establish our notation and point to a few mathematical tools that we use throughout the paper. Our main tool for deriving algorithms for playing convex repeated games is a generalization of Fenchel duality, described in Sec. 3. Our algorithmic framework is given in Sec. 4 and analyzed in Sec. 5. The generality of our framework allows us to utilize it in different problems arising in machine learning. Specifically, in Sec. 6 we underscore the applicability of our framework for online learning and in Sec. 7 we outline and analyze boosting algorithms based on our framework. We conclude with a discussion and point to related work in Sec. 8. Due to the lack of space, some of the details are omitted from the paper and can be found in [16]. 2 Mathematical Background We denote scalars with lower case letters (e.g. x and w), and vectors with bold face letters (e.g. x and w). The inner product between vectors x and w is denoted by x, w . Sets are designated by upper case letters (e.g. S). The set of non-negative real numbers is denoted by R+ . For any k ≥ 1, the set of integers {1, . . . , k} is denoted by [k]. A norm of a vector x is denoted by x . The dual norm is defined as λ = sup{ x, λ : x ≤ 1}. 
For example, the Euclidean norm, x 2 = ( x, x )1/2 is dual to itself and the 1 norm, x 1 = i |xi |, is dual to the ∞ norm, x ∞ = maxi |xi |. We next recall a few definitions from convex analysis. The reader familiar with convex analysis may proceed to Lemma 1 while for a more thorough introduction see for example [1]. A set S is convex if for any two vectors w1 , w2 in S, all the line between w1 and w2 is also within S. That is, for any α ∈ [0, 1] we have that αw1 + (1 − α)w2 ∈ S. A set S is open if every point in S has a neighborhood lying in S. A set S is closed if its complement is an open set. A function f : S → R is closed and convex if for any scalar α ∈ R, the level set {w : f (w) ≤ α} is closed and convex. The Fenchel conjugate of a function f : S → R is defined as f (θ) = supw∈S w, θ − f (w) . If f is closed and convex then the Fenchel conjugate of f is f itself. The Fenchel-Young inequality states that for any w and θ we have that f (w) + f (θ) ≥ w, θ . A vector λ is a sub-gradient of a function f at w if for all w ∈ S we have that f (w ) − f (w) ≥ w − w, λ . The differential set of f at w, denoted ∂f (w), is the set of all sub-gradients of f at w. If f is differentiable at w then ∂f (w) consists of a single vector which amounts to the gradient of f at w and is denoted by f (w). Sub-gradients play an important role in the definition of Fenchel conjugate. In particular, the following lemma states that if λ ∈ ∂f (w) then Fenchel-Young inequality holds with equality. Lemma 1 Let f be a closed and convex function and let ∂f (w ) be its differential set at w . Then, for all λ ∈ ∂f (w ) we have, f (w ) + f (λ ) = λ , w . A continuous function f is σ-strongly convex over a convex set S with respect to a norm · if S is contained in the domain of f and for all v, u ∈ S and α ∈ [0, 1] we have 1 (3) f (α v + (1 − α) u) ≤ α f (v) + (1 − α) f (u) − σ α (1 − α) v − u 2 . 2 Strongly convex functions play an important role in our analysis primarily due to the following lemma. Lemma 2 Let · be a norm over Rn and let · be its dual norm. Let f be a σ-strongly convex function on S and let f be its Fenchel conjugate. Then, f is differentiable with f (θ) = arg maxx∈S θ, x − f (x). Furthermore, for any θ, λ ∈ Rn we have 1 f (θ + λ) − f (θ) ≤ f (θ), λ + λ 2 . 2σ Two notable examples of strongly convex functions which we use are as follows. 1 Example 1 The function f (w) = 2 w norm. Its conjugate function is f (θ) = 2 2 1 2 is 1-strongly convex over S = Rn with respect to the θ 2. 2 2 n 1 Example 2 The function f (w) = i=1 wi log(wi / n ) is 1-strongly convex over the probabilistic n simplex, S = {w ∈ R+ : w 1 = 1}, with respect to the 1 norm. Its conjugate function is n 1 f (θ) = log( n i=1 exp(θi )). 3 Generalized Fenchel Duality In this section we derive our main analysis tool. We start by considering the following optimization problem, T inf c f (w) + t=1 gt (w) , w∈S where c is a non-negative scalar. An equivalent problem is inf w0 ,w1 ,...,wT c f (w0 ) + T t=1 gt (wt ) s.t. w0 ∈ S and ∀t ∈ [T ], wt = w0 . Introducing T vectors λ1 , . . . , λT , each λt ∈ Rn is a vector of Lagrange multipliers for the equality constraint wt = w0 , we obtain the following Lagrangian T T L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) = c f (w0 ) + t=1 gt (wt ) + t=1 λt , w0 − wt . The dual problem is the task of maximizing the following dual objective value, D(λ1 , . . . , λT ) = inf L(w0 , w1 , . . . , wT , λ1 , . . . 
, λT ) w0 ∈S,w1 ,...,wT = − c sup w0 ∈S = −c f −1 c w0 , − 1 c T t=1 T t=1 λt − λt − f (w0 ) − T t=1 gt (λt ) , T t=1 sup ( wt , λt − gt (wt )) wt where, following the exposition of Sec. 2, f , g1 , . . . , gT are the Fenchel conjugate functions of f, g1 , . . . , gT . Therefore, the generalized Fenchel dual problem is sup − cf λ1 ,...,λT −1 c T t=1 λt − T t=1 gt (λt ) . (4) Note that when T = 1 and c = 1, the above duality is the so called Fenchel duality. 4 A Template Learning Algorithm for Convex Repeated Games In this section we describe a template learning algorithm for playing convex repeated games. As mentioned before, we study convex repeated games from the viewpoint of the first player which we shortly denote as P1. Recall that we would like our learning algorithm to achieve a regret bound of the form given in Eq. (2). We start by rewriting Eq. (2) as follows T m gt (wt ) − c L ≤ inf u∈S t=1 c f (u) + gt (u) , (5) t=1 √ where c = T . Thus, up to the sublinear term c L, the cumulative loss of P1 lower bounds the optimum of the minimization problem on the right-hand side of Eq. (5). In the previous section we derived the generalized Fenchel dual of the right-hand side of Eq. (5). Our construction is based on the weak duality theorem stating that any value of the dual problem is smaller than the optimum value of the primal problem. The algorithmic framework we propose is therefore derived by incrementally ascending the dual objective function. Intuitively, by ascending the dual objective we move closer to the optimal primal value and therefore our performance becomes similar to the performance of the best fixed weight vector which minimizes the right-hand side of Eq. (5). Initially, we use the elementary dual solution λ1 = 0 for all t. We assume that inf w f (w) = 0 and t for all t inf w gt (w) = 0 which imply that D(λ1 , . . . , λ1 ) = 0. We assume in addition that f is 1 T σ-strongly convex. Therefore, based on Lemma 2, the function f is differentiable. At trial t, P1 uses for prediction the vector wt = f −1 c T i=1 λt i . (6) After predicting wt , P1 receives the function gt and suffers the loss gt (wt ). Then, P1 updates the dual variables as follows. Denote by ∂t the differential set of gt at wt , that is, ∂t = {λ : ∀w ∈ S, gt (w) − gt (wt ) ≥ λ, w − wt } . (7) The new dual variables (λt+1 , . . . , λt+1 ) are set to be any set of vectors which satisfy the following 1 T two conditions: (i). ∃λ ∈ ∂t s.t. D(λt+1 , . . . , λt+1 ) ≥ D(λt , . . . , λt , λ , λt , . . . , λt ) 1 1 t−1 t+1 T T (ii). ∀i > t, λt+1 = 0 i . (8) In the next section we show that condition (i) ensures that the increase of the dual at trial t is proportional to the loss gt (wt ). The second condition ensures that we can actually calculate the dual at trial t without any knowledge on the yet to be seen loss functions gt+1 , . . . , gT . We conclude this section with two update rules that trivially satisfy the above two conditions. The first update scheme simply finds λ ∈ ∂t and set λt+1 = i λ λt i if i = t if i = t . (9) The second update defines (λt+1 , . . . , λt+1 ) = argmax D(λ1 , . . . , λT ) 1 T λ1 ,...,λT s.t. ∀i = t, λi = λt . i (10) 5 Analysis In this section we analyze the performance of the template algorithm given in the previous section. Our proof technique is based on monitoring the value of the dual objective function. The main result is the following lemma which gives upper and lower bounds for the final value of the dual objective function. 
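Before the lemma below, it may help to make the template of Sec. 4 concrete with a minimal runnable sketch. The sketch assumes f(w) = (1/2)||w||^2 over S = R^n (Example 1), so that grad f*(theta) = theta and the simple update of Eq. (9) turns the game into a form of online gradient descent with step size 1/c; the function and variable names are illustrative and not from the paper. With hinge losses, for instance, the stored subgradients make this a Perceptron-style additive update.

```python
import numpy as np

# Minimal sketch of the dual-incrementing template (Sec. 4), assuming
# f(w) = 0.5 * ||w||^2 over S = R^n, so that grad f*(theta) = theta.
# With the simple update of Eq. (9) (store one subgradient per round),
# the predictions reduce to online gradient descent with step 1/c.

def play_convex_game(loss_and_subgrad, T, dim, c):
    """loss_and_subgrad(t, w) -> (g_t(w), a subgradient of g_t at w)."""
    dual_sum = np.zeros(dim)      # sum_i lambda_i; lambda_i = 0 for future rounds
    losses = []
    for t in range(T):
        w_t = -dual_sum / c       # w_t = grad f*(-(1/c) sum_i lambda_i), Eq. (6)
        loss, subgrad = loss_and_subgrad(t, w_t)
        losses.append(loss)
        dual_sum += subgrad       # lambda_t <- a subgradient of g_t at w_t, Eq. (9)
    return losses
```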
Lemma 3 Let f be a σ-strongly convex function with respect to a norm · over a set S and assume that minw∈S f (w) = 0. Let g1 , . . . , gT be a sequence of convex and closed functions such that inf w gt (w) = 0 for all t ∈ [T ]. Suppose that a dual-incrementing algorithm which satisfies the conditions of Eq. (8) is run with f as a complexity function on the sequence g1 , . . . , gT . Let w1 , . . . , wT be the sequence of primal vectors that the algorithm generates and λT +1 , . . . , λT +1 1 T be its final sequence of dual variables. Then, there exists a sequence of sub-gradients λ1 , . . . , λT , where λt ∈ ∂t for all t, such that T 1 gt (wt ) − 2σc t=1 T T λt 2 ≤ D(λT +1 , . . . , λT +1 ) 1 T t=1 ≤ inf c f (w) + w∈S gt (w) . t=1 Proof The second inequality follows directly from the weak duality theorem. Turning to the left most inequality, denote ∆t = D(λt+1 , . . . , λt+1 ) − D(λt , . . . , λt ) and note that 1 1 T T T D(λ1 +1 , . . . , λT +1 ) can be rewritten as T T t=1 D(λT +1 , . . . , λT +1 ) = 1 T T t=1 ∆t − D(λ1 , . . . , λ1 ) = 1 T ∆t , (11) where the last equality follows from the fact that f (0) = g1 (0) = . . . = gT (0) = 0. The definition of the update implies that ∆t ≥ D(λt , . . . , λt , λt , 0, . . . , 0) − D(λt , . . . , λt , 0, 0, . . . , 0) for 1 t−1 1 t−1 t−1 some subgradient λt ∈ ∂t . Denoting θ t = − 1 j=1 λj , we now rewrite the lower bound on ∆t as, c ∆t ≥ −c (f (θ t − λt /c) − f (θ t )) − gt (λt ) . Using Lemma 2 and the definition of wt we get that 1 (12) ∆t ≥ wt , λt − gt (λt ) − 2 σ c λt 2 . Since λt ∈ ∂t and since we assume that gt is closed and convex, we can apply Lemma 1 to get that wt , λt − gt (λt ) = gt (wt ). Plugging this equality into Eq. (12) and summing over t we obtain that T T T 1 2 . t=1 ∆t ≥ t=1 gt (wt ) − 2 σ c t=1 λt Combining the above inequality with Eq. (11) concludes our proof. The following regret bound follows as a direct corollary of Lemma 3. T 1 Theorem 1 Under the same conditions of Lemma 3. Denote L = T t=1 λt w ∈ S we have, T T c f (w) 1 1 + 2L c . t=1 gt (wt ) − T t=1 gt (w) ≤ T T σ √ In particular, if c = T , we obtain the bound, 1 T 6 T t=1 gt (wt ) − 1 T T t=1 gt (w) ≤ f (w)+L/(2 σ) √ T 2 . Then, for all . Application to Online learning In Sec. 1 we cast the task of online learning as a convex repeated game. We now demonstrate the applicability of our algorithmic framework for the problem of instance ranking. We analyze this setting since several prediction problems, including binary classification, multiclass prediction, multilabel prediction, and label ranking, can be cast as special cases of the instance ranking problem. Recall that on each online round, the learner receives a question-answer pair. In instance ranking, the question is encoded by a matrix Xt of dimension kt × n and the answer is a vector yt ∈ Rkt . The semantic of yt is as follows. For any pair (i, j), if yt,i > yt,j then we say that yt ranks the i’th row of Xt ahead of the j’th row of Xt . We also interpret yt,i − yt,j as the confidence in which the i’th row should be ranked ahead of the j’th row. For example, each row of Xt encompasses a representation of a movie while yt,i is the movie’s rating, expressed as the number of stars this movie has received by a movie reviewer. The predictions of the learner are determined ˆ based on a weight vector wt ∈ Rn and are defined to be yt = Xt wt . Finally, let us define two loss functions for ranking, both generalize the hinge-loss used in binary classification problems. Denote by Et the set {(i, j) : yt,i > yt,j }. 
For all (i, j) ∈ Et we define a pair-based hinge-loss i,j (w; (Xt , yt )) = [(yt,i − yt,j ) − w, xt,i − xt,j ]+ , where [a]+ = max{a, 0} and xt,i , xt,j are respectively the i’th and j’th rows of Xt . Note that i,j is zero if w ranks xt,i higher than xt,j with a sufficient confidence. Ideally, we would like i,j (wt ; (Xt , yt )) to be zero for all (i, j) ∈ Et . If this is not the case, we are being penalized according to some combination of the pair-based losses i,j . For example, we can set (w; (Xt , yt )) to be the average over the pair losses, 1 avg (w; (Xt , yt )) = |Et | (i,j)∈Et i,j (w; (Xt , yt )) . This loss was suggested by several authors (see for example [18]). Another popular approach (see for example [5]) penalizes according to the maximal loss over the individual pairs, max (w; (Xt , yt )) = max(i,j)∈Et i,j (w; (Xt , yt )) . We can apply our algorithmic framework given in Sec. 4 for ranking, using for gt (w) either avg (w; (Xt , yt )) or max (w; (Xt , yt )). The following theorem provides us with a sufficient condition under which the regret bound from Thm. 1 holds for ranking as well. Theorem 2 Let f be a σ-strongly convex function over S with respect to a norm · . Denote by Lt the maximum over (i, j) ∈ Et of xt,i − xt,j 2 . Then, for both gt (w) = avg (w; (Xt , yt )) and ∗ gt (w) = max (w; (Xt , yt )), the following regret bound holds ∀u ∈ S, 7 1 T T t=1 gt (wt ) − 1 T T t=1 gt (u) ≤ 1 f (u)+ T PT t=1 Lt /(2 σ) √ T . The Boosting Game In this section we describe the applicability of our algorithmic framework to the analysis of boosting algorithms. A boosting algorithm uses a weak learning algorithm that generates weak-hypotheses whose performances are just slightly better than random guessing to build a strong-hypothesis which can attain an arbitrarily low error. The AdaBoost algorithm, proposed by Freund and Schapire [6], receives as input a training set of examples {(x1 , y1 ), . . . , (xm , ym )} where for all i ∈ [m], xi is taken from an instance domain X , and yi is a binary label, yi ∈ {+1, −1}. The boosting process proceeds in a sequence of consecutive trials. At trial t, the booster first defines a distribution, denoted wt , over the set of examples. Then, the booster passes the training set along with the distribution wt to the weak learner. The weak learner is assumed to return a hypothesis ht : X → {+1, −1} whose average error is slightly smaller than 1 . That is, there exists a constant γ > 0 such that, 2 def m 1−yi ht (xi ) = ≤ 1 −γ . (13) i=1 wt,i 2 2 The goal of the boosting algorithm is to invoke the weak learner several times with different distributions, and to combine the hypotheses returned by the weak learner into a final, so called strong, hypothesis whose error is small. The final hypothesis combines linearly the T hypotheses returned by the weak learner with coefficients α1 , . . . , αT , and is defined to be the sign of hf (x) where T hf (x) = t=1 αt ht (x) . The coefficients α1 , . . . , αT are determined by the booster. In Ad1 1 aBoost, the initial distribution is set to be the uniform distribution, w1 = ( m , . . . , m ). At iter1 ation t, the value of αt is set to be 2 log((1 − t )/ t ). The distribution is updated by the rule wt+1,i = wt,i exp(−αt yi ht (xi ))/Zt , where Zt is a normalization factor. Freund and Schapire [6] have shown that under the assumption given in Eq. (13), the error of the final strong hypothesis is at most exp(−2 γ 2 T ). 
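The AdaBoost procedure just described (uniform initial distribution, alpha_t = 0.5 log((1 - eps_t)/eps_t), multiplicative reweighting with normalization Z_t) can be sketched in a few lines; the weak_learner interface below is an assumed, illustrative one, and the sketch takes 0 < eps_t < 1/2 for granted.

```python
import numpy as np

# Sketch of the AdaBoost loop described above. `weak_learner(X, y, w)` is an
# assumed interface returning a hypothesis h with h(x) in {+1, -1}.

def adaboost(X, y, weak_learner, T):
    y = np.asarray(y, dtype=float)
    m = len(y)
    w = np.full(m, 1.0 / m)                   # uniform initial distribution w_1
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, w)
        preds = np.array([h(x) for x in X])
        eps = float(w[preds != y].sum())      # weighted error eps_t (assumed in (0, 1/2))
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        w = w * np.exp(-alpha * y * preds)    # w_{t+1,i} proportional to w_{t,i} exp(-alpha_t y_i h_t(x_i))
        w = w / w.sum()                       # normalization factor Z_t
        hypotheses.append(h)
        alphas.append(alpha)
    return lambda x: np.sign(sum(a * h(x) for a, h in zip(alphas, hypotheses)))
```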
t Several authors [15, 13, 8, 4] have proposed to view boosting as a coordinate-wise greedy optimization process. To do so, note first that hf errs on an example (x, y) iff y hf (x) ≤ 0. Therefore, the exp-loss function, defined as exp(−y hf (x)), is a smooth upper bound of the zero-one error, which equals to 1 if y hf (x) ≤ 0 and to 0 otherwise. Thus, we can restate the goal of boosting as minimizing the average exp-loss of hf over the training set with respect to the variables α1 , . . . , αT . To simplify our derivation in the sequel, we prefer to say that boosting maximizes the negation of the loss, that is, T m 1 (14) max − m i=1 exp −yi t=1 αt ht (xi ) . α1 ,...,αT In this view, boosting is an optimization procedure which iteratively maximizes Eq. (14) with respect to the variables α1 , . . . , αT . This view of boosting, enables the hypotheses returned by the weak learner to be general functions into the reals, ht : X → R (see for instance [15]). In this paper we view boosting as a convex repeated game between a booster and a weak learner. To motivate our construction, we would like to note that boosting algorithms define weights in two different domains: the vectors wt ∈ Rm which assign weights to examples and the weights {αt : t ∈ [T ]} over weak-hypotheses. In the terminology used throughout this paper, the weights wt ∈ Rm are primal vectors while (as we show in the sequel) each weight αt of the hypothesis ht is related to a dual vector λt . In particular, we show that Eq. (14) is exactly the Fenchel dual of a primal problem for a convex repeated game, thus the algorithmic framework described thus far for playing games naturally fits the problem of iteratively solving Eq. (14). To derive the primal problem whose Fenchel dual is the problem given in Eq. (14) let us first denote by vt the vector in Rm whose ith element is vt,i = yi ht (xi ). For all t, we set gt to be the function gt (w) = [ w, vt ]+ . Intuitively, gt penalizes vectors w which assign large weights to examples which are predicted accurately, that is yi ht (xi ) > 0. In particular, if ht (xi ) ∈ {+1, −1} and wt is a distribution over the m examples (as is the case in AdaBoost), gt (wt ) reduces to 1 − 2 t (see Eq. (13)). In this case, minimizing gt is equivalent to maximizing the error of the individual T hypothesis ht over the examples. Consider the problem of minimizing c f (w) + t=1 gt (w) where f (w) is the relative entropy given in Example 2 and c = 1/(2 γ) (see Eq. (13)). To derive its Fenchel dual, we note that gt (λt ) = 0 if there exists βt ∈ [0, 1] such that λt = βt vt and otherwise gt (λt ) = ∞ (see [16]). In addition, let us define αt = 2 γ βt . Since our goal is to maximize the αt dual, we can restrict λt to take the form λt = βt vt = 2 γ vt , and get that D(λ1 , . . . , λT ) = −c f − 1 c T βt vt t=1 =− 1 log 2γ 1 m m e− PT t=1 αt yi ht (xi ) . (15) i=1 Minimizing the exp-loss of the strong hypothesis is therefore the dual problem of the following primal minimization problem: find a distribution over the examples, whose relative entropy to the uniform distribution is as small as possible while the correlation of the distribution with each vt is as small as possible. Since the correlation of w with vt is inversely proportional to the error of ht with respect to w, we obtain that in the primal problem we are trying to maximize the error of each individual hypothesis, while in the dual problem we minimize the global error of the strong hypothesis. 
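As the next paragraph makes explicit, with the entropic f of Example 2 the booster's primal vector w_t = grad f*(theta_t), theta_t = -sum_i alpha_i v_i, is a softmax of theta_t, so examples that the hypotheses accumulated so far classify badly receive exponentially larger weight. A small sketch of this weight construction, with illustrative names:

```python
import numpy as np

# Sketch of the booster's primal weight construction in the boosting game:
# with the relative-entropy f of Example 2, grad f*(theta) is a softmax, so
# examples with small (or negative) margin sum_i alpha_i * y_j * h_i(x_j)
# receive larger weight.

def booster_weights(alphas, vs, m):
    """vs[i][j] = y_j * h_i(x_j); returns the distribution w_t over m examples."""
    theta = np.zeros(m)
    for alpha, v in zip(alphas, vs):
        theta -= alpha * np.asarray(v, dtype=float)   # theta_t = -sum_i alpha_i v_i
    w = np.exp(theta - theta.max())                   # softmax, shifted for numerical stability
    return w / w.sum()
```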
The intuition of finding distributions which in retrospect result in large error rates of individual hypotheses was also alluded in [15, 8]. We can now apply our algorithmic framework from Sec. 4 to boosting. We describe the game αt with the parameters αt , where αt ∈ [0, 2 γ], and underscore that in our case, λt = 2 γ vt . At the beginning of the game the booster sets all dual variables to be zero, ∀t αt = 0. At trial t of the boosting game, the booster first constructs a primal weight vector wt ∈ Rm , which assigns importance weights to the examples in the training set. The primal vector wt is constructed as in Eq. (6), that is, wt = f (θ t ), where θ t = − i αi vi . Then, the weak learner responds by presenting the loss function gt (w) = [ w, vt ]+ . Finally, the booster updates the dual variables so as to increase the dual objective function. It is possible to show that if the range of ht is {+1, −1} 1 then the update given in Eq. (10) is equivalent to the update αt = min{2 γ, 2 log((1 − t )/ t )}. We have thus obtained a variant of AdaBoost in which the weights αt are capped above by 2 γ. A disadvantage of this variant is that we need to know the parameter γ. We would like to note in passing that this limitation can be lifted by a different definition of the functions gt . We omit the details due to the lack of space. To analyze our game of boosting, we note that the conditions given in Lemma 3 holds T and therefore the left-hand side inequality given in Lemma 3 tells us that t=1 gt (wt ) − T T +1 T +1 1 2 , . . . , λT ) . The definition of gt and the weak learnability ast=1 λt ∞ ≤ D(λ1 2c sumption given in Eq. (13) imply that wt , vt ≥ 2 γ for all t. Thus, gt (wt ) = wt , vt ≥ 2 γ which also implies that λt = vt . Recall that vt,i = yi ht (xi ). Assuming that the range of ht is [+1, −1] we get that λt ∞ ≤ 1. Combining all the above with the left-hand side inequality T given in Lemma 3 we get that 2 T γ − 2 c ≤ D(λT +1 , . . . , λT +1 ). Using the definition of D (see 1 T Eq. (15)), the value c = 1/(2 γ), and rearranging terms we recover the original bound for AdaBoost PT 2 m 1 −yi t=1 αt ht (xi ) ≤ e−2 γ T . i=1 e m 8 Related Work and Discussion We presented a new framework for designing and analyzing algorithms for playing convex repeated games. Our framework was used for the analysis of known algorithms for both online learning and boosting settings. The framework also paves the way to new algorithms. In a previous paper [17], we suggested the use of duality for the design of online algorithms in the context of mistake bound analysis. The contribution of this paper over [17] is three fold as we now briefly discuss. First, we generalize the applicability of the framework beyond the specific setting of online learning with the hinge-loss to the general setting of convex repeated games. The setting of convex repeated games was formally termed “online convex programming” by Zinkevich [19] and was first presented by Gordon in [9]. There is voluminous amount of work on unifying approaches for deriving online learning algorithms. We refer the reader to [11, 12, 3] for work closely related to the content of this paper. By generalizing our previously studied algorithmic framework [17] beyond online learning, we can automatically utilize well known online learning algorithms, such as the EG and p-norm algorithms [12, 11], to the setting of online convex programming. 
We would like to note that the algorithms presented in [19] can be derived as special cases of our algorithmic framework 1 by setting f (w) = 2 w 2 . Parallel and independently to this work, Gordon [10] described another algorithmic framework for online convex programming that is closely related to the potential based algorithms described by Cesa-Bianchi and Lugosi [3]. Gordon also considered the problem of defining appropriate potential functions. Our work generalizes some of the theorems in [10] while providing a somewhat simpler analysis. Second, the usage of generalized Fenchel duality rather than the Lagrange duality given in [17] enables us to analyze boosting algorithms based on the framework. Many authors derived unifying frameworks for boosting algorithms [13, 8, 4]. Nonetheless, our general framework and the connection between game playing and Fenchel duality underscores an interesting perspective of both online learning and boosting. We believe that this viewpoint has the potential of yielding new algorithms in both domains. Last, despite the generality of the framework introduced in this paper, the resulting analysis is more distilled than the earlier analysis given in [17] for two reasons. (i) The usage of Lagrange duality in [17] is somehow restricted while the notion of generalized Fenchel duality is more appropriate to the general and broader problems we consider in this paper. (ii) The strongly convex property we employ both simplifies the analysis and enables more intuitive conditions in our theorems. There are various possible extensions of the work that we did not pursue here due to the lack of space. For instanc, our framework can naturally be used for the analysis of other settings such as repeated games (see [7, 19]). The applicability of our framework to online learning can also be extended to other prediction problems such as regression and sequence prediction. Last, we conjecture that our primal-dual view of boosting will lead to new methods for regularizing boosting algorithms, thus improving their generalization capabilities. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. M. Collins, R.E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 2002. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. JMLR, 7, Mar 2006. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995. Y. Freund and R.E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996. J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 2000. G. Gordon. Regret bounds for prediction problems. In COLT, 1999. G. Gordon. No-regret algorithms for online convex programs. In NIPS, 2006. A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 2001. J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Journal of Machine Learning, 45(3),2001. L. Mason, J. Baxter, P. Bartlett, and M. Frean. 
Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999. Y. Nesterov. Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2005. R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1–40, 1999. S. Shalev-Shwartz and Y. Singer. Convex repeated games and fenchel duality. Technical report, The Hebrew University, 2006. S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In COLT, 2006. J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In ESANN, April 1999. M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.

3 0.64581978 10 nips-2006-A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems

Author: David Barber, Bertrand Mesot

Abstract: We introduce a method for approximate smoothed inference in a class of switching linear dynamical systems, based on a novel form of Gaussian Sum smoother. This class includes the switching Kalman Filter and the more general case of switch transitions dependent on the continuous latent state. The method improves on the standard Kim smoothing approach by dispensing with one of the key approximations, thus making fuller use of the available future information. Whilst the only central assumption required is projection to a mixture of Gaussians, we show that an additional conditional independence assumption results in a simpler but stable and accurate alternative. Unlike the alternative unstable Expectation Propagation procedure, our method consists only of a single forward and backward pass and is reminiscent of the standard smoothing ‘correction’ recursions in the simpler linear dynamical system. The algorithm performs well on both toy experiments and in a large scale application to noise robust speech recognition. 1 Switching Linear Dynamical System The Linear Dynamical System (LDS) [1] is a key temporal model in which a latent linear process generates the observed series. For complex time-series which are not well described globally by a single LDS, we may break the time-series into segments, each modeled by a potentially different LDS. This is the basis for the Switching LDS (SLDS) [2, 3, 4, 5] where, for each time t, a switch variable st ∈ 1, . . . , S describes which of the LDSs is to be used. The observation (or ‘visible’) vt ∈ RV is linearly related to the hidden state ht ∈ RH with additive noise η by vt = B(st )ht + η v (st ) p(vt |ht , st ) = N (B(st )ht , Σv (st )) ≡ (1) where N (µ, Σ) denotes a Gaussian distribution with mean µ and covariance Σ. The transition dynamics of the continuous hidden state ht is linear, ht = A(st )ht−1 + η h (st ), ≡ p(ht |ht−1 , st ) = N A(st )ht−1 , Σh (st ) (2) The switch st may depend on both the previous st−1 and ht−1 . This is an augmented SLDS (aSLDS), and defines the model T p(vt |ht , st )p(ht |ht−1 , st )p(st |ht−1 , st−1 ) p(v1:T , h1:T , s1:T ) = t=1 The standard SLDS[4] considers only switch transitions p(st |st−1 ). At time t = 1, p(s1 |h0 , s0 ) simply denotes the prior p(s1 ), and p(h1 |h0 , s1 ) denotes p(h1 |s1 ). The aim of this article is to address how to perform inference in the aSLDS. In particular we desire the filtered estimate p(ht , st |v1:t ) and the smoothed estimate p(ht , st |v1:T ), for any 1 ≤ t ≤ T . Both filtered and smoothed inference in the SLDS is intractable, scaling exponentially with time [4]. s1 s2 s3 s4 h1 h2 h3 h4 v1 v2 v3 v4 Figure 1: The independence structure of the aSLDS. Square nodes denote discrete variables, round nodes continuous variables. In the SLDS links from h to s are not normally considered. 2 Expectation Correction Our approach to approximate p(ht , st |v1:T ) mirrors the Rauch-Tung-Striebel ‘correction’ smoother for the simpler LDS [1].The method consists of a single forward pass to recursively find the filtered posterior p(ht , st |v1:t ), followed by a single backward pass to correct this into a smoothed posterior p(ht , st |v1:T ). The forward pass we use is equivalent to standard Assumed Density Filtering (ADF) [6]. The main contribution of this paper is a novel form of backward pass, based only on collapsing the smoothed posterior to a mixture of Gaussians. 
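For concreteness, here is a short sketch of sampling from the generative model of Eqs. (1)-(2), restricted to the standard SLDS case where the switch depends only on s_{t-1}. The parameter containers, the uniform prior over s_1, and all names are illustrative assumptions.

```python
import numpy as np

# Sketch: sampling from a switching linear dynamical system (Eqs. 1-2),
# restricted to switch transitions p(s_t | s_{t-1}). A, B, Sigma_h, Sigma_v
# are lists indexed by the switch state; trans[s] is the row p(. | s).

def sample_slds(A, B, Sigma_h, Sigma_v, trans, T, seed=0):
    rng = np.random.default_rng(seed)
    H = A[0].shape[0]
    s = rng.integers(len(A))                                  # s_1 ~ uniform prior (assumed)
    h = rng.multivariate_normal(np.zeros(H), Sigma_h[s])      # h_1 ~ p(h_1 | s_1)
    switches, observations = [], []
    for t in range(T):
        if t > 0:
            s = rng.choice(len(A), p=trans[s])                # s_t ~ p(. | s_{t-1})
            h = A[s] @ h + rng.multivariate_normal(np.zeros(H), Sigma_h[s])
        V = B[s].shape[0]
        v = B[s] @ h + rng.multivariate_normal(np.zeros(V), Sigma_v[s])
        switches.append(s)
        observations.append(v)
    return np.array(switches), np.array(observations)
```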
Together with the ADF forward pass, we call the method Expectation Correction, since it corrects the moments found from the forward pass. A more detailed description of the method, including pseudocode, is given in [7]. 2.1 Forward Pass (Filtering) Readers familiar with ADF may wish to continue directly to Section (2.2). Our aim is to form a recursion for p(st , ht |v1:t ), based on a Gaussian mixture approximation of p(ht |st , v1:t ). Without loss of generality, we may decompose the filtered posterior as p(ht , st |v1:t ) = p(ht |st , v1:t )p(st |v1:t ) (3) The exact representation of p(ht |st , v1:t ) is a mixture with O(S t ) components. We therefore approximate this with a smaller I-component mixture I p(ht |st , v1:t ) ≈ p(ht |it , st , v1:t )p(it |st , v1:t ) it =1 where p(ht |it , st , v1:t ) is a Gaussian parameterized with mean f (it , st ) and covariance F (it , st ). To find a recursion for these parameters, consider p(ht+1 |st+1 , v1:t+1 ) = p(ht+1 |st , it , st+1 , v1:t+1 )p(st , it |st+1 , v1:t+1 ) (4) st ,it Evaluating p(ht+1 |st , it , st+1 , v1:t+1 ) We find p(ht+1 |st , it , st+1 , v1:t+1 ) by first computing the joint distribution p(ht+1 , vt+1 |st , it , st+1 , v1:t ), which is a Gaussian with covariance and mean elements, Σhh = A(st+1 )F (it , st )AT (st+1 ) + Σh (st+1 ), Σvv = B(st+1 )Σhh B T (st+1 ) + Σv (st+1 ) Σvh = B(st+1 )F (it , st ), µv = B(st+1 )A(st+1 )f (it , st ), µh = A(st+1 )f (it , st ) (5) and then conditioning on vt+1 1 . For the case S = 1, this forms the usual Kalman Filter recursions[1]. Evaluating p(st , it |st+1 , v1:t+1 ) The mixture weight in (4) can be found from the decomposition p(st , it |st+1 , v1:t+1 ) ∝ p(vt+1 |it , st , st+1 , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ) (6) 1 p(x|y) is a Gaussian with mean µx + Σxy Σ−1 (y − µy ) and covariance Σxx − Σxy Σ−1 Σyx . yy yy The first factor in (6), p(vt+1 |it , st , st+1 , v1:t ) is a Gaussian with mean µv and covariance Σvv , as given in (5). The last two factors p(it |st , v1:t ) and p(st |v1:t ) are given from the previous iteration. Finally, p(st+1 |it , st , v1:t ) is found from p(st+1 |it , st , v1:t ) = p(st+1 |ht , st ) p(ht |it ,st ,v1:t ) (7) where · p denotes expectation with respect to p. In the SLDS, (7) is replaced by the Markov transition p(st+1 |st ). In the aSLDS, however, (7) will generally need to be computed numerically. Closing the recursion We are now in a position to calculate (4). For each setting of the variable st+1 , we have a mixture of I × S Gaussians which we numerically collapse back to I Gaussians to form I p(ht+1 |st+1 , v1:t+1 ) ≈ p(ht+1 |it+1 , st+1 , v1:t+1 )p(it+1 |st+1 , v1:t+1 ) it+1 =1 Any method of choice may be supplied to collapse a mixture to a smaller mixture; our code simply repeatedly merges low-weight components. In this way the new mixture coefficients p(it+1 |st+1 , v1:t+1 ), it+1 ∈ 1, . . . , I are defined, completing the description of how to form a recursion for p(ht+1 |st+1 , v1:t+1 ) in (3). A recursion for the switch variable is given by p(st+1 |v1:t+1 ) ∝ p(vt+1 |st+1 , it , st , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ) st ,it where all terms have been computed during the recursion for p(ht+1 |st+1 , v1:t+1 ). 
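The Gaussian algebra in Eq. (5), followed by conditioning on v_{t+1}, is the familiar Kalman prediction and update, applied per mixture component and per switch setting; a sketch for one such component, with illustrative names, follows.

```python
import numpy as np

# One forward step for a fixed (i_t, s_{t+1}): form the joint Gaussian of
# (h_{t+1}, v_{t+1}) as in Eq. (5), then condition on the observation v_{t+1}.

def forward_step(f, F, A, B, Sigma_h, Sigma_v, v_next):
    mu_h = A @ f                          # predicted hidden mean
    S_hh = A @ F @ A.T + Sigma_h          # predicted hidden covariance
    mu_v = B @ mu_h                       # predicted observation mean
    S_vv = B @ S_hh @ B.T + Sigma_v       # observation covariance
    S_hv = S_hh @ B.T                     # cross covariance between h_{t+1} and v_{t+1}
    K = S_hv @ np.linalg.inv(S_vv)        # gain
    f_new = mu_h + K @ (v_next - mu_v)    # posterior mean after conditioning on v_{t+1}
    F_new = S_hh - K @ S_vv @ K.T         # posterior covariance
    # (mu_v, S_vv) also give the Gaussian likelihood term used in the mixture weight of Eq. (6)
    return f_new, F_new, mu_v, S_vv
```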
The likelihood p(v1:T ) may be found by recursing p(v1:t+1 ) = p(vt+1 |v1:t )p(v1:t ), where p(vt+1 |vt ) = p(vt+1 |it , st , st+1 , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ) it ,st ,st+1 2.2 Backward Pass (Smoothing) The main contribution of this paper is to find a suitable way to ‘correct’ the filtered posterior p(st , ht |v1:t ) obtained from the forward pass into a smoothed posterior p(st , ht |v1:T ). We derive this for the case of a single Gaussian representation. The extension to the mixture case is straightforward and presented in [7]. We approximate the smoothed posterior p(ht |st , v1:T ) by a Gaussian with mean g(st ) and covariance G(st ) and our aim is to find a recursion for these parameters. A useful starting point for a recursion is: p(st+1 |v1:T )p(ht |st , st+1 , v1:T )p(st |st+1 , v1:T ) p(ht , st |v1:T ) = st+1 The term p(ht |st , st+1 , v1:T ) may be computed as p(ht |st , st+1 , v1:T ) = p(ht |ht+1 , st , st+1 , v1:t )p(ht+1 |st , st+1 , v1:T ) (8) ht+1 The recursion therefore requires p(ht+1 |st , st+1 , v1:T ), which we can write as p(ht+1 |st , st+1 , v1:T ) ∝ p(ht+1 |st+1 , v1:T )p(st |st+1 , ht+1 , v1:t ) (9) The difficulty here is that the functional form of p(st |st+1 , ht+1 , v1:t ) is not squared exponential in ht+1 , so that p(ht+1 |st , st+1 , v1:T ) will not be Gaussian2 . One possibility would be to approximate the non-Gaussian p(ht+1 |st , st+1 , v1:T ) by a Gaussian (or mixture thereof) by minimizing the Kullback-Leilbler divergence between the two, or performing moment matching in the case of a single Gaussian. A simpler alternative (which forms ‘standard’ EC) is to make the assumption p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ), where p(ht+1 |st+1 , v1:T ) is already known from the previous backward recursion. Under this assumption, the recursion becomes p(ht , st |v1:T ) ≈ p(st+1 |v1:T )p(st |st+1 , v1:T ) p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) (10) st+1 2 In the exact calculation, p(ht+1 |st , st+1 , v1:T ) is a mixture of Gaussians, see [7]. However, since in (9) the two terms p(ht+1 |st+1 , v1:T ) will only be approximately computed during the recursion, our approximation to p(ht+1 |st , st+1 , v1:T ) will not be a mixture of Gaussians. Evaluating p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) is a Gaussian in ht , whose statistics we will now compute. First we find p(ht |ht+1 , st , st+1 , v1:t ) which may be obtained from the joint distribution p(ht , ht+1 |st , st+1 , v1:t ) = p(ht+1 |ht , st+1 )p(ht |st , v1:t ) (11) which itself can be found from a forward dynamics from the filtered estimate p(ht |st , v1:t ). The statistics for the marginal p(ht |st , st+1 , v1:t ) are simply those of p(ht |st , v1:t ), since st+1 carries no extra information about ht . The remaining statistics are the mean of ht+1 , the covariance of ht+1 and cross-variance between ht and ht+1 , which are given by ht+1 = A(st+1 )ft (st ), Σt+1,t+1 = A(st+1 )Ft (st )AT (st+1 )+Σh (st+1 ), Σt+1,t = A(st+1 )Ft (st ) Given the statistics of (11), we may now condition on ht+1 to find p(ht |ht+1 , st , st+1 , v1:t ). Doing so effectively constitutes a reversal of the dynamics, ← − − ht = A (st , st+1 )ht+1 + ←(st , st+1 ) η ← − ← − − − where A (st , st+1 ) and ←(st , st+1 ) ∼ N (← t , st+1 ), Σ (st , st+1 )) are easily found using η m(s conditioning. 
Averaging the above reversed dynamics over p(ht+1 |st+1 , v1:T ), we find that p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) is a Gaussian with statistics ← − ← − ← − ← − − µt = A (st , st+1 )g(st+1 )+← t , st+1 ), Σt,t = A (st , st+1 )G(st+1 ) A T (st , st+1 )+ Σ (st , st+1 ) m(s These equations directly mirror the standard RTS backward pass[1]. Evaluating p(st |st+1 , v1:T ) The main departure of EC from previous methods is in treating the term p(st |st+1 , v1:T ) = p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) (12) The term p(st |ht+1 , st+1 , v1:t ) is given by p(st |ht+1 , st+1 , v1:t ) = p(ht+1 |st+1 , st , v1:t )p(st , st+1 |v1:t ) ′ ′ s′ p(ht+1 |st+1 , st , v1:t )p(st , st+1 |v1:t ) (13) t Here p(st , st+1 |v1:t ) = p(st+1 |st , v1:t )p(st |v1:t ), where p(st+1 |st , v1:t ) occurs in the forward pass, (7). In (13), p(ht+1 |st+1 , st , v1:t ) is found by marginalizing (11). Computing the average of (13) with respect to p(ht+1 |st+1 , v1:T ) may be achieved by any numerical integration method desired. A simple approximation is to evaluate the integrand at the mean value of the averaging distribution p(ht+1 |st+1 , v1:T ). More sophisticated methods (see [7]) such as sampling from the Gaussian p(ht+1 |st+1 , v1:T ) have the advantage that covariance information is used3 . Closing the Recursion We have now computed both the continuous and discrete factors in (8), which we wish to use to write the smoothed estimate in the form p(ht , st |v1:T ) = p(st |v1:T )p(ht |st , v1:T ). The distribution p(ht |st , v1:T ) is readily obtained from the joint (8) by conditioning on st to form the mixture p(ht |st , v1:T ) = p(st+1 |st , v1:T )p(ht |st , st+1 , v1:T ) st+1 which may then be collapsed to a single Gaussian (the mixture case is discussed in [7]). The smoothed posterior p(st |v1:T ) is given by p(st |v1:T ) = p(st+1 |v1:T ) p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) . (14) st+1 3 This is a form of exact sampling since drawing samples from a Gaussian is easy. This should not be confused with meaning that this use of sampling renders EC a sequential Monte-Carlo scheme. 2.3 Relation to other methods The EC Backward pass is closely related to Kim’s method [8]. In both EC and Kim’s method, the approximation p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ), is used to form a numerically simple backward pass. The other ‘approximation’ in EC is to numerically compute the average in (14). In Kim’s method, however, an update for the discrete variables is formed by replacing the required term in (14) by p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) ≈ p(st |st+1 , v1:t ) (15) Since p(st |st+1 , v1:t ) ∝ p(st+1 |st )p(st |v1:t )/p(st+1 |v1:t ), this can be computed simply from the filtered results alone. The fundamental difference therefore between EC and Kim’s method is that the approximation, (15), is not required by EC. The EC backward pass therefore makes fuller use of the future information, resulting in a recursion which intimately couples the continuous and discrete variables. The resulting effect on the quality of the approximation can be profound, as we will see in the experiments. The Expectation Propagation (EP) algorithm makes the central assumption of collapsing the posteriors to a Gaussian family [5]; the collapse is defined by a consistency criterion on overlapping marginals. In our experiments, we take the approach in [9] of collapsing to a single Gaussian. 
Ensuring consistency requires frequent translations between moment and canonical parameterizations, which is the origin of potentially severe numerical instability [10]. In contrast, EC works largely with moment parameterizations of Gaussians, for which relatively few numerical difficulties arise. Unlike EP, EC is not based on a consistency criterion and a subtle issue arises about possible inconsistencies in the Forward and Backward approximations for EC. For example, under the conditional independence assumption in the Backward Pass, p(hT |sT −1 , sT , v1:T ) ≈ p(hT |sT , v1:T ), which is in contradiction to (5) which states that the approximation to p(hT |sT −1 , sT , v1:T ) will depend on sT −1 . Such potential inconsistencies arise because of the approximations made, and should not be considered as separate approximations in themselves. Rather than using a global (consistency) objective, EC attempts to faithfully approximate the exact Forward and Backward propagation routines. For this reason, as in the exact computation, only a single Forward and Backward pass are required in EC. In [11] a related dynamics reversed is proposed. However, the singularities resulting from incorrectly treating p(vt+1:T |ht , st ) as a density are heuristically finessed. In [12] a variational method approximates the joint distribution p(h1:T , s1:T |v1:T ) rather than the marginal inference p(ht , st |v1:T ). This is a disadvantage when compared to other methods that directly approximate the marginal. Sequential Monte Carlo methods (Particle Filters)[13], are essentially mixture of delta-function approximations. Whilst potentially powerful, these typically suffer in high-dimensional hidden spaces, unless techniques such as Rao-Blackwellization are performed. ADF is generally preferential to Particle Filtering since in ADF the approximation is a mixture of non-trivial distributions, and is therefore more able to represent the posterior. 3 Demonstration Testing EC in a problem with a reasonably long temporal sequence, T , is important since numerical instabilities may not be apparent in timeseries of just a few points. To do this, we sequentially generate hidden and visible states from a given model, here with H = 3, S = 2, V = 1 – see Figure(2) for full details of the experimental setup. Then, given only the parameters of the model and the visible observations (but not any of the hidden states h1:T , s1:T ), the task is to infer p(ht |st , v1:T ) and p(st |v1:T ). Since the exact computation is exponential in T , a simple alternative is to assume that the original sample states s1:T are the ‘correct’ inferences, and compare how our most probable posterior smoothed estimates arg maxst p(st |v1:T ) compare with the assumed correct sample st . We chose conditions that, from the viewpoint of classical signal processing, are difficult, with changes in the switches occurring at a much higher rate than the typical frequencies in the signal vt . For EC we use the mean approximation for the numerical integration of (12). We included the Particle Filter merely for a point of comparison with ADF, since they are not designed to approximate PF RBPF EP ADFS KimS ECS ADFM KimM ECM 1000 800 600 400 200 0 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 Figure 2: The number of errors in estimating p(st |v1:T ) for a binary switch (S = 2) over a time series of length T = 100. Hence 50 errors corresponds to random guessing. Plotted are histograms of the errors are over 1000 experiments. 
The x-axes are cut off at 20 errors to improve visualization of the results. (PF) Particle Filter. (RBPF) Rao-Blackwellized PF. (EP) Expectation Propagation. (ADFS) Assumed Density Filtering using a Single Gaussian. (KimS) Kim’s smoother using the results from ADFS. (ECS) Expectation Correction using a Single Gaussian (I = J = 1). (ADFM) ADF using a multiple of I = 4 Gaussians. (KimM) Kim’s smoother using the results from ADFM. (ECM) Expectation Correction using a mixture with I = J = 4 components. S = 2, V = 1 (scalar observations), T = 100, with zero output bias. A(s) = 0.9999 ∗ orth(randn(H, H)), B(s) = randn(V, H). H = 3, Σh (s) = IH , Σv (s) = 0.1IV , p(st+1 |st ) ∝ 1S×S + IS . At time t = 1, the priors are p1 = uniform, with h1 drawn from N (10 ∗ randn(H, 1), IH ). the smoothed estimate, for which 1000 particles were used, with Kitagawa resampling. For the RaoBlackwellized Particle Filter [13], 500 particles were used, with Kitagawa resampling. We found that EP4 was numerically unstable and often struggled to converge. To encourage convergence, we used the damping method in [9], performing 20 iterations with a damping factor of 0.5. Nevertheless, the disappointing performance of EP is most likely due to conflicts resulting from numerical instabilities introduced by the frequent conversions between moment and canonical representations. The best filtered results are given using ADF, since this is better able to represent the variance in the filtered posterior than the sampling methods. Unlike Kim’s method, EC makes good use of the future information to clean up the filtered results considerably. One should bear in mind that both EC and Kim’s method use the same ADF filtered results. This demonstrates that EC may dramatically improve on Kim’s method, so that the small amount of extra work in making a numerical approximation of p(st |st+1 , v1:T ), (12), may bring significant benefits. We found similar conclusions for experiments with an aSLDS[7]. 4 Application to Noise Robust ASR Here we briefly present an application of the SLDS to robust Automatic Speech Recognition (ASR), for which the intractable inference is performed by EC, and serves to demonstrate how EC scales well to a large-scale application. Fuller details are given in [14]. The standard approach to noise robust ASR is to provide a set of noise-robust features to a standard Hidden Markov Model (HMM) classifier, which is based on modeling the acoustic feature vector. For example, the method of Unsupervised Spectral Subtraction (USS) [15] provides state-of-the-art performance in this respect. Incorporating noise models directly into such feature-based HMM systems is difficult, mainly because the explicit influence of the noise on the features is poorly understood. An alternative is to model the raw speech signal directly, such as the SAR-HMM model [16] for which, under clean conditions, isolated spoken digit recognition performs well. However, the SAR-HMM performs poorly under noisy conditions, since no explicit noise processes are taken into account by the model. The approach we take here is to extend the SAR-HMM to include an explicit noise process, so that h the observed signal vt is modeled as a noise corrupted version of a clean hidden signal vt : h vt = vt + ηt ˜ 4 with ηt ∼ N (0, σ 2 ) ˜ ˜ Generalized EP [5], which groups variables together improves on the results, but is still far inferior to the EC results presented here – Onno Zoeter personal communication. 
Noise Variance 0 10−7 10−6 10−5 10−4 10−3 SNR (dB) 26.5 26.3 25.1 19.7 10.6 0.7 HMM 100.0% 100.0% 90.9% 86.4% 59.1% 9.1% SAR-HMM 97.0% 79.8% 56.7% 22.2% 9.7% 9.1% AR-SLDS 96.8% 96.8% 96.4% 94.8% 84.0% 61.2% Table 1: Comparison of the recognition accuracy of three models when the test utterances are corrupted by various levels of Gaussian noise. The dynamics of the clean signal is modeled by a switching AR process R h vt = h h cr (st )vt−r + ηt (st ), h ηt (st ) ∼ N (0, σ 2 (st )) r=1 where st ∈ {1, . . . , S} denotes which of a set of AR coefficients cr (st ) are to be used at time t, h and ηt (st ) is the so-called innovation noise. When σ 2 (st ) ≡ 0, this model reproduces the SARHMM of [16], a specially constrained HMM. Hence inference and learning for the SAR-HMM are tractable and straightforward. For the case σ 2 (st ) > 0 the model can be recast as an SLDS. To do this we define ht as a vector which contains the R most recent clean hidden samples ht = h vt h . . . vt−r+1 T (16) and we set A(st ) to be an R × R matrix where the first row contains the AR coefficients −cr (st ) and the rest is a shifted down identity matrix. For example, for a third order (R = 3) AR process, A(st ) = −c1 (st ) −c2 (st ) −c3 (st ) 1 0 0 0 1 0 . (17) The hidden covariance matrix Σh (s) has all elements zero, except the top-left most which is set to the innovation variance. To extract the first component of ht we use the (switch independent) 1 × R projection matrix B = [ 1 0 . . . 0 ]. The (switch independent) visible scalar noise 2 variance is given by Σv ≡ σv . A well-known issue with raw speech signal models is that the energy of a signal may vary from one speaker to another or because of a change in recording conditions. For this reason the innovation Σh is adjusted by maximizing the likelihood of an observed sequence with respect to the innovation covariance, a process called Gain Adaptation [16]. 4.1 Training & Evaluation Following [16], we trained a separate SAR-HMM for each of the eleven digits (0–9 and ‘oh’) from the TI-DIGITS database [17]. The training set for each digit was composed of 110 single digit utterances down-sampled to 8 kHz, each one pronounced by a male speaker. Each SAR-HMM was composed of ten states with a left-right transition matrix. Each state was associated with a 10thorder AR process and the model was constrained to stay an integer multiple of K = 140 time steps (0.0175 seconds) in the same state. We refer the reader to [16] for a detailed explanation of the training procedure used with the SAR-HMM. An AR-SLDS was built for each of the eleven digits by copying the parameters of the corresponding trained SAR-HMM, i.e., the AR coefficients cr (s) are copied into the first row of the hidden transition matrix A(s) and the same discrete transition distribution p(st | st−1 ) is used. The models were then evaluated on a test set composed of 112 corrupted utterances of each of the eleven digits, each pronounced by different male speakers than those used in the training set. The recognition accuracy obtained by the models on the corrupted test sets is presented in Table 1. As expected, the performance of the SAR-HMM rapidly decreases with noise. The feature-based HMM with USS has high accuracy only for high SNR levels. In contrast, the AR-SLDS achieves a recognition accuracy of 61.2% at a SNR close to 0 dB, while the performance of the two other methods is equivalent to random guessing (9.1%). 
Whilst other inference methods may also perform well in this case, we found that EC performs admirably, without numerical instabilities, even for time-series with several thousand time-steps. 5 Discussion We presented a method for approximate smoothed inference in an augmented class of switching linear dynamical systems. Our approximation is based on the idea that due to the forgetting which commonly occurs in Markovian models, a finite number of mixture components may provide a reasonable approximation. Clearly, in systems with very long correlation times our method may require too many mixture components to produce a satisfactory result, although we are unaware of other techniques that would be able to cope well in that case. The main benefit of EC over Kim smoothing is that future information is more accurately dealt with. Whilst EC is not as general as EP, EC carefully exploits the properties of singly-connected distributions, such as the aSLDS, to provide a numerically stable procedure. We hope that the ideas presented here may therefore help facilitate the practical application of dynamic hybrid networks. Acknowledgements This work is supported by the EU Project FP6-0027787. This paper only reflects the authors’ views and funding agencies are not liable for any use that may be made of the information contained herein. References [1] Y. Bar-Shalom and Xiao-Rong Li. Estimation and Tracking : Principles, Techniques and Software. Artech House, Norwood, MA, 1998. [2] V. Pavlovic, J. M. Rehg, and J. MacCormick. Learning switching linear models of human motion. In Advances in Neural Information Processing systems (NIPS 13), pages 981–987, 2001. [3] A. T. Cemgil, B. Kappen, and D. Barber. A Generative Model for Music Transcription. IEEE Transactions on Audio, Speech and Language Processing, 14(2):679 – 694, 2006. [4] U. N. Lerner. Hybrid Bayesian Networks for Reasoning about Complex Systems. PhD thesis, Stanford University, 2002. [5] O. Zoeter. Monitoring non-linear and switching dynamical systems. PhD thesis, Radboud University Nijmegen, 2005. [6] T. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT Media Lab, 2001. [7] D. Barber. Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems. Journal of Machine Learning Research, 7:2515–2540, 2006. [8] C-J. Kim. Dynamic linear models with Markov-switching. Journal of Econometrics, 60:1–22, 1994. [9] T. Heskes and O. Zoeter. Expectation Propagation for approximate inference in dynamic Bayesian networks. In A. Darwiche and N. Friedman, editors, Uncertainty in Art. Intelligence, pages 216–223, 2002. [10] S. Lauritzen and F. Jensen. Stable local computation with conditional Gaussian distributions. Statistics and Computing, 11:191–203, 2001. [11] G. Kitagawa. The Two-Filter Formula for Smoothing and an implementation of the Gaussian-sum smoother. Annals of the Institute of Statistical Mathematics, 46(4):605–623, 1994. [12] Z. Ghahramani and G. E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):963–996, 1998. [13] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001. [14] B. Mesot and D. Barber. Switching Linear Dynamical Systems for Noise Robust Speech Recognition. IDIAP-RR 08, 2006. [15] G. Lathoud, M. Magimai-Doss, B. Mesot, and H. Bourlard. Unsupervised spectral subtraction for noiserobust ASR. In Proceedings of ASRU 2005, pages 189–194, November 2005. [16] Y. Ephraim and W. J. J. Roberts. 
Revisiting autoregressive hidden Markov modeling of speech signals. IEEE Signal Processing Letters, 12(2):166–169, February 2005. [17] R.G. Leonard. A database for speaker independent digit recognition. In Proceedings of ICASSP84, volume 3, 1984.

4 0.53100199 26 nips-2006-An Approach to Bounded Rationality

Author: Eli Ben-sasson, Ehud Kalai, Adam Kalai

Abstract: A central question in game theory and artificial intelligence is how a rational agent should behave in a complex environment, given that it cannot perform unbounded computations. We study strategic aspects of this question by formulating a simple model of a game with additional costs (computational or otherwise) for each strategy. First we connect this to zero-sum games, proving a counter-intuitive generalization of the classic min-max theorem to zero-sum games with the addition of strategy costs. We then show that potential games with strategy costs remain potential games. Both zero-sum and potential games with strategy costs maintain a very appealing property: simple learning dynamics converge to equilibrium.

1 The Approach and Basic Model

How should an intelligent agent play a complicated game like chess, given that it does not have unlimited time to think? This question reflects one fundamental aspect of "bounded rationality," a term coined by Herbert Simon [1]. However, bounded rationality has proven to be a slippery concept to formalize (prior work has focused largely on finite automata playing simple repeated games such as prisoner's dilemma, e.g. [2, 3, 4, 5]). This paper focuses on the strategic aspects of decision-making in complex multi-agent environments, i.e., on how a player should choose among strategies of varying complexity, given that its opponents are making similar decisions. Our model applies to general strategic games and allows for a variety of complexities that arise in real-world applications. For this reason, it is applicable to one-shot games, to extensive games, and to repeated games, and it generalizes existing models such as repeated games played by finite automata. To easily see that bounded rationality can drastically affect the outcome of a game, consider the following factoring game. Player 1 chooses an n-bit number and sends it to Player 2, who attempts to find its prime factorization. If Player 2 is correct, he is paid 1 by Player 1; otherwise he pays 1 to Player 1. Ignoring complexity costs, the game is a trivial win for Player 2. However, for large n, the game is essentially a win for Player 1, who can easily output a large random number that Player 2 cannot factor (under appropriate complexity assumptions). In general, the outcome of a game (even a zero-sum game like chess) with bounded rationality is not so clear. To concretely model such games, we consider a set of available strategies along with strategy costs. Consider an example of two players preparing to play a computerized chess game for a $100K prize. Suppose the players simultaneously choose between two available options: to use a $10K program A or an advanced program B, which costs $50K. We refer to the row chooser as white and to the column chooser as black, with the corresponding advantages reflected by the win probabilities of white described in Table 1a. For example, when both players use program A, white wins 55% of the time and black wins 45% of the time (we ignore draws). The players naturally want to choose strategies to maximize their expected net payoffs, i.e., their expected payoff minus their cost. Each cell in Table 1b contains a pair of payoffs in units of thousands of dollars; the first is white's net expected payoff and the second is black's.

a)               A        B
       A        55%      13%
       B        93%      51%

b)              A (-10)   B (-50)
    A (-10)     45, 35     3, 37
    B (-50)     43, -3     1, -1

Figure 1: a) Table of first-player winning probabilities based on program choices.
b) Table of expected net earnings in thousands of dollars. The unique equilibrium is (A,B), which strongly favors the second player.

A surprising property is evident in the above game. Everything about the game seems to favor white. Yet due to the (symmetric) costs, at the unique Nash equilibrium (A,B) of Table 1b, black wins 87% of the time and nets $34K more than white. In fact, it is a dominant strategy for white to play A and for black to play B. To see this, note that playing B increases white's probability of winning by 38%, independent of what black chooses. Since the pot is $100K, this is worth $38K in expectation, but B costs $40K more than A. On the other hand, black enjoys a 42% increase in probability of winning due to B, independent of what white does, and hence is willing to pay the extra $40K.

Before formulating the general model, we comment on some important aspects of the chess example. First, traditional game theory states that chess can be solved in "only" two rounds of elimination of dominated strategies [10], and the outcome with optimal play should always be the same: either a win for white or a win for black. This theoretical prediction fails in practice: in top play, the outcome is very nondeterministic, with white winning roughly twice as often as black. The game is too large and complex to be solved by brute force. Second, we have been able to analyze the above chess program selection example exactly because we formulated it as a game with a small number of available strategies per player. Another formulation that would fit into our model would be to include all strategies of chess, with some reasonable computational costs. However, it is beyond our means to analyze such a large game. Third, in the example above we used monetary software cost to illustrate a type of strategy cost. But the same analysis could accommodate many other types of costs that can be measured numerically and subtracted from the payoffs, such as time or effort involved in the development or execution of a strategy, and other resource costs. Additional examples in this paper include the number of states in a finite automaton, the number of gates in a circuit, and the number of turns on a commuter's route. Our analysis is limited, however, to cost functions that depend only on the strategy of the player and not the strategy chosen by its opponent. For example, if our players above were renting computers A or B and paying for the time of actual usage, then the cost of using A would depend on the choice of computer made by the opponent.

Generalizing the example above, we consider a normal form game with the addition of strategy costs, a player-dependent cost for playing each available strategy. Our main results regard two important classes of games: constant-sum and potential games. Potential games with strategy costs remain potential games. While two-person constant-sum games with strategy costs are no longer constant-sum, we give a basic structural description of optimal play in these games. Lastly, we show that known learning dynamics converge in both classes of games.
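To make the chess-program example concrete, here is a small numerical sketch (our own illustration in Python/NumPy, not part of the paper; variable names are ours) that builds the net-payoff table of Figure 1b from the win probabilities of Figure 1a and checks that A is dominant for white and B is dominant for black.

```python
import numpy as np

# Win probabilities for white (row = white's program, column = black's), Figure 1a.
p_white_wins = np.array([[0.55, 0.13],
                         [0.93, 0.51]])
pot = 100.0                       # prize in $K
cost = np.array([10.0, 50.0])     # strategy costs of programs A and B, in $K

# Net expected payoffs of Figure 1b: expected share of the pot minus own cost.
white_net = pot * p_white_wins - cost[:, None]
black_net = pot * (1.0 - p_white_wins) - cost[None, :]

# A (index 0) dominates B (index 1) for white; B dominates A for black.
assert np.all(white_net[0, :] > white_net[1, :])
assert np.all(black_net[:, 1] > black_net[:, 0])
print(white_net)   # [[45.  3.]  [43.  1.]]
print(black_net)   # [[35. 37.]  [-3. -1.]]
```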
2 Definition of strategy costs

We first define an N-person normal-form game G = (N, S, p) consisting of finite sets of (available) pure strategies S = (S1, . . . , SN) for the N players, and a payoff function p : S1 × · · · × SN → R^N. Players simultaneously choose strategies si ∈ Si, after which player i is rewarded with pi(s1, . . . , sN). A randomized or mixed strategy σi for player i is a probability distribution over its pure strategies Si,

    σi ∈ ∆i = { x ∈ R^{|Si|} : Σ_j x_j = 1, x_j ≥ 0 }.

We extend p to ∆1 × · · · × ∆N in the natural way, i.e., pi(σ1, . . . , σN) = E[pi(s1, . . . , sN)], where each si is drawn from σi independently. Denote by s−i = (s1, s2, . . . , si−1, si+1, . . . , sN), and similarly for σ−i. A best response by player i to σ−i is a σi* ∈ ∆i such that pi(σi*, σ−i) = max_{σi ∈ ∆i} pi(σi, σ−i). A (mixed strategy) Nash equilibrium of G is a vector of strategies (σ1, . . . , σN) ∈ ∆1 × · · · × ∆N such that each σi is a best response to σ−i. We now define G−c, the game G with strategy costs c = (c1, . . . , cN), where ci : Si → R. It is simply an N-person normal-form game G−c = (N, S, p−c) with the same sets of pure strategies as G, but with a new payoff function p−c : S1 × · · · × SN → R^N, where

    p−c_i(s1, . . . , sN) = pi(s1, . . . , sN) − ci(si),   for i = 1, . . . , N.

We similarly extend ci to ∆i in the natural way.

3 Two-person constant-sum games with strategy costs

Recall that a game is constant-sum (k-sum for short) if at every combination of individual strategies, the players' payoffs sum to some constant k. Two-person k-sum games have some important properties, not shared by general-sum games, which result in more effective game-theoretic analysis. In particular, every k-sum game has a unique value v ∈ R. A mixed strategy for player 1 is called optimal if it guarantees payoff ≥ v against any strategy of player 2. A mixed strategy for player 2 is optimal if it guarantees ≥ k − v against any strategy of player 1. The term optimal is used because optimal strategies guarantee as much as possible (v + k − v = k), and playing anything that is not optimal can result in a lesser payoff if the opponent responds appropriately. (This fact is easily illustrated in the game rock-paper-scissors: randomizing uniformly among the strategies guarantees each player 50% of the pot, while playing anything other than uniformly random enables the opponent to win strictly more often.) The existence of optimal strategies for both players follows from the min-max theorem. An easy corollary is that the Nash equilibria of a k-sum game are exchangeable: they are simply the cross-product of the sets of optimal mixed strategies for both players. Lastly, it is well known that equilibria in two-person k-sum games can be learned in repeated play by simple dynamics that are guaranteed to converge [17]. With the addition of strategy costs, a k-sum game is no longer k-sum, and hence it is not clear, at first, what optimal strategies there are, if any. (Many examples of general-sum games do not have optimal strategies.) We show the following generalization of the above properties for zero-sum games with strategy costs.

Theorem 1. Let G be a finite two-person k-sum game and G−c be the game with strategy costs c = (c1, c2).
1. There is a value v ∈ R for G−c and nonempty sets OPT1 and OPT2 of optimal mixed strategies for the two players. OPT1 is the set of strategies that guarantee player 1 payoff ≥ v − c2(σ2) against any strategy σ2 chosen by player 2. Similarly, OPT2 is the set of strategies that guarantee player 2 payoff ≥ k − v − c1(σ1) against any σ1.
2. The Nash equilibria of G−c are exchangeable: the set of Nash equilibria is OPT1 × OPT2.
3. The set of net payoffs possible at equilibrium is an axis-parallel rectangle in R^2.
For zero-sum games, the term optimal strategy was natural: the players could guarantee v and k − v, respectively, and this is all that there was to share. Moreover, it is easy to see that only pairs of optimal strategies can have the Nash equilibrium property, being best responses to each other. In the case of zero-sum games with strategy costs, the optimal structure is somewhat counterintuitive. First, it is strange that the amount guaranteed by either player depends on the cost of the other player's action, when in reality each player pays the cost of its own action. Second, it is not even clear why we call these optimal strategies. To get a feel for this latter issue, notice that the sum of the net payoffs to the two players is always k − c1(σ1) − c2(σ2), which is exactly the total of what optimal strategies guarantee, v − c2(σ2) + k − v − c1(σ1). Hence, if both players play what we call optimal strategies, then neither player can improve and they are at Nash equilibrium. On the other hand, suppose player 1 selects a strategy σ1 that does not guarantee him payoff at least v − c2(σ2). This means that there is some response σ2 by player 2 for which player 1's payoff is < v − c2(σ2) and hence player 2's payoff is > k − v − c1(σ1). Thus player 2's best response to σ1 must give player 2 payoff > k − v − c1(σ1) and leave player 1 with < v − c2(σ2). The proof of the theorem (the above reasoning only implies part 2 from part 1) is based on the following simple observation. Consider the k-sum game H = (N, S, q) with the following payoffs:

    q1(s1, s2) = p1(s1, s2) − c1(s1) + c2(s2) = p−c_1(s1, s2) + c2(s2)
    q2(s1, s2) = p2(s1, s2) − c2(s2) + c1(s1) = p−c_2(s1, s2) + c1(s1)

That is to say, Player 1 pays its strategy cost to Player 2 and vice versa. It is easy to verify that, for all σ1, σ1′ ∈ ∆1 and σ2 ∈ ∆2,

    q1(σ1, σ2) − q1(σ1′, σ2) = p−c_1(σ1, σ2) − p−c_1(σ1′, σ2).    (1)

This means that the relative advantage of switching strategies in the games G−c and H is the same. In particular, σ1 is a best response to σ2 in G−c if and only if it is in H. A similar equality holds for player 2's payoffs. Note that these conditions imply that the games G−c and H are strategically equivalent in the sense defined by Moulin and Vial [16].

Proof of Theorem 1. Let v be the value of the game H. For any strategy σ1 that guarantees player 1 payoff ≥ v in H, σ1 guarantees player 1 ≥ v − c2(σ2) in G−c. This follows from the definition of H. Similarly, any strategy σ2 that guarantees player 2 payoff ≥ k − v in H will guarantee ≥ k − v − c1(σ1) in G−c. Thus the sets OPT1 and OPT2 are non-empty. Since v − c2(σ2) + k − v − c1(σ1) = k − c1(σ1) − c2(σ2) is the sum of the payoffs in G−c, nothing greater can be guaranteed by either player. Since the best responses of G−c and H are the same, the Nash equilibria of the two games are the same. Since H is a k-sum game, its Nash equilibria are exchangeable, and thus we have part 2. (This holds for any game that is strategically equivalent to k-sum.) Finally, the optimal mixed strategies OPT1, OPT2 of any k-sum game are convex sets. If we look at the achievable costs of the mixed strategies in OPTi, by the definition of the cost of a mixed strategy, this will be a convex subset of R, i.e., an interval. By parts 1 and 2, the set of achievable net payoffs at equilibria of G−c is therefore the cross-product of intervals.
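The side-payment construction of H is easy to check numerically. The sketch below (our own code, not the authors'; the function name and the random test game are ours) builds H from an arbitrary two-player k-sum game with strategy costs and verifies equation (1), i.e., that G−c and H induce the same best responses.

```python
import numpy as np

def side_payment_game(P1, P2, c1, c2):
    """Given payoff matrices P1, P2 of a k-sum game G and strategy-cost vectors
    c1, c2, return the payoffs of G^{-c} and of the auxiliary game H in which
    each player pays its own cost to the other."""
    Gm1 = P1 - c1[:, None]            # p_1^{-c}
    Gm2 = P2 - c2[None, :]            # p_2^{-c}
    Q1 = Gm1 + c2[None, :]            # q_1 = p_1 - c_1(s_1) + c_2(s_2)
    Q2 = Gm2 + c1[:, None]            # q_2 = p_2 - c_2(s_2) + c_1(s_1)
    return Gm1, Gm2, Q1, Q2

rng = np.random.default_rng(0)
k = 10.0
P1 = rng.uniform(0, k, size=(4, 4))
P2 = k - P1                           # a k-sum game
c1 = np.array([1.0, 2.0, 3.0, 4.0])
c2 = np.array([1.0, 2.0, 3.0, 4.0])
Gm1, Gm2, Q1, Q2 = side_payment_game(P1, P2, c1, c2)

# H is again k-sum ...
assert np.allclose(Q1 + Q2, k)
# ... and payoff differences between a player's strategies (equation (1)) agree,
# so best responses in G^{-c} and H coincide.
assert np.allclose(Q1 - Q1[0], Gm1 - Gm1[0])          # player 1, across rows
assert np.allclose(Q2 - Q2[:, [0]], Gm2 - Gm2[:, [0]])  # player 2, across columns
```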
To illustrate Theorem 1 graphically, Figure 2 gives a 4 × 4 example with costs of 1, 2, 3, and 4, respectively. It illustrates a situation with multiple optimal strategies. Notice that player 1 is completely indifferent between its optimal choices A and B, and player 2 is completely indifferent between C and D. Thus the only question is how kind they would like to be to their opponent. The (A,C) equilibrium is perhaps most natural, as it yields the highest payoffs for both parties. Note that the proof of the above theorem actually shows that zero-sum games with costs share additional appealing properties of zero-sum games. For example, computing optimal strategies is a polynomial-time computation in an n × n game, as it amounts to computing the equilibria of H. We next show that they also have appealing learning properties, though they do not share all properties of zero-sum games (see the footnote below).

3.1 Learning in repeated two-person k-sum games with strategy costs

Another desirable property of k-sum games is that, in repeated play, natural learning dynamics converge to the set of Nash equilibria. Before we state the analogous conditions for k-sum games with costs, we briefly give a few definitions. A repeated game is one in which players choose a sequence of strategy vectors s^1, s^2, . . ., where each s^t = (s^t_1, . . . , s^t_N) is a strategy vector of some fixed stage game G = (N, S, p). Under perfect monitoring, when selecting an action in any period the players know all the previously selected actions. As we shall discuss, it is possible to learn to play without perfect monitoring as well.

Footnote: One property that is violated by the chess example is the "advantage of an advantage" property. Say Player 1 has the advantage over Player 2 in a square game if p1(s1, s2) ≥ p2(s2, s1) for all strategies s1, s2. At equilibrium of a k-sum game, a player with the advantage must have a payoff at least as large as its opponent's. This is no longer the case after incorporating strategy costs, as seen in the chess example, where Player 1 has the advantage (even including strategy costs), yet his equilibrium payoff is smaller than Player 2's.

a)               A           B           C           D
       A        6, 4        5, 5        3, 7        2, 8
       B        7, 3        6, 4        4, 6        3, 7
       C      7.5, 2.5    6.5, 3.5    4.5, 5.5    3.5, 6.5
       D      8.5, 1.5      7, 3      5.5, 4.5    4.5, 5.5

b)              A (-1)      B (-2)      C (-3)      D (-4)
    A (-1)      5, 3        4, 3        2, 4        1, 4
    B (-2)      5, 2        4, 2        2, 3        1, 3
    C (-3)    4.5, 1.5    3.5, 1.5    1.5, 2.5    0.5, 2.5
    D (-4)    4.5, 0.5      3, 1      1.5, 1.5    0.5, 1.5

(Panel c, a scatter plot of the net-payoff pairs with player 1's net payoff on one axis and player 2's on the other, is omitted here; see the caption.)

Figure 2: a) Payoffs in the 10-sum game G. b) Expected net earnings in G−c. OPT1 is any mixture of A and B, and OPT2 is any mixture of C and D. Each player's choice of equilibrium strategy affects only the opponent's net payoff. c) A graphical display of the payoff pairs. The shaded region shows the rectangular set of payoffs achievable at mixed strategy Nash equilibria.

Perhaps the most intuitive dynamics are best-response: at each stage, each player selects a best response to the opponent's previous stage play. Unfortunately, these naive dynamics fail to converge to equilibrium in very simple examples. The fictitious play dynamics prescribe, at stage t, selecting any strategy that is a best response to the empirical distribution of the opponent's play during the first t − 1 stages. It has been shown that fictitious play converges to equilibrium (of the stage game G) in k-sum games [17]. However, fictitious play requires perfect monitoring. One can learn to play a two-person k-sum game with no knowledge of the payoff table or anything about the other player's actions.
Using experimentation, the only observations required by each player are its own payoffs in each period (in addition to the number of available actions). So-called bandit algorithms [7] must manage the exploration-exploitation tradeoff. The proof of their convergence follows from the fact that they are no-regret algorithms. (No-regret algorithms date back to Hannan in the 1950's [12], but his algorithm required perfect monitoring.) The regret of player i at stage T is defined to be

    regret of i at T = (1/T) max_{si ∈ Si} Σ_{t=1}^{T} [ pi(si, s^t_{−i}) − pi(s^t_i, s^t_{−i}) ],

that is, how much better in hindsight player i could have done on the first T stages had it used one fixed strategy the whole time (and had the opponents not changed their strategies). Note that regret can be positive or negative. A no-regret algorithm is one in which each player's asymptotic regret converges to (−∞, 0], i.e., is guaranteed to approach 0 or less. It is well known that the no-regret condition in two-person k-sum games implies convergence to equilibrium (see, e.g., [13]). In particular, the pair of mixed strategies which are the empirical distributions of play over time approaches the set of Nash equilibria of the stage game. Inverse-polynomial rates of convergence (that are polynomial also in the size of the game) can be given for such algorithms. Hence no-regret algorithms provide arguably reasonable ways to play a k-sum game of moderate size. Note that in general-sum games, no such dynamics are known. Fortunately, the same algorithms that work for learning in k-sum games seem to work for learning in such games with strategy costs.

Theorem 2. Fictitious play converges to the set of Nash equilibria of the stage game in a two-person k-sum game with strategy costs, as do no-regret learning dynamics.

Proof. The proof again follows from equation (1) regarding the game H. Fictitious play dynamics are defined only in terms of best response play. Since G−c and H share the same best responses, fictitious play dynamics are identical for the two games. Since they share the same equilibria and fictitious play converges to equilibria in H, it must converge in G−c as well. For no-regret algorithms, equation (1) again implies that for any play sequence, the regret of each player i with respect to the game G−c is the same as its regret with respect to the game H. Hence, no regret in G−c implies no regret in H. Since no-regret algorithms converge to the set of equilibria in k-sum games, they converge to the set of equilibria in H and therefore in G−c as well.
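As a concrete illustration of Theorem 2, the sketch below (our own code, not the authors'; the function and variable names are ours) runs fictitious play on the chess example of Figure 1 with its strategy costs already subtracted; the empirical mixed strategies approach the equilibrium (A, B).

```python
import numpy as np

def fictitious_play(R, C, T=1000):
    """Fictitious play for a two-player game in normal form.
    R[i, j], C[i, j] are the row and column players' payoffs (here the net
    payoffs of G^{-c}).  Returns the empirical mixed strategies after T stages."""
    n, m = R.shape
    row_counts = np.zeros(n)
    col_counts = np.zeros(m)
    row_counts[0] += 1            # arbitrary first plays
    col_counts[0] += 1
    for _ in range(T - 1):
        # Each player best-responds to the opponent's empirical distribution.
        i = np.argmax(R @ (col_counts / col_counts.sum()))
        j = np.argmax((row_counts / row_counts.sum()) @ C)
        row_counts[i] += 1
        col_counts[j] += 1
    return row_counts / T, col_counts / T

# Net payoffs of the chess example (Figure 1b), in $K; rows/cols = programs A, B.
white_net = np.array([[45.0, 3.0], [43.0, 1.0]])
black_net = np.array([[35.0, 37.0], [-3.0, -1.0]])
sigma_white, sigma_black = fictitious_play(white_net, black_net)
print(sigma_white, sigma_black)   # ≈ [1, 0] and ≈ [0, 1]: the (A, B) equilibrium
```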
4 Potential games with strategic costs

Let us begin with an example of a potential game, called a routing game [18]. There is a fixed directed graph with n nodes and m edges. Commuters i = 1, 2, . . . , N each decide on a route πi to take from their home si to their work ti, where si and ti are nodes in the graph. For each edge uv, let n_uv be the number of commuters whose path πi contains edge uv. Let f_uv : Z → R be a nonnegative monotonically increasing congestion function. Player i's payoff is −Σ_{uv ∈ πi} f_uv(n_uv), i.e., the negative sum of the congestions on the edges in its path. An N-person normal form game G is said to be a potential game [15] if there is some potential function Φ : S1 × · · · × SN → R such that changing a single player's action changes its payoff by the change in the potential function. That is, there exists a single function Φ such that, for all players i and all pure strategy vectors s, s′ ∈ S1 × · · · × SN that differ only in the ith coordinate,

    pi(s) − pi(s′) = Φ(s) − Φ(s′).    (2)

Potential games have appealing learning properties: simple better-reply dynamics converge to pure-strategy Nash equilibria, as do the more sophisticated fictitious-play dynamics described earlier [15]. In our example, this means that if players change their individual paths so as to selfishly reduce the sum of congestions on their path, this will eventually lead to an equilibrium where no one can improve. (This is easy to see because Φ keeps increasing.) The absence of similar learning properties for general games presents a frustrating hole in learning and game theory. It is clear that the theoretically clean commuting example above misses some realistic considerations. One issue regarding complexity is that most commuters would not be willing to take a very complicated route just to save a short amount of time. To model this, we consider potential games with strategy costs. In our example, this would be a cost associated with every path. For example, suppose the graph represented streets in a given city. We consider a natural strategy complexity cost associated with a route π, say λ(#turns(π))², where there is a parameter λ ∈ R and #turns(π) is defined as the number of times that a commuter has to turn on a route. (To be more precise, say each edge in the graph is annotated with a street name, and a turn is defined to be a pair of consecutive edges in the graph with different street names.) Hence, a best response for player i would solve

    min over paths π from si to ti of  (total congestion of π) + λ(#turns(π))².

While adding strategy costs to potential games allows for much more flexibility in model design, one might worry that appealing properties of potential games, such as having pure strategy equilibria and easy learning dynamics, no longer hold. This is not the case. We show that strategy costs fit easily into the potential game framework:

Theorem 3. For any potential game G and any cost functions c, G−c is also a potential game.

Proof. Let Φ be a potential function for G. It is straightforward to verify that G−c admits the following potential function Φ′:

    Φ′(s1, . . . , sN) = Φ(s1, . . . , sN) − c1(s1) − · · · − cN(sN).

5 Additional remarks

Part of the reason that the notion of bounded rationality is so difficult to formalize is that understanding enormous games like chess is a daunting proposition. That is why we have narrowed it down to choosing among a small number of available programs. A game theorist might begin by examining the complete payoff table, which is prohibitively large: instead of considering only the choices of programs A and B, each player considers all possible chess strategies. In that sense, our payoff table in 1a would be viewed as a reduction of the "real" normal form game. A computer scientist, on the other hand, may consider it reasonable to begin with the existing strategies that one has access to. Regardless of how you view the process, it is clear that for practical purposes players in real life do simplify and analyze "smaller" sets of strategies. Even if the players consider the option of engineering new chess-playing software, this can be viewed as a third strategy in the game, with its own cost and expected payoffs. Again, when considering a small number of available strategies, like the two programs above, it may still be difficult to assess the expected payoffs that result when (possibly randomized) strategies play against each other.
An additional assumption made throughout the paper is that the players share the same assessments about these expected payoffs. Like other common-knowledge assumptions made in game theory, it would be desirable to weaken this assumption. In the special families of games studied in this paper, and perhaps in additional cases, learning algorithms may be employed to reach equilibrium without knowledge of payoffs. 5.1 Finite automata playing repeated games There has been a large body of interesting work on repeated games played by finite automata (see [14] for a survey). Much of this work is on achieving cooperation in the classic prisoner’s dilemma game (e.g., [2, 3, 4, 5]). Many of these models can be incorporated into the general model outlined in this paper. For example, to view the Abreu and Rubinstein model [6] as such, consider the normal form of an infinitely repeated game with discounting, but restricted to strategies that can be described by finite automata (the payoffs in every cell of the payoff table are the discounted sums of the infinite streams of payoffs obtained in the repeated game). Let the cost of a strategy be an increasing function of the number of states it employs. For Neyman’s model [3], consider the normal form of a finitely repeated game with a known number of repetitions. You may consider strategies in this normal form to be only ones with a bounded number of states, as required by Neyman, and assign zero cost to all strategies. Alternatively, you may allow all strategies but assign zero cost to ones that employ number of states below Neyman’s bounds, and an infinite cost to strategies that employ a number of states that exceeds Neyman’s bounds. The structure of equilibria proven in Theorem 1 applies to all the above models when dealing with repeated k-sum games, as in [2]. 6 Future work There are very interesting questions to answer about bounded rationality in truly large games that we did not touch upon. For example, consider the factoring game from the introduction. A pure strategy for Player 1 would be outputting a single n-bit number. A pure strategy for Player 2 would be any factoring program, described by a circuit that takes as input an n-bit number and attempts to output a representation of its prime factorization. The complexity of such a strategy would be an increasing function of the number of gates in the circuit. It would be interesting to make connections between asymptotic algorithm complexity and games. Another direction regards an elegant line of work on learning to play correlated equilibria by repeated play [11]. It would be natural to consider how strategy costs affect correlated equilibria. Finally, it would also be interesting to see how strategy costs affect the so-called “price of anarchy” [19] in congestion games. Acknowledgments This work was funded in part by a U.S. NSF grant SES-0527656, a Landau Fellowship supported by the Taub and Shalom Foundations, a European Community International Reintegration Grant, an Alon Fellowship, ISF grant 679/06, and BSF grant 2004092. Part of this work was done while the first and second authors were at the Toyota Technological Institute at Chicago. References [1] H. Simon. The sciences of the artificial. MIT Press, Cambridge, MA, 1969. [2] E. Ben-Porath. Repeated games with finite automata, Journal of Economic Theory 59: 17–32, 1993. [3] A. Neyman. Bounded Complexity Justifies Cooperation in the Finitely Repeated Prisoner’s Dilemma. Economic Letters, 19: 227–229, 1985. [4] A. Rubenstein. 
Finite automata play the repeated prisoner’s dilemma. Journal of Economic Theory, 39:83– 96, 1986. [5] C. Papadimitriou, M. Yannakakis: On complexity as bounded rationality. In Proceedings of the TwentySixth Annual ACM Symposium on Theory of Computing, pp. 726–733, 1994. [6] D. Abreu and A. Rubenstein. The Structure of Nash Equilibrium in Repeated Games with Finite Automata. Econometrica 56:1259-1281, 1988. [7] P. Auer, N. Cesa-Bianchi, Y. Freund, R. Schapire. The Nonstochastic Multiarmed Bandit Problem. SIAM J. Comput. 32(1):48-77, 2002. [8] X. Chen, X. Deng, and S. Teng. Computing Nash Equilibria: Approximation and smoothed complexity. Electronic Colloquium on Computational Complexity Report TR06-023, 2006. [9] K. Daskalakis, P. Goldberg, C. Papadimitriou. The complexity of computing a Nash equilibrium. Electronic Colloquium on Computational Complexity Report TR05-115, 2005. [10] C. Ewerhart. Chess-like Games Are Dominance Solvable in at Most Two Steps. Games and Economic Behavior, 33:41-47, 2000. [11] D. Foster and R. Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 21:40-55, 1997. [12] J. Hannan. Approximation to Bayes risk in repeated play. In M. Dresher, A. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume 3, pp. 97–139. Princeton University Press, 1957. [13] S. Hart and A. Mas-Colell. A General Class of Adaptive Strategies. Journal of Economic Theory 98(1):26– 54, 2001. [14] E. Kalai. Bounded rationality and strategic complexity in repeated games. In T. Ichiishi, A. Neyman, and Y. Tauman, editors, Game Theory and Applications, pp. 131–157. Academic Press, San Diego, 1990. [15] D. Monderer, L. Shapley. Potential games. Games and Economic Behavior, 14:124–143, 1996. [16] H. Moulin and P. Vial. Strategically Zero Sum Games: the Class of Games Whose Completely Mixed Equilibria Cannot Be Improved Upon. International Journal of Game Theory, 7:201–221, 1978. [17] J. Robinson, An iterative method of solving a game, Ann. Math. 54:296–301, 1951. [18] R. Rosenthal. A Class of Games Possessing Pure-Strategy Nash Equilibria. International Journal of Game Theory, 2:65–67, 1973. [19] E. Koutsoupias and C. Papadimitriou. Worstcase equilibria. In Proceedings of the 16th Annual Symposium on Theoretical Aspects of Computer Science, pp. 404–413, 1999.

5 0.43354496 13 nips-2006-A Scalable Machine Learning Approach to Go

Author: Lin Wu, Pierre F. Baldi

Abstract: Go is an ancient board game that poses unique opportunities and challenges for AI and machine learning. Here we develop a machine learning approach to Go, and related board games, focusing primarily on the problem of learning a good evaluation function in a scalable way. Scalability is essential at multiple levels, from the library of local tactical patterns, to the integration of patterns across the board, to the size of the board itself. The system we propose is capable of automatically learning the propensity of local patterns from a library of games. Propensity and other local tactical information are fed into a recursive neural network, derived from a Bayesian network architecture. The network integrates local information across the board and produces local outputs that represent local territory ownership probabilities. The aggregation of these probabilities provides an effective strategic evaluation function that is an estimate of the expected area at the end (or at other stages) of the game. Local area targets for training can be derived from datasets of human games. A system trained using only 9 × 9 amateur game data performs surprisingly well on a test set derived from 19 × 19 professional game data. Possible directions for further improvements are briefly discussed. 1

6 0.43323097 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

7 0.42712295 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

8 0.41626981 203 nips-2006-implicit Online Learning with Kernels

9 0.41479427 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games

10 0.40496075 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models

11 0.39432508 155 nips-2006-Optimal Single-Class Classification Strategies

12 0.3765119 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis

13 0.36037415 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

14 0.34734288 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

15 0.3292239 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes

16 0.3256166 196 nips-2006-TrueSkill™: A Bayesian Skill Rating System

17 0.32159215 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

18 0.31245267 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

19 0.30567575 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning

20 0.29442537 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.076), (3, 0.02), (7, 0.066), (9, 0.034), (22, 0.073), (44, 0.046), (57, 0.055), (65, 0.48), (69, 0.019), (81, 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95092618 103 nips-2006-Kernels on Structured Objects Through Nested Histograms

Author: Marco Cuturi, Kenji Fukumizu

Abstract: We propose a family of kernels for structured objects which is based on the bag-of-components paradigm. However, rather than decomposing each complex object into the single histogram of its components, we use for each object a family of nested histograms, where each histogram in this hierarchy describes the object seen from an increasingly granular perspective. We use this hierarchy of histograms to define elementary kernels which can detect coarse and fine similarities between the objects. We compute, through an efficient averaging trick, a mixture of such specific kernels to propose a final kernel value which weights local and global matches efficiently. We present experimental results on an image retrieval experiment which show that this mixture is an effective template procedure to be used with kernels on histograms.

same-paper 2 0.91917926 146 nips-2006-No-regret Algorithms for Online Convex Programs

Author: Geoffrey J. Gordon

Abstract: Online convex programming has recently emerged as a powerful primitive for designing machine learning algorithms. For example, OCP can be used for learning a linear classifier, dynamically rebalancing a binary search tree, finding the shortest path in a graph with unknown edge lengths, solving a structured classification problem, or finding a good strategy in an extensive-form game. Several researchers have designed no-regret algorithms for OCP. But, compared to algorithms for special cases of OCP such as learning from expert advice, these algorithms are not very numerous or flexible. In learning from expert advice, one tool which has proved particularly valuable is the correspondence between no-regret algorithms and convex potential functions: by reasoning about these potential functions, researchers have designed algorithms with a wide variety of useful guarantees such as good performance when the target hypothesis is sparse. Until now, there has been no such recipe for the more general OCP problem, and therefore no ability to tune OCP algorithms to take advantage of properties of the problem or data. In this paper we derive a new class of no-regret learning algorithms for OCP. These Lagrangian Hedging algorithms are based on a general class of potential functions, and are a direct generalization of known learning rules like weighted majority and external-regret matching. In addition to proving regret bounds, we demonstrate our algorithms learning to play one-card poker. 1

3 0.9010793 156 nips-2006-Ordinal Regression by Extended Binary Classification

Author: Ling Li, Hsuan-tien Lin

Abstract: We present a reduction framework from ordinal regression to binary classification based on extended examples. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranking rule from the binary classifier. A weighted 0/1 loss of the binary classifier would then bound the mislabeling cost of the ranking rule. Our framework allows not only to design good ordinal regression algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal regression from known bounds for binary classification. In addition, our framework unifies many existing ordinal regression algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms, which demonstrates the usefulness of our framework. 1

4 0.89362699 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm

Author: Robert Jenssen, Torbjørn Eltoft, Mark Girolami, Deniz Erdogmus

Abstract: We propose a new kernel-based data transformation technique. It is founded on the principle of maximum entropy (MaxEnt) preservation, hence named kernel MaxEnt. The key measure is Renyi’s entropy estimated via Parzen windowing. We show that kernel MaxEnt is based on eigenvectors, and is in that sense similar to kernel PCA, but may produce strikingly different transformed data sets. An enhanced spectral clustering algorithm is proposed, by replacing kernel PCA by kernel MaxEnt as an intermediate step. This has a major impact on performance.

5 0.8916533 170 nips-2006-Robotic Grasping of Novel Objects

Author: Ashutosh Saxena, Justin Driemeyer, Justin Kearns, Andrew Y. Ng

Abstract: We consider the problem of grasping novel objects, specifically ones that are being seen for the first time through vision. We present a learning algorithm that neither requires, nor tries to build, a 3-d model of the object. Instead it predicts, directly as a function of the images, a point at which to grasp the object. Our algorithm is trained via supervised learning, using synthetic images for the training set. We demonstrate on a robotic manipulation platform that this approach successfully grasps a wide variety of objects, such as wine glasses, duct tape, markers, a translucent box, jugs, knife-cutters, cellphones, keys, screwdrivers, staplers, toothbrushes, a thick coil of wire, a strangely shaped power horn, and others, none of which were seen in the training set. 1

6 0.62274063 203 nips-2006-implicit Online Learning with Kernels

7 0.60291392 26 nips-2006-An Approach to Bounded Rationality

8 0.59477991 61 nips-2006-Convex Repeated Games and Fenchel Duality

9 0.57357818 79 nips-2006-Fast Iterative Kernel PCA

10 0.5592801 115 nips-2006-Learning annotated hierarchies from relational data

11 0.54353935 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

12 0.54287481 82 nips-2006-Gaussian and Wishart Hyperkernels

13 0.53128898 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

14 0.52679795 65 nips-2006-Denoising and Dimension Reduction in Feature Space

15 0.5250898 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding

16 0.5244469 163 nips-2006-Prediction on a Graph with a Perceptron

17 0.5242244 117 nips-2006-Learning on Graph with Laplacian Regularization

18 0.52307445 150 nips-2006-On Transductive Regression

19 0.51770896 109 nips-2006-Learnability and the doubling dimension

20 0.51676124 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning