nips nips2006 nips2006-155 knowledge-graph by maker-knowledge-mining

155 nips-2006-Optimal Single-Class Classification Strategies


Source: pdf

Author: Ran El-Yaniv, Mordechai Nisenson

Abstract: We consider single-class classification (SCC) as a two-person game between the learner and an adversary. In this game the target distribution is completely known to the learner and the learner’s goal is to construct a classifier capable of guaranteeing a given tolerance for the false-positive error while minimizing the false negative error. We identify both “hard” and “soft” optimal classification strategies for different types of games and demonstrate that soft classification can provide a significant advantage. Our optimal strategies and bounds provide worst-case lower bounds for standard, finite-sample SCC and also motivate new approaches to solving SCC.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We consider single-class classification (SCC) as a two-person game between the learner and an adversary. [sent-7, score-0.524]

2 In this game the target distribution is completely known to the learner and the learner’s goal is to construct a classifier capable of guaranteeing a given tolerance for the false-positive error while minimizing the false negative error. [sent-8, score-0.671]

3 1 Introduction In Single-Class Classification (SCC) the learner observes a training set of examples sampled from one target class. [sent-11, score-0.389]

4 The goal is to create a classifier that can distinguish the target class from other classes, unknown to the learner during training. [sent-12, score-0.368]

5 The learner observes a training set sampled from P and should then construct a classifier capable of distinguishing the target class. [sent-20, score-0.389]

6 We view the SCC problem as a game between the learner and an adversary. [sent-21, score-0.524]

7 The adversary selects another distribution Q over Ω and then a new element of Ω is drawn from γP + (1 − γ)Q, where γ is a switching parameter (unknown to the learner). [sent-22, score-0.368]

8 The goal of the learner is to minimize the false negative error, while guaranteeing that the false positive error will be at most δ. [sent-23, score-0.478]
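
To make the game concrete, here is a minimal Python sketch (an illustration, not code from the paper) of how a single test point is generated and classified; P, Q, gamma and the rejection function r are all hypothetical inputs.

import numpy as np

rng = np.random.default_rng(0)

def draw_and_classify(P, Q, gamma, r):
    # The test point is drawn from the mixture gamma*P + (1 - gamma)*Q ...
    from_target = rng.random() < gamma
    omega = rng.choice(len(P), p=P if from_target else Q)
    # ... and the learner rejects it with probability r[omega].
    rejected = rng.random() < r[omega]
    return from_target, rejected

A false positive occurs when from_target is True and the point is rejected; a false negative occurs when from_target is False and the point is accepted.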

9 In this paper we abstract away the statistical estimation component of the problem and model a setting where the learner has a very large sample from the target class. [sent-29, score-0.389]

10 In fact, we assume that the learner knows the target distribution P precisely. [sent-30, score-0.401]

11 In particular, is the popular low-density rejection strategy optimal? [sent-34, score-0.464]

12 The partially good news is that low-density rejection is worst-case optimal, but only if the learner is confined to “hard” decision strategies. [sent-36, score-0.716]

13 In general, the worst-case optimal learner strategy should be “soft”; that is, the learner should play a randomized strategy, which could result in a very significant gain. [sent-37, score-0.811]

14 We first identify a monotonicity property of optimal SCC strategies and use it to establish the optimality of low-density rejection in the “hard” case. [sent-38, score-0.642]

15 We then show an equivalence between low-density rejection and a constrained two-class classification problem where the other class is the uniform distribution over Ω. [sent-39, score-0.408]

16 , [6]), it is necessary to assume that the “attacking distribution” has some worst-case characteristics and it is important to quantify precisely what the adversary knows or can do. [sent-44, score-0.347]

17 The simple observation in this setting is that an omniscient and unlimited adversary, who knows all parameters of the game including the learner’s strategy, would completely demolish the learner who uses hard strategies. [sent-45, score-0.762]

18 By using a soft strategy, however, the learner can achieve on average the biased coin false negative rate of 1 − δ. [sent-46, score-0.559]

19 One of our main contributions is a complete analysis of this game, including identification of the optimal strategy for the learner and the adversary, as well as the best achievable false negative rate. [sent-48, score-0.59]

20 The optimal learner strategy and best achievable rate are obtained via a solution of a linear program specified in terms of the problem parameters. [sent-49, score-0.576]

21 2 Problem Formulation The single-class classification (SCC) problem is defined as a game between the learner and an adversary. [sent-52, score-0.524]

22 The learner receives a training sample of examples from a target distribution P defined over some space Ω. [sent-53, score-0.368]

23 On the basis of this training sample, the learner should select a rejection function r : Ω → [0, 1], where for each ω ∈ Ω, rω = r(ω) is the probability with which the learner will reject ω. [sent-54, score-1.096]

24 On the basis of any knowledge of P and/or r(·), the adversary selects an attacking distribution Q, defined over Ω. [sent-55, score-0.405]

25 The rejection rate of the learner, using a rejection function r, with respect to any distribution D (over Ω), is ρ(D) ≜ ρ(r, D) = ED{r(ω)}. [sent-57, score-0.785]

26 Before the start of the game, the learner receives a tolerance parameter 0 < δ < 1, giving the maximally allowed false positive rate. [sent-64, score-0.422]

27 A rejection function r(·) is valid if its false positive rate ρ(P ) ≤ δ. [sent-65, score-0.495]

28 A valid rejection function (strategy) is optimal if it guarantees the smallest false negative rate amongst all valid strategies. [sent-66, score-0.591]
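
For finite Ω these quantities reduce to sums. A small Python sketch (assuming discrete P, Q and r given as arrays indexed by events):

def rho(r, D):
    # rejection rate rho(r, D) = E_D[r(omega)]
    return sum(d * ri for d, ri in zip(D, r))

def is_valid(r, P, delta):
    # a rejection function is delta-valid iff its false-positive rate rho(r, P) <= delta
    return rho(r, P) <= delta

def false_negative_rate(r, Q):
    # the false-negative rate against an attacking distribution Q is 1 - rho(r, Q)
    return 1.0 - rho(r, Q)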

29 We consider a model where the learner knows the target distribution P exactly, thus focusing on the decision-theoretic component in SCC. [sent-67, score-0.401]

30 Clearly, our model approximates a setting where the learner has a very large training set, but the results we obtain immediately apply, in any case, as lower bounds to other SCC instances. [sent-68, score-0.374]

31 This SCC game is a two-person zero-sum game where the payoff to the learner is ρ(Q). [sent-69, score-0.715]

32 The set Rδ(P) ≜ {r : ρ(r, P) ≤ δ} of valid rejection functions is the learner’s strategy space. [sent-70, score-0.496]

33 We are concerned with optimal learner strategies for game variants distinguished by the adversary’s knowledge of the learner’s strategy, P and/or of δ and by other limitations on Q. [sent-72, score-0.671]

34 We distinguish a special type of this game, which we call the hard setting, where the learner must deterministically reject or accept new events; that is, r : Ω → {0, 1}, and such rejection functions are termed “hard.” [sent-73, score-0.898]

35 The more general game defined above (with “soft” functions) is called the soft setting. [sent-74, score-0.316]

36 In the soft setting, given any rejection function, the learner can reduce the type II error by rejecting more (i.e., by increasing rejection probabilities). [sent-76, score-0.883]

37 3 Characterizing Monotone Rejection Functions: In this section we characterize the structure of optimal learner strategies. [sent-94, score-0.397]

38 Intuitively, it seems plausible that the learner should not assign higher rejection values to higher probability events under P . [sent-95, score-0.796]

39 That is, one may expect that a reasonable rejection function r(·) would be monotonically decreasing with probability values (i.e., pj < pk ⇒ r(j) ≥ r(k)). [sent-96, score-0.402]

40 Such monotonicity is a key justification for a very large body of SCC work, which is based on low density rejection strategies. [sent-99, score-0.411]

41 The two δ-valid hard rejection functions are r′ = (1, 0, 0) and r′′ = (0, 1, 0). [sent-107, score-0.495]

42 025, and then the optimal rejection function is (0, 0. [sent-135, score-0.447]

43 For any adversary strategy space Q, let R∗δ(P) be the set of optimal valid rejection functions, R∗δ ≜ {r ∈ Rδ(P) : minQ∈Q ρ(Q) = maxr′∈Rδ(P) minQ∈Q ρ′(Q)}. [sent-139, score-0.874]

44 The following property ensures that R∗δ will include a monotone (optimal) hard strategy, which means that the search space for the learner can be conveniently confined to monotone strategies. [sent-142, score-0.813]

45 While the set of all distributions satisfies this property, later on we will consider limited strategic adversary spaces where this property still holds. [sent-143, score-0.42]

46 This game can be viewed in extensive form (i.e., as a game tree), where in the first move the learner selects a rejection function, followed by a chance move that determines the source of the test example (P with probability γ, otherwise Q). [sent-146, score-0.959]

47 In the case where Q is selected, the adversary chooses (randomly using Q) the test example. [sent-147, score-0.314]

48 Q satisfies Property A w.r.t. P if for all j, k and Q ∈ Q such that pj < pk and qj < qk, there exists Q′ ∈ Q such that qk′ ≤ qj, qj′ ≥ qk, and for all i ≠ j, k, we have qi′ = qi. [sent-159, score-1.841]

49 (Monotone Hard Decisions) When the learner is restricted to hard decisions and Q satisfies Property A w.r.t. [sent-161, score-0.333]

50 P, then ∃r ∈ R∗δ such that pj < pk ⇒ r(j) ≥ r(k). [sent-164, score-0.396]

51 Proof: Assume by contradiction that no such rejection function exists in R∗δ. [sent-165, score-0.442]

52 Then, there must exist j, k such that pj < pk and r(j) < r(k) (otherwise r is monotone). [sent-168, score-0.396]

53 We note that ρ∗(P) = ρ(P) + pj − pk < ρ(P) ≤ δ, where ρ∗ denotes the rejection rate of r∗, the function obtained from r by swapping its values at j and k. [sent-170, score-0.396]

54 Let Q∗ ∈ Q be such that minQ ρ∗ (Q) = ρ∗ (Q∗ ) = ρ(Q∗ ) + qj − qk . [sent-171, score-0.355]

55 As long as there are more j, k pairs which need to have their rejection levels fixed, we label r = r∗ and repeat the above procedure. [sent-177, score-0.383]

56 The final r∗ is in R∗δ and satisfies pj < pk ⇒ r(j) ≥ r(k). [sent-179, score-0.396]

57 This result provides a formal justification for the low-density rejection strategy (LDRS), which is popular in the SCC literature. [sent-182, score-0.464]

58 The corresponding δ-valid low-density rejection function sets rj = 1 iff Σ{i : pi ≤ pj} pi ≤ δ. [sent-188, score-0.482]
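
A small Python sketch of this rule (using the reconstructed condition above; the paper's exact tie handling may differ):

def ldrs(P, delta):
    # reject event j iff the total P-mass of events no more likely than j is at most delta
    return [1.0 if sum(p for p in P if p <= P[j]) <= delta else 0.0
            for j in range(len(P))]

For example, ldrs([0.5, 0.3, 0.15, 0.05], 0.2) returns [0.0, 0.0, 1.0, 1.0], rejecting the two lowest-density events (total mass 0.2 ≤ δ).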

59 Q satisfies this property w.r.t. P if for all j, k and Q ∈ Q such that 0 < pj ≤ pk and qj/pj < qk/pk, there exists Q′ ∈ Q such that qj′/pj ≥ qk′/pk and for all i ≠ j, k, qi′ = qi. [sent-196, score-1.923]

60 P, then ∃r ∈ R∗δ such that: (i) pi = 0 ⇒ r(i) = 1; (ii) pj < pk ⇒ r(j) ≥ r(k); and (iii) pj = pk ⇒ r(j) = r(k). [sent-202, score-0.792]

61 We show that the low-density rejection strategy (LDRS - defined in Section 3) is optimal. [sent-204, score-0.464]

62 Q satisfies Property C w.r.t. P if for all j, k and Q ∈ Q such that pj = pk, there exists Q′ ∈ Q such that qk′ = qj, qj′ = qk, and for all i ≠ j, k, qi′ = qi. [sent-212, score-1.486]

63 Let r∗ be a δ-valid low-density rejection function (LDRS). [sent-215, score-0.383]

64 Then the two δ-valid LDRS rejection functions are r = (1, 1, 1, 0, 0) and r′ = (1, 1, 0, 1, 0). [sent-228, score-0.383]

65 Thus, any LDRS strategy is indeed worst-case optimal when the learner is willing to be confined to hard rejection functions and when the adversary’s space satisfies Property A and Property C. [sent-238, score-0.973]

66 Any classifier C(·) induces a hard rejection function r(·) by taking r(x) = 1 ⇔ C(x) = c2 . [sent-249, score-0.516]

67 Unrestricted Adversary: In the first game we analyze an adversary who is completely unrestricted. [sent-264, score-0.505]

68 For any rejection function r(·), define rmin ≜ mini r(i) and Imin(r) ≜ {i : r(i) = rmin}. [sent-267, score-0.732]

69 For any distribution D, ρ(D) = Σ_{i=1}^{N} di r(i) ≥ Σ_{i=1}^{N} di rmin = rmin; in particular, δ ≥ ρ(P) ≥ rmin and minQ ρ(Q) ≥ rmin. [sent-268, score-0.64]

70 By choosing Q such that qi = 1 for some i ∈ Imin(r), the adversary can achieve ρ(Q) = rmin (the same rejection rate is achieved by taking any Q with qi = 0 for all i ∉ Imin(r)). [sent-269, score-1.216]

71 In the soft setting, minQ ρ(Q) is maximized by the rejection function rδ(i) = δ for all pi > 0 (and rδ(i) = 1 for all pi = 0). This is equivalent to flipping a δ-biased coin for non-null events (under P). [sent-270, score-0.738]

72 In the hard setting, clearly rmin = 0 (otherwise rmin = 1 and 1 > δ ≥ ρ(P) = 1, a contradiction), and the best achievable Type II Error is precisely 1. [sent-272, score-0.344]
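
A tiny numeric illustration of this gap (hypothetical P and δ):

def worst_case_type_II(r):
    # against an unrestricted adversary, min_Q rho(Q) = min_i r(i),
    # so the guaranteed Type II error is 1 - min_i r(i)
    return 1.0 - min(r)

delta = 0.1
P = [0.5, 0.3, 0.2]
r_soft = [delta, delta, delta]        # delta-valid soft strategy r_delta
r_hard = [0.0, 0.0, 0.0]              # the only delta-valid hard strategy for this P
print(worst_case_type_II(r_soft))     # 0.9  (= 1 - delta)
print(worst_case_type_II(r_hard))     # 1.0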

73 This simple analysis shows the futility of the SCC game when the adversary is too powerful. [sent-274, score-0.534]

74 Following similar results in hypothesis testing, we would like to consider games in which the adversary must select Q such that D(P ||Q) ≥ Λ, for some constant Λ > 0, where D(·||·) is the KL-divergence. [sent-282, score-0.34]

75 In this case the adversary can optimally play the same strategy as in the unrestricted game while meeting the KL-divergence constraint. [sent-284, score-0.637]

76 We note, as usual, that the learner can (and should) reject with probability 1 all null events under P. [sent-286, score-0.441]

77 Thus, an adversary would be foolish to choose a distribution Q that has any probability for these events. [sent-287, score-0.333]

78 5 there exists a monotone r ∈ R∗ (in both the hard and soft settings) and by Corollary 4. [sent-299, score-0.417]

79 Then, the adversary can play the same strategy as in the unrestricted game, and the learner should select rδ as before. [sent-302, score-0.779]

80 Similarly, if the optimal r is such that there exists j ∈ Imin (r) (that is r(j) = rmin ) and pj ≤ 2−Λ , then a distribution Q that is completely concentrated on j has D(Q||P ) ≥ Λ and achieves ρ(Q) = rmin as in the unrestricted game. [sent-304, score-0.668]
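
A quick check of this threshold (a sketch; base-2 logarithm assumed, matching the 2^{-Λ} condition in the text; P and Λ below are hypothetical):

import math

def kl_of_point_mass(j, P):
    # D(Q||P) with Q concentrated on event j equals log2(1 / P[j])
    return math.log2(1.0 / P[j])

P = [0.5, 0.25, 0.125, 0.125]
Lam = 2.0
for j, pj in enumerate(P):
    print(j, kl_of_point_mass(j, P), pj <= 2 ** (-Lam))   # constraint met iff p_j <= 2^-Lambda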

81 We begin our analysis of the game by identifying some useful characteristics of optimal adversary strategies in Lemma 5. [sent-307, score-0.652]

82 We then derive a linear program that computes the optimal rejection function. [sent-312, score-0.475]

83 If Q minimizes ρ(Q) and meets the constraint D(Q||P) ≥ Λ then: (i) D(Q||P) = Λ; (ii) pj < pk and qk > 0 ⇒ r(j) > r(k); (iii) pj < pk and qj > 0 ⇒ qj log(qj/pj) + qk log(qk/pk) > (qj + qk) log((qj + qk)/pk); (iv) pj < pk and qj > 0 ⇒ qj/pj > qk/pk; and (v) qj, qk > 0 ⇒ pj = pk. [sent-315, score-4.676]

84 By part (i) above, Q is a global minimizer of ρ(Q) subject to the constraints Σ_{i=1}^{J} qi log(qi/pi) = Λ, qi > 0 (i = 1, . . . , J), and Σ_{i=1}^{J} qi = 1. [sent-330, score-0.437]

85 The Lagrangian of this problem is L(Q, λ) = Σ_{i=1}^{J} r(i) qi + λ1 (Σ_{i=1}^{J} qi log(qi/pi) − Λ) + λ2 (Σ_{i=1}^{J} qi − 1). [sent-334, score-0.607]
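
A useful intermediate step, not spelled out in the excerpt (assuming natural logarithms; a base-2 logarithm only rescales λ1): setting ∂L/∂qi = 0 for qi > 0 gives

    r(i) + λ1 (log(qi/pi) + 1) + λ2 = 0,   i.e.,   qi ∝ pi · exp(−r(i)/λ1),

so any interior stationary point of the constrained problem reweights P exponentially in the rejection values r(i).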

86 Therefore, it has an infinite number of roots in any convex region. (Footnote: for any pair j, k such that pj ≤ pk, D(Q||P) does not decrease by transferring all the probability from k to j in Q: qj log(qj/pj) + qk log(qk/pk) ≤ (qj + qk) log((qj + qk)/pj).) [sent-353, score-1.397]

87 As already noted, it is sufficient for the learner to consider only monotone rejection functions. [sent-362, score-0.856]

88 Since for these functions pj = pk ⇒ r(j) = r(k), the learner can partition Ω into K = K(P) event subsets, which correspond, by probability, to “level sets”, S1, S2, . . . , SK. [sent-363, score-0.763]

89 The rejection function can then be summarized by the values r1, . . . , rK, representing the rejection rate assigned to each of the K level sets (∀ω ∈ Si, r(ω) = ri). [sent-371, score-0.445]

90 The optimal Q which the adversary selects will have an effective support of size 2 at most. [sent-374, score-0.453]

91 Then, there is a single solution to ql log(ql/PSl) + (1 − ql) log((1 − ql)/PSh) = Λ, where ql and 1 − ql are the probabilities that Q assigns to the events from Sl and Sh, respectively. [sent-381, score-0.459]
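
A root-finding sketch for this scalar equation (it assumes the reconstructed form above, that the logarithm base matches the one used for Λ, and that a root exists in (0, 1), as the text asserts for the relevant level-set pairs):

import math
from scipy.optimize import brentq

def solve_ql(P_Sl, P_Sh, Lam):
    # solve q*log(q/P_Sl) + (1-q)*log((1-q)/P_Sh) = Lam for q in (0, 1)
    f = lambda q: q * math.log(q / P_Sl) + (1 - q) * math.log((1 - q) / P_Sh) - Lam
    return brentq(f, 1e-12, 1.0 - 1e-12)   # requires a sign change on the bracket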

92 Therefore, the adversary’s choice of an optimal distribution, Q, must have one of |L||H| + |M | ≤ 2 ⌈ K ⌉ (possibly different) rejection rates. [sent-383, score-0.447]

93 We introduce an additional variable, z, to represent the max-min rejection rate. [sent-388, score-0.383]

94 An optimal soft rejection function, and the lower bound 1 − z on the Type II Error, are obtained by solving the following linear program: maximize z over r1, r2, . . . , rK. [sent-390, score-0.572]
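
The excerpt does not reproduce the program itself; the sketch below shows the kind of LP the surrounding sentences describe (maximize the max-min rejection rate z over the level-set rejection rates, subject to δ-validity and to achieving rate at least z against each of the finitely many candidate adversary responses). The candidate set Q_candidates and the exact constraint layout are assumptions, not the paper's theorem.

import numpy as np
from scipy.optimize import linprog

def solve_soft_lp(P_S, Q_candidates, delta):
    # variables: (r_1, ..., r_K, z); minimizing -z is the same as maximizing z
    K = len(P_S)
    c = np.zeros(K + 1)
    c[-1] = -1.0
    A_ub = [np.append(P_S, 0.0)]                       # validity: sum_k r_k P(S_k) <= delta
    b_ub = [delta]
    for Q in Q_candidates:                             # z - sum_k r_k Q(S_k) <= 0
        A_ub.append(np.append(-np.asarray(Q, dtype=float), 1.0))
        b_ub.append(0.0)
    bounds = [(0.0, 1.0)] * K + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:K], res.x[-1]                        # optimal (r_1..r_K, z)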

95 Numerical Examples: We now compare the performance of hard and soft rejection strategies for this constrained game (D(Q||P) ≥ Λ) for various values of Λ and two different families of target distributions P, each over a support of size N = 50. [sent-399, score-0.998]

96 For each such P we solved for the optimal hard and soft strategies and computed the corresponding worst-case optimal type II error, 1 − ρ(Q). [sent-402, score-0.49]

97 It is evident that both the soft and hard strategies are ineffective for small Λ. [sent-410, score-0.32]

98 This approach lends itself well to analysis, allowing us to prove under what conditions low-density rejection is hard-optimal and whether an optimal monotone rejection function is guaranteed to exist. [sent-446, score-0.97]

99 For example, the equivalence of low-density rejection to the Bayesian binary problem as shown in Section 3. [sent-451, score-0.383]

100 A very interesting setting to consider is one in which the adversary has partial knowledge of the problem parameters and the learner’s strategy. [sent-455, score-0.335]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('scc', 0.407), ('rejection', 0.383), ('learner', 0.333), ('adversary', 0.314), ('pk', 0.203), ('pj', 0.193), ('qj', 0.193), ('game', 0.191), ('qi', 0.17), ('qk', 0.162), ('minq', 0.16), ('rmin', 0.16), ('ldrs', 0.145), ('monotone', 0.14), ('soft', 0.125), ('hard', 0.112), ('ql', 0.088), ('strategies', 0.083), ('strategy', 0.081), ('pi', 0.074), ('imin', 0.073), ('property', 0.065), ('optimal', 0.064), ('events', 0.061), ('false', 0.061), ('lemma', 0.059), ('ii', 0.051), ('achievable', 0.051), ('omniscient', 0.051), ('unrestricted', 0.051), ('novelty', 0.046), ('type', 0.042), ('exists', 0.04), ('technion', 0.038), ('iv', 0.037), ('ps', 0.035), ('theorem', 0.035), ('target', 0.035), ('israel', 0.034), ('classi', 0.034), ('event', 0.034), ('selects', 0.033), ('knows', 0.033), ('intrusion', 0.032), ('meets', 0.032), ('detection', 0.032), ('valid', 0.032), ('violation', 0.03), ('pr', 0.03), ('futility', 0.029), ('markou', 0.029), ('mini', 0.029), ('program', 0.028), ('constraint', 0.028), ('reject', 0.028), ('tolerance', 0.028), ('monotonicity', 0.028), ('satis', 0.027), ('discretized', 0.027), ('games', 0.026), ('attacking', 0.025), ('extremum', 0.025), ('constrained', 0.025), ('rj', 0.025), ('er', 0.025), ('motivates', 0.024), ('log', 0.023), ('justi', 0.023), ('sl', 0.023), ('sm', 0.023), ('guaranteeing', 0.023), ('lambda', 0.023), ('conveniently', 0.023), ('families', 0.023), ('ri', 0.022), ('unlimited', 0.021), ('observes', 0.021), ('coin', 0.021), ('switching', 0.021), ('support', 0.021), ('distributions', 0.021), ('setting', 0.021), ('clearly', 0.021), ('induces', 0.021), ('es', 0.021), ('level', 0.021), ('rk', 0.021), ('corollary', 0.021), ('effective', 0.021), ('literature', 0.021), ('strategic', 0.02), ('sh', 0.02), ('immediately', 0.02), ('optimality', 0.019), ('probability', 0.019), ('cl', 0.019), ('contradiction', 0.019), ('anomaly', 0.019), ('otherwise', 0.019), ('rate', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 155 nips-2006-Optimal Single-Class Classification Strategies

Author: Ran El-Yaniv, Mordechai Nisenson

Abstract: We consider single-class classification (SCC) as a two-person game between the learner and an adversary. In this game the target distribution is completely known to the learner and the learner’s goal is to construct a classifier capable of guaranteeing a given tolerance for the false-positive error while minimizing the false negative error. We identify both “hard” and “soft” optimal classification strategies for different types of games and demonstrate that soft classification can provide a significant advantage. Our optimal strategies and bounds provide worst-case lower bounds for standard, finite-sample SCC and also motivate new approaches to solving SCC.

2 0.14516798 26 nips-2006-An Approach to Bounded Rationality

Author: Eli Ben-sasson, Ehud Kalai, Adam Kalai

Abstract: A central question in game theory and artificial intelligence is how a rational agent should behave in a complex environment, given that it cannot perform unbounded computations. We study strategic aspects of this question by formulating a simple model of a game with additional costs (computational or otherwise) for each strategy. First we connect this to zero-sum games, proving a counter-intuitive generalization of the classic min-max theorem to zero-sum games with the addition of strategy costs. We then show that potential games with strategy costs remain potential games. Both zero-sum and potential games with strategy costs maintain a very appealing property: simple learning dynamics converge to equilibrium.

3 0.095130175 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

Author: Sridevi Parise, Max Welling

Abstract: Scoring structures of undirected graphical models by means of evaluating the marginal likelihood is very hard. The main reason is the presence of the partition function which is intractable to evaluate, let alone integrate over. We propose to approximate the marginal likelihood by employing two levels of approximation: we assume normality of the posterior (the Laplace approximation) and approximate all remaining intractable quantities using belief propagation and the linear response approximation. This results in a fast procedure for model scoring. Empirically, we find that our procedure has about two orders of magnitude better accuracy than standard BIC methods for small datasets, but deteriorates when the size of the dataset grows.

4 0.093705714 81 nips-2006-Game Theoretic Algorithms for Protein-DNA binding

Author: Luis Pérez-breva, Luis E. Ortiz, Chen-hsiang Yeang, Tommi S. Jaakkola

Abstract: We develop and analyze game-theoretic algorithms for predicting coordinate binding of multiple DNA binding regulators. The allocation of proteins to local neighborhoods and to sites is carried out with resource constraints while explicating competing and coordinate binding relations among proteins with affinity to the site or region. The focus of this paper is on mathematical foundations of the approach. We also briefly demonstrate the approach in the context of the λ-phage switch.

5 0.089302361 61 nips-2006-Convex Repeated Games and Fenchel Duality

Author: Shai Shalev-shwartz, Yoram Singer

Abstract: We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can be all derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization.
t=1 Our main result is an algorithmic framework for the first player which guarantees low regret with respect to any vector u ∈ S. Specifically, we derive regret bounds that take the following form ∀u ∈ S, 1 T T gt (wt ) − t=1 1 T T gt (u) ≤ t=1 f (u) + L √ , T (1) where f : S → R and L ∈ R+ . Informally, the function f measures the “complexity” of vectors in S and the scalar L is related to some generalized Lipschitz property of the functions g1 , . . . , gT . We defer the exact requirements we impose on f and L to later sections. Our algorithmic framework emerges from a representation of the regret bound given in Eq. (1) using an optimization problem. Specifically, we rewrite Eq. (1) as follows 1 T T gt (wt ) ≤ inf t=1 u∈S 1 T T gt (u) + t=1 f (u) + L √ . T (2) That is, the average loss of the first player should be bounded above by the minimum value of an optimization problem in which we jointly minimize the average loss of u and the “complexity” of u as measured by the function f . Note that the optimization problem on the right-hand side of Eq. (2) can only be solved in hindsight after observing the entire sequence of loss functions. Nevertheless, writing the regret bound as in Eq. (2) implies that the average loss of the first player forms a lower bound for a minimization problem. The notion of duality, commonly used in convex optimization theory, plays an important role in obtaining lower bounds for the minimal value of a minimization problem (see for example [14]). By generalizing the notion of Fenchel duality, we are able to derive a dual optimization problem, which can be optimized incrementally, as the game progresses. In order to derive explicit quantitative regret bounds we make an immediate use of the fact that dual objective lower bounds the primal objective. We therefore reduce the process of playing convex repeated games to the task of incrementally increasing the dual objective function. The amount by which the dual increases serves as a new and natural notion of progress. By doing so we are able to tie the primal objective value, the average loss of the first player, and the increase in the dual. The rest of this paper is organized as follows. In Sec. 2 we establish our notation and point to a few mathematical tools that we use throughout the paper. Our main tool for deriving algorithms for playing convex repeated games is a generalization of Fenchel duality, described in Sec. 3. Our algorithmic framework is given in Sec. 4 and analyzed in Sec. 5. The generality of our framework allows us to utilize it in different problems arising in machine learning. Specifically, in Sec. 6 we underscore the applicability of our framework for online learning and in Sec. 7 we outline and analyze boosting algorithms based on our framework. We conclude with a discussion and point to related work in Sec. 8. Due to the lack of space, some of the details are omitted from the paper and can be found in [16]. 2 Mathematical Background We denote scalars with lower case letters (e.g. x and w), and vectors with bold face letters (e.g. x and w). The inner product between vectors x and w is denoted by x, w . Sets are designated by upper case letters (e.g. S). The set of non-negative real numbers is denoted by R+ . For any k ≥ 1, the set of integers {1, . . . , k} is denoted by [k]. A norm of a vector x is denoted by x . The dual norm is defined as λ = sup{ x, λ : x ≤ 1}. 
For example, the Euclidean norm, x 2 = ( x, x )1/2 is dual to itself and the 1 norm, x 1 = i |xi |, is dual to the ∞ norm, x ∞ = maxi |xi |. We next recall a few definitions from convex analysis. The reader familiar with convex analysis may proceed to Lemma 1 while for a more thorough introduction see for example [1]. A set S is convex if for any two vectors w1 , w2 in S, all the line between w1 and w2 is also within S. That is, for any α ∈ [0, 1] we have that αw1 + (1 − α)w2 ∈ S. A set S is open if every point in S has a neighborhood lying in S. A set S is closed if its complement is an open set. A function f : S → R is closed and convex if for any scalar α ∈ R, the level set {w : f (w) ≤ α} is closed and convex. The Fenchel conjugate of a function f : S → R is defined as f (θ) = supw∈S w, θ − f (w) . If f is closed and convex then the Fenchel conjugate of f is f itself. The Fenchel-Young inequality states that for any w and θ we have that f (w) + f (θ) ≥ w, θ . A vector λ is a sub-gradient of a function f at w if for all w ∈ S we have that f (w ) − f (w) ≥ w − w, λ . The differential set of f at w, denoted ∂f (w), is the set of all sub-gradients of f at w. If f is differentiable at w then ∂f (w) consists of a single vector which amounts to the gradient of f at w and is denoted by f (w). Sub-gradients play an important role in the definition of Fenchel conjugate. In particular, the following lemma states that if λ ∈ ∂f (w) then Fenchel-Young inequality holds with equality. Lemma 1 Let f be a closed and convex function and let ∂f (w ) be its differential set at w . Then, for all λ ∈ ∂f (w ) we have, f (w ) + f (λ ) = λ , w . A continuous function f is σ-strongly convex over a convex set S with respect to a norm · if S is contained in the domain of f and for all v, u ∈ S and α ∈ [0, 1] we have 1 (3) f (α v + (1 − α) u) ≤ α f (v) + (1 − α) f (u) − σ α (1 − α) v − u 2 . 2 Strongly convex functions play an important role in our analysis primarily due to the following lemma. Lemma 2 Let · be a norm over Rn and let · be its dual norm. Let f be a σ-strongly convex function on S and let f be its Fenchel conjugate. Then, f is differentiable with f (θ) = arg maxx∈S θ, x − f (x). Furthermore, for any θ, λ ∈ Rn we have 1 f (θ + λ) − f (θ) ≤ f (θ), λ + λ 2 . 2σ Two notable examples of strongly convex functions which we use are as follows. 1 Example 1 The function f (w) = 2 w norm. Its conjugate function is f (θ) = 2 2 1 2 is 1-strongly convex over S = Rn with respect to the θ 2. 2 2 n 1 Example 2 The function f (w) = i=1 wi log(wi / n ) is 1-strongly convex over the probabilistic n simplex, S = {w ∈ R+ : w 1 = 1}, with respect to the 1 norm. Its conjugate function is n 1 f (θ) = log( n i=1 exp(θi )). 3 Generalized Fenchel Duality In this section we derive our main analysis tool. We start by considering the following optimization problem, T inf c f (w) + t=1 gt (w) , w∈S where c is a non-negative scalar. An equivalent problem is inf w0 ,w1 ,...,wT c f (w0 ) + T t=1 gt (wt ) s.t. w0 ∈ S and ∀t ∈ [T ], wt = w0 . Introducing T vectors λ1 , . . . , λT , each λt ∈ Rn is a vector of Lagrange multipliers for the equality constraint wt = w0 , we obtain the following Lagrangian T T L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) = c f (w0 ) + t=1 gt (wt ) + t=1 λt , w0 − wt . The dual problem is the task of maximizing the following dual objective value, D(λ1 , . . . , λT ) = inf L(w0 , w1 , . . . , wT , λ1 , . . . 
, λT ) w0 ∈S,w1 ,...,wT = − c sup w0 ∈S = −c f −1 c w0 , − 1 c T t=1 T t=1 λt − λt − f (w0 ) − T t=1 gt (λt ) , T t=1 sup ( wt , λt − gt (wt )) wt where, following the exposition of Sec. 2, f , g1 , . . . , gT are the Fenchel conjugate functions of f, g1 , . . . , gT . Therefore, the generalized Fenchel dual problem is sup − cf λ1 ,...,λT −1 c T t=1 λt − T t=1 gt (λt ) . (4) Note that when T = 1 and c = 1, the above duality is the so called Fenchel duality. 4 A Template Learning Algorithm for Convex Repeated Games In this section we describe a template learning algorithm for playing convex repeated games. As mentioned before, we study convex repeated games from the viewpoint of the first player which we shortly denote as P1. Recall that we would like our learning algorithm to achieve a regret bound of the form given in Eq. (2). We start by rewriting Eq. (2) as follows T m gt (wt ) − c L ≤ inf u∈S t=1 c f (u) + gt (u) , (5) t=1 √ where c = T . Thus, up to the sublinear term c L, the cumulative loss of P1 lower bounds the optimum of the minimization problem on the right-hand side of Eq. (5). In the previous section we derived the generalized Fenchel dual of the right-hand side of Eq. (5). Our construction is based on the weak duality theorem stating that any value of the dual problem is smaller than the optimum value of the primal problem. The algorithmic framework we propose is therefore derived by incrementally ascending the dual objective function. Intuitively, by ascending the dual objective we move closer to the optimal primal value and therefore our performance becomes similar to the performance of the best fixed weight vector which minimizes the right-hand side of Eq. (5). Initially, we use the elementary dual solution λ1 = 0 for all t. We assume that inf w f (w) = 0 and t for all t inf w gt (w) = 0 which imply that D(λ1 , . . . , λ1 ) = 0. We assume in addition that f is 1 T σ-strongly convex. Therefore, based on Lemma 2, the function f is differentiable. At trial t, P1 uses for prediction the vector wt = f −1 c T i=1 λt i . (6) After predicting wt , P1 receives the function gt and suffers the loss gt (wt ). Then, P1 updates the dual variables as follows. Denote by ∂t the differential set of gt at wt , that is, ∂t = {λ : ∀w ∈ S, gt (w) − gt (wt ) ≥ λ, w − wt } . (7) The new dual variables (λt+1 , . . . , λt+1 ) are set to be any set of vectors which satisfy the following 1 T two conditions: (i). ∃λ ∈ ∂t s.t. D(λt+1 , . . . , λt+1 ) ≥ D(λt , . . . , λt , λ , λt , . . . , λt ) 1 1 t−1 t+1 T T (ii). ∀i > t, λt+1 = 0 i . (8) In the next section we show that condition (i) ensures that the increase of the dual at trial t is proportional to the loss gt (wt ). The second condition ensures that we can actually calculate the dual at trial t without any knowledge on the yet to be seen loss functions gt+1 , . . . , gT . We conclude this section with two update rules that trivially satisfy the above two conditions. The first update scheme simply finds λ ∈ ∂t and set λt+1 = i λ λt i if i = t if i = t . (9) The second update defines (λt+1 , . . . , λt+1 ) = argmax D(λ1 , . . . , λT ) 1 T λ1 ,...,λT s.t. ∀i = t, λi = λt . i (10) 5 Analysis In this section we analyze the performance of the template algorithm given in the previous section. Our proof technique is based on monitoring the value of the dual objective function. The main result is the following lemma which gives upper and lower bounds for the final value of the dual objective function. 
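Before stating the lemma, here is a minimal sketch of the template algorithm of Sec. 4 under specific illustrative choices: the complexity function f(w) = (1/2)||w||^2 of Example 1 (for which grad f* is the identity) and the simple dual update of Eq. (9). Under these assumptions the primal iterates reduce to online subgradient descent with step size 1/c; the hinge-loss oracle is only an illustrative stand-in for the loss functions g_t, and all names are ours, not the paper's.

```python
import numpy as np

def play_convex_repeated_game(oracles, dim, c):
    """Play T rounds of the convex repeated game with f(w) = 0.5 * ||w||^2.

    oracles: list of T callables; the t-th callable maps the primal vector w_t
        to (g_t(w_t), a subgradient of g_t at w_t).
    dim: dimension of the primal space S = R^n.
    c:   trade-off scalar (the analysis suggests c = sqrt(T)).
    """
    theta = np.zeros(dim)        # theta_t = -(1/c) * (sum of dual vectors chosen so far)
    total_loss = 0.0
    for oracle in oracles:
        w_t = theta              # Eq. (6): w_t = grad f*(theta_t); here grad f* is the identity
        loss, lam = oracle(w_t)  # suffer g_t(w_t); lam lies in the differential set of g_t at w_t
        total_loss += loss
        theta = theta - lam / c  # the update of Eq. (9), folded into theta
    return total_loss

def hinge_oracle(x, y):
    """Online-learning example: g_t(w) = [1 - y_t <w, x_t>]_+ (hinge loss)."""
    def oracle(w):
        margin = y * np.dot(w, x)
        if margin < 1.0:
            return 1.0 - margin, -y * x   # a subgradient when the example is violated
        return 0.0, np.zeros_like(x)      # zero subgradient otherwise
    return oracle
```

With c = sqrt(T) this specialization recovers the familiar O(1/sqrt(T)) average-regret behaviour promised by Eq. (2).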
Lemma 3 Let f be a σ-strongly convex function with respect to a norm · over a set S and assume that minw∈S f (w) = 0. Let g1 , . . . , gT be a sequence of convex and closed functions such that inf w gt (w) = 0 for all t ∈ [T ]. Suppose that a dual-incrementing algorithm which satisfies the conditions of Eq. (8) is run with f as a complexity function on the sequence g1 , . . . , gT . Let w1 , . . . , wT be the sequence of primal vectors that the algorithm generates and λT +1 , . . . , λT +1 1 T be its final sequence of dual variables. Then, there exists a sequence of sub-gradients λ1 , . . . , λT , where λt ∈ ∂t for all t, such that T 1 gt (wt ) − 2σc t=1 T T λt 2 ≤ D(λT +1 , . . . , λT +1 ) 1 T t=1 ≤ inf c f (w) + w∈S gt (w) . t=1 Proof The second inequality follows directly from the weak duality theorem. Turning to the left most inequality, denote ∆t = D(λt+1 , . . . , λt+1 ) − D(λt , . . . , λt ) and note that 1 1 T T T D(λ1 +1 , . . . , λT +1 ) can be rewritten as T T t=1 D(λT +1 , . . . , λT +1 ) = 1 T T t=1 ∆t − D(λ1 , . . . , λ1 ) = 1 T ∆t , (11) where the last equality follows from the fact that f (0) = g1 (0) = . . . = gT (0) = 0. The definition of the update implies that ∆t ≥ D(λt , . . . , λt , λt , 0, . . . , 0) − D(λt , . . . , λt , 0, 0, . . . , 0) for 1 t−1 1 t−1 t−1 some subgradient λt ∈ ∂t . Denoting θ t = − 1 j=1 λj , we now rewrite the lower bound on ∆t as, c ∆t ≥ −c (f (θ t − λt /c) − f (θ t )) − gt (λt ) . Using Lemma 2 and the definition of wt we get that 1 (12) ∆t ≥ wt , λt − gt (λt ) − 2 σ c λt 2 . Since λt ∈ ∂t and since we assume that gt is closed and convex, we can apply Lemma 1 to get that wt , λt − gt (λt ) = gt (wt ). Plugging this equality into Eq. (12) and summing over t we obtain that T T T 1 2 . t=1 ∆t ≥ t=1 gt (wt ) − 2 σ c t=1 λt Combining the above inequality with Eq. (11) concludes our proof. The following regret bound follows as a direct corollary of Lemma 3. T 1 Theorem 1 Under the same conditions of Lemma 3. Denote L = T t=1 λt w ∈ S we have, T T c f (w) 1 1 + 2L c . t=1 gt (wt ) − T t=1 gt (w) ≤ T T σ √ In particular, if c = T , we obtain the bound, 1 T 6 T t=1 gt (wt ) − 1 T T t=1 gt (w) ≤ f (w)+L/(2 σ) √ T 2 . Then, for all . Application to Online learning In Sec. 1 we cast the task of online learning as a convex repeated game. We now demonstrate the applicability of our algorithmic framework for the problem of instance ranking. We analyze this setting since several prediction problems, including binary classification, multiclass prediction, multilabel prediction, and label ranking, can be cast as special cases of the instance ranking problem. Recall that on each online round, the learner receives a question-answer pair. In instance ranking, the question is encoded by a matrix Xt of dimension kt × n and the answer is a vector yt ∈ Rkt . The semantic of yt is as follows. For any pair (i, j), if yt,i > yt,j then we say that yt ranks the i’th row of Xt ahead of the j’th row of Xt . We also interpret yt,i − yt,j as the confidence in which the i’th row should be ranked ahead of the j’th row. For example, each row of Xt encompasses a representation of a movie while yt,i is the movie’s rating, expressed as the number of stars this movie has received by a movie reviewer. The predictions of the learner are determined ˆ based on a weight vector wt ∈ Rn and are defined to be yt = Xt wt . Finally, let us define two loss functions for ranking, both generalize the hinge-loss used in binary classification problems. Denote by Et the set {(i, j) : yt,i > yt,j }. 
For all (i, j) ∈ Et we define a pair-based hinge-loss i,j (w; (Xt , yt )) = [(yt,i − yt,j ) − w, xt,i − xt,j ]+ , where [a]+ = max{a, 0} and xt,i , xt,j are respectively the i’th and j’th rows of Xt . Note that i,j is zero if w ranks xt,i higher than xt,j with a sufficient confidence. Ideally, we would like i,j (wt ; (Xt , yt )) to be zero for all (i, j) ∈ Et . If this is not the case, we are being penalized according to some combination of the pair-based losses i,j . For example, we can set (w; (Xt , yt )) to be the average over the pair losses, 1 avg (w; (Xt , yt )) = |Et | (i,j)∈Et i,j (w; (Xt , yt )) . This loss was suggested by several authors (see for example [18]). Another popular approach (see for example [5]) penalizes according to the maximal loss over the individual pairs, max (w; (Xt , yt )) = max(i,j)∈Et i,j (w; (Xt , yt )) . We can apply our algorithmic framework given in Sec. 4 for ranking, using for gt (w) either avg (w; (Xt , yt )) or max (w; (Xt , yt )). The following theorem provides us with a sufficient condition under which the regret bound from Thm. 1 holds for ranking as well. Theorem 2 Let f be a σ-strongly convex function over S with respect to a norm · . Denote by Lt the maximum over (i, j) ∈ Et of xt,i − xt,j 2 . Then, for both gt (w) = avg (w; (Xt , yt )) and ∗ gt (w) = max (w; (Xt , yt )), the following regret bound holds ∀u ∈ S, 7 1 T T t=1 gt (wt ) − 1 T T t=1 gt (u) ≤ 1 f (u)+ T PT t=1 Lt /(2 σ) √ T . The Boosting Game In this section we describe the applicability of our algorithmic framework to the analysis of boosting algorithms. A boosting algorithm uses a weak learning algorithm that generates weak-hypotheses whose performances are just slightly better than random guessing to build a strong-hypothesis which can attain an arbitrarily low error. The AdaBoost algorithm, proposed by Freund and Schapire [6], receives as input a training set of examples {(x1 , y1 ), . . . , (xm , ym )} where for all i ∈ [m], xi is taken from an instance domain X , and yi is a binary label, yi ∈ {+1, −1}. The boosting process proceeds in a sequence of consecutive trials. At trial t, the booster first defines a distribution, denoted wt , over the set of examples. Then, the booster passes the training set along with the distribution wt to the weak learner. The weak learner is assumed to return a hypothesis ht : X → {+1, −1} whose average error is slightly smaller than 1 . That is, there exists a constant γ > 0 such that, 2 def m 1−yi ht (xi ) = ≤ 1 −γ . (13) i=1 wt,i 2 2 The goal of the boosting algorithm is to invoke the weak learner several times with different distributions, and to combine the hypotheses returned by the weak learner into a final, so called strong, hypothesis whose error is small. The final hypothesis combines linearly the T hypotheses returned by the weak learner with coefficients α1 , . . . , αT , and is defined to be the sign of hf (x) where T hf (x) = t=1 αt ht (x) . The coefficients α1 , . . . , αT are determined by the booster. In Ad1 1 aBoost, the initial distribution is set to be the uniform distribution, w1 = ( m , . . . , m ). At iter1 ation t, the value of αt is set to be 2 log((1 − t )/ t ). The distribution is updated by the rule wt+1,i = wt,i exp(−αt yi ht (xi ))/Zt , where Zt is a normalization factor. Freund and Schapire [6] have shown that under the assumption given in Eq. (13), the error of the final strong hypothesis is at most exp(−2 γ 2 T ). 
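The boosting process just described can be summarized in a short sketch. The weak learner is treated as a black box supplied by the caller; everything else follows the updates stated above (uniform initial distribution, alpha_t = (1/2) log((1 - eps_t)/eps_t), multiplicative reweighting with normalization). Function and variable names are illustrative.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """AdaBoost as described above; weak_learner(X, y, w) returns h: X -> {+1, -1}."""
    m = len(y)
    w = np.full(m, 1.0 / m)                 # initial distribution w_1 = (1/m, ..., 1/m)
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, w)           # weak hypothesis for the current distribution
        preds = np.array([h(x) for x in X])
        eps = float(np.sum(w * (preds != y)))   # weighted error (assumed strictly in (0, 1/2))
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        w = w * np.exp(-alpha * y * preds)  # multiplicative update
        w = w / w.sum()                     # normalization by Z_t
        hypotheses.append(h)
        alphas.append(alpha)

    def strong_hypothesis(x):               # h_f(x) = sign( sum_t alpha_t h_t(x) )
        return np.sign(sum(a * h(x) for a, h in zip(alphas, hypotheses)))
    return strong_hypothesis
```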
t Several authors [15, 13, 8, 4] have proposed to view boosting as a coordinate-wise greedy optimization process. To do so, note first that hf errs on an example (x, y) iff y hf (x) ≤ 0. Therefore, the exp-loss function, defined as exp(−y hf (x)), is a smooth upper bound of the zero-one error, which equals to 1 if y hf (x) ≤ 0 and to 0 otherwise. Thus, we can restate the goal of boosting as minimizing the average exp-loss of hf over the training set with respect to the variables α1 , . . . , αT . To simplify our derivation in the sequel, we prefer to say that boosting maximizes the negation of the loss, that is, T m 1 (14) max − m i=1 exp −yi t=1 αt ht (xi ) . α1 ,...,αT In this view, boosting is an optimization procedure which iteratively maximizes Eq. (14) with respect to the variables α1 , . . . , αT . This view of boosting, enables the hypotheses returned by the weak learner to be general functions into the reals, ht : X → R (see for instance [15]). In this paper we view boosting as a convex repeated game between a booster and a weak learner. To motivate our construction, we would like to note that boosting algorithms define weights in two different domains: the vectors wt ∈ Rm which assign weights to examples and the weights {αt : t ∈ [T ]} over weak-hypotheses. In the terminology used throughout this paper, the weights wt ∈ Rm are primal vectors while (as we show in the sequel) each weight αt of the hypothesis ht is related to a dual vector λt . In particular, we show that Eq. (14) is exactly the Fenchel dual of a primal problem for a convex repeated game, thus the algorithmic framework described thus far for playing games naturally fits the problem of iteratively solving Eq. (14). To derive the primal problem whose Fenchel dual is the problem given in Eq. (14) let us first denote by vt the vector in Rm whose ith element is vt,i = yi ht (xi ). For all t, we set gt to be the function gt (w) = [ w, vt ]+ . Intuitively, gt penalizes vectors w which assign large weights to examples which are predicted accurately, that is yi ht (xi ) > 0. In particular, if ht (xi ) ∈ {+1, −1} and wt is a distribution over the m examples (as is the case in AdaBoost), gt (wt ) reduces to 1 − 2 t (see Eq. (13)). In this case, minimizing gt is equivalent to maximizing the error of the individual T hypothesis ht over the examples. Consider the problem of minimizing c f (w) + t=1 gt (w) where f (w) is the relative entropy given in Example 2 and c = 1/(2 γ) (see Eq. (13)). To derive its Fenchel dual, we note that gt (λt ) = 0 if there exists βt ∈ [0, 1] such that λt = βt vt and otherwise gt (λt ) = ∞ (see [16]). In addition, let us define αt = 2 γ βt . Since our goal is to maximize the αt dual, we can restrict λt to take the form λt = βt vt = 2 γ vt , and get that D(λ1 , . . . , λT ) = −c f − 1 c T βt vt t=1 =− 1 log 2γ 1 m m e− PT t=1 αt yi ht (xi ) . (15) i=1 Minimizing the exp-loss of the strong hypothesis is therefore the dual problem of the following primal minimization problem: find a distribution over the examples, whose relative entropy to the uniform distribution is as small as possible while the correlation of the distribution with each vt is as small as possible. Since the correlation of w with vt is inversely proportional to the error of ht with respect to w, we obtain that in the primal problem we are trying to maximize the error of each individual hypothesis, while in the dual problem we minimize the global error of the strong hypothesis. 
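To make the primal-dual correspondence concrete, the following sketch computes the booster's primal weight vector w_t = grad f*(theta_t) and the dual objective of Eq. (15) for the entropic complexity function of Example 2. For this f, grad f* is the softmax map, so the primal weights are exactly the exponential-weights distribution over examples used by AdaBoost. Array names and shapes are our own illustrative assumptions, not notation from the paper.

```python
import numpy as np

def booster_weights(alphas, V):
    """Primal weights w_t = grad f*(theta_t), with theta_t = -sum_j alpha_j v_j.

    V is a (t x m) array whose rows are v_1, ..., v_t, where v_{j,i} = y_i h_j(x_i).
    """
    theta = -np.dot(alphas, V)
    w = np.exp(theta - theta.max())          # numerically stabilized softmax
    return w / w.sum()

def dual_value(alphas, V, gamma):
    """Dual objective of Eq. (15): -(1/(2*gamma)) * log( (1/m) * sum_i exp(theta_i) )."""
    theta = -np.dot(alphas, V)
    tmax = theta.max()
    log_mean_exp = tmax + np.log(np.mean(np.exp(theta - tmax)))
    return -(1.0 / (2.0 * gamma)) * log_mean_exp
```

Increasing the dual value by tuning the coefficients alpha_1, ..., alpha_T is, by Eq. (15), the same as decreasing the average exp-loss of the strong hypothesis.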
The intuition of finding distributions which in retrospect result in large error rates of individual hypotheses was also alluded in [15, 8]. We can now apply our algorithmic framework from Sec. 4 to boosting. We describe the game αt with the parameters αt , where αt ∈ [0, 2 γ], and underscore that in our case, λt = 2 γ vt . At the beginning of the game the booster sets all dual variables to be zero, ∀t αt = 0. At trial t of the boosting game, the booster first constructs a primal weight vector wt ∈ Rm , which assigns importance weights to the examples in the training set. The primal vector wt is constructed as in Eq. (6), that is, wt = f (θ t ), where θ t = − i αi vi . Then, the weak learner responds by presenting the loss function gt (w) = [ w, vt ]+ . Finally, the booster updates the dual variables so as to increase the dual objective function. It is possible to show that if the range of ht is {+1, −1} 1 then the update given in Eq. (10) is equivalent to the update αt = min{2 γ, 2 log((1 − t )/ t )}. We have thus obtained a variant of AdaBoost in which the weights αt are capped above by 2 γ. A disadvantage of this variant is that we need to know the parameter γ. We would like to note in passing that this limitation can be lifted by a different definition of the functions gt . We omit the details due to the lack of space. To analyze our game of boosting, we note that the conditions given in Lemma 3 holds T and therefore the left-hand side inequality given in Lemma 3 tells us that t=1 gt (wt ) − T T +1 T +1 1 2 , . . . , λT ) . The definition of gt and the weak learnability ast=1 λt ∞ ≤ D(λ1 2c sumption given in Eq. (13) imply that wt , vt ≥ 2 γ for all t. Thus, gt (wt ) = wt , vt ≥ 2 γ which also implies that λt = vt . Recall that vt,i = yi ht (xi ). Assuming that the range of ht is [+1, −1] we get that λt ∞ ≤ 1. Combining all the above with the left-hand side inequality T given in Lemma 3 we get that 2 T γ − 2 c ≤ D(λT +1 , . . . , λT +1 ). Using the definition of D (see 1 T Eq. (15)), the value c = 1/(2 γ), and rearranging terms we recover the original bound for AdaBoost PT 2 m 1 −yi t=1 αt ht (xi ) ≤ e−2 γ T . i=1 e m 8 Related Work and Discussion We presented a new framework for designing and analyzing algorithms for playing convex repeated games. Our framework was used for the analysis of known algorithms for both online learning and boosting settings. The framework also paves the way to new algorithms. In a previous paper [17], we suggested the use of duality for the design of online algorithms in the context of mistake bound analysis. The contribution of this paper over [17] is three fold as we now briefly discuss. First, we generalize the applicability of the framework beyond the specific setting of online learning with the hinge-loss to the general setting of convex repeated games. The setting of convex repeated games was formally termed “online convex programming” by Zinkevich [19] and was first presented by Gordon in [9]. There is voluminous amount of work on unifying approaches for deriving online learning algorithms. We refer the reader to [11, 12, 3] for work closely related to the content of this paper. By generalizing our previously studied algorithmic framework [17] beyond online learning, we can automatically utilize well known online learning algorithms, such as the EG and p-norm algorithms [12, 11], to the setting of online convex programming. 
We would like to note that the algorithms presented in [19] can be derived as special cases of our algorithmic framework 1 by setting f (w) = 2 w 2 . Parallel and independently to this work, Gordon [10] described another algorithmic framework for online convex programming that is closely related to the potential based algorithms described by Cesa-Bianchi and Lugosi [3]. Gordon also considered the problem of defining appropriate potential functions. Our work generalizes some of the theorems in [10] while providing a somewhat simpler analysis. Second, the usage of generalized Fenchel duality rather than the Lagrange duality given in [17] enables us to analyze boosting algorithms based on the framework. Many authors derived unifying frameworks for boosting algorithms [13, 8, 4]. Nonetheless, our general framework and the connection between game playing and Fenchel duality underscores an interesting perspective of both online learning and boosting. We believe that this viewpoint has the potential of yielding new algorithms in both domains. Last, despite the generality of the framework introduced in this paper, the resulting analysis is more distilled than the earlier analysis given in [17] for two reasons. (i) The usage of Lagrange duality in [17] is somehow restricted while the notion of generalized Fenchel duality is more appropriate to the general and broader problems we consider in this paper. (ii) The strongly convex property we employ both simplifies the analysis and enables more intuitive conditions in our theorems. There are various possible extensions of the work that we did not pursue here due to the lack of space. For instanc, our framework can naturally be used for the analysis of other settings such as repeated games (see [7, 19]). The applicability of our framework to online learning can also be extended to other prediction problems such as regression and sequence prediction. Last, we conjecture that our primal-dual view of boosting will lead to new methods for regularizing boosting algorithms, thus improving their generalization capabilities. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. M. Collins, R.E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 2002. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. JMLR, 7, Mar 2006. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995. Y. Freund and R.E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996. J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 2000. G. Gordon. Regret bounds for prediction problems. In COLT, 1999. G. Gordon. No-regret algorithms for online convex programs. In NIPS, 2006. A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 2001. J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Journal of Machine Learning, 45(3),2001. L. Mason, J. Baxter, P. Bartlett, and M. Frean. 
Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999. Y. Nesterov. Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2005. R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1–40, 1999. S. Shalev-Shwartz and Y. Singer. Convex repeated games and fenchel duality. Technical report, The Hebrew University, 2006. S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In COLT, 2006. J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In ESANN, April 1999. M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.

6 0.084708713 13 nips-2006-A Scalable Machine Learning Approach to Go

7 0.05504607 50 nips-2006-Chained Boosting

8 0.054870807 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

9 0.052948248 146 nips-2006-No-regret Algorithms for Online Convex Programs

10 0.05225502 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

11 0.05085193 141 nips-2006-Multiple timescales and uncertainty in motor adaptation

12 0.049626011 193 nips-2006-Tighter PAC-Bayes Bounds

13 0.046458393 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions

14 0.044667281 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds

15 0.044330057 41 nips-2006-Bayesian Ensemble Learning

16 0.044164997 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions

17 0.0439303 186 nips-2006-Support Vector Machines on a Budget

18 0.043006994 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data

19 0.042179082 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games

20 0.041538343 116 nips-2006-Learning from Multiple Sources


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.132), (1, 0.015), (2, -0.108), (3, -0.086), (4, 0.029), (5, -0.011), (6, -0.096), (7, 0.114), (8, -0.091), (9, 0.057), (10, -0.022), (11, 0.082), (12, 0.049), (13, 0.03), (14, 0.008), (15, -0.013), (16, 0.009), (17, -0.015), (18, 0.029), (19, -0.008), (20, 0.061), (21, 0.052), (22, -0.069), (23, -0.009), (24, 0.005), (25, 0.004), (26, 0.031), (27, 0.108), (28, 0.015), (29, -0.044), (30, -0.064), (31, -0.06), (32, 0.114), (33, 0.031), (34, -0.052), (35, 0.089), (36, 0.052), (37, -0.051), (38, -0.043), (39, -0.003), (40, 0.012), (41, -0.085), (42, -0.154), (43, -0.105), (44, 0.083), (45, 0.068), (46, 0.086), (47, 0.009), (48, -0.059), (49, 0.174)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93581104 155 nips-2006-Optimal Single-Class Classification Strategies

Author: Ran El-Yaniv, Mordechai Nisenson

Abstract: We consider single-class classification (SCC) as a two-person game between the learner and an adversary. In this game the target distribution is completely known to the learner and the learner’s goal is to construct a classifier capable of guaranteeing a given tolerance for the false-positive error while minimizing the false negative error. We identify both “hard” and “soft” optimal classification strategies for different types of games and demonstrate that soft classification can provide a significant advantage. Our optimal strategies and bounds provide worst-case lower bounds for standard, finite-sample SCC and also motivate new approaches to solving SCC.

2 0.54752952 13 nips-2006-A Scalable Machine Learning Approach to Go

Author: Lin Wu, Pierre F. Baldi

Abstract: Go is an ancient board game that poses unique opportunities and challenges for AI and machine learning. Here we develop a machine learning approach to Go, and related board games, focusing primarily on the problem of learning a good evaluation function in a scalable way. Scalability is essential at multiple levels, from the library of local tactical patterns, to the integration of patterns across the board, to the size of the board itself. The system we propose is capable of automatically learning the propensity of local patterns from a library of games. Propensity and other local tactical information are fed into a recursive neural network, derived from a Bayesian network architecture. The network integrates local information across the board and produces local outputs that represent local territory ownership probabilities. The aggregation of these probabilities provides an effective strategic evaluation function that is an estimate of the expected area at the end (or at other stages) of the game. Local area targets for training can be derived from datasets of human games. A system trained using only 9 × 9 amateur game data performs surprisingly well on a test set derived from 19 × 19 professional game data. Possible directions for further improvements are briefly discussed. 1

3 0.53151745 26 nips-2006-An Approach to Bounded Rationality

Author: Eli Ben-Sasson, Ehud Kalai, Adam Kalai

Abstract: A central question in game theory and artificial intelligence is how a rational agent should behave in a complex environment, given that it cannot perform unbounded computations. We study strategic aspects of this question by formulating a simple model of a game with additional costs (computational or otherwise) for each strategy. First we connect this to zero-sum games, proving a counter-intuitive generalization of the classic min-max theorem to zero-sum games with the addition of strategy costs. We then show that potential games with strategy costs remain potential games. Both zero-sum and potential games with strategy costs maintain a very appealing property: simple learning dynamics converge to equilibrium. 1 The Approach and Basic Model How should an intelligent agent play a complicated game like chess, given that it does not have unlimited time to think? This question reflects one fundamental aspect of “bounded rationality,” a term coined by Herbert Simon [1]. However, bounded rationality has proven to be a slippery concept to formalize (prior work has focused largely on finite automata playing simple repeated games such as prisoner’s dilemma, e.g. [2, 3, 4, 5]). This paper focuses on the strategic aspects of decisionmaking in complex multi-agent environments, i.e., on how a player should choose among strategies of varying complexity, given that its opponents are making similar decisions. Our model applies to general strategic games and allows for a variety of complexities that arise in real-world applications. For this reason, it is applicable to one-shot games, to extensive games, and to repeated games, and it generalizes existing models such as repeated games played by finite automata. To easily see that bounded rationality can drastically affect the outcome of a game, consider the following factoring game. Player 1 chooses an n-bit number and sends it to Player 2, who attempts to find its prime factorization. If Player 2 is correct, he is paid 1 by Player 1, otherwise he pays 1 to Player 1. Ignoring complexity costs, the game is a trivial win for Player 2. However, for large n, the game should is essentially a win for Player 1, who can easily output a large random number that Player 2 cannot factor (under appropriate complexity assumptions). In general, the outcome of a game (even a zero-sum game like chess) with bounded rationality is not so clear. To concretely model such games, we consider a set of available strategies along with strategy costs. Consider an example of two players preparing to play a computerized chess game for $100K prize. Suppose the players simultaneously choose among two available options: to use a $10K program A or an advanced program B, which costs $50K. We refer to the row chooser as white and to the column chooser as black, with the corresponding advantages reflected by the win probabilities of white described in Table 1a. For example, when both players use program A, white wins 55% of the time and black wins 45% of the time (we ignore draws). The players naturally want to choose strategies to maximize their expected net payoffs, i.e., their expected payoff minus their cost. Each cell in Table 1b contains a pair of payoffs in units of thousands of dollars; the first is white’s net expected payoff and the second is black’s. a) A B A 55% 93% B 13% 51% b) A (-10) B (-50) A (-10) 45, 35 43,-3 B (-50) 3, 37 1,-1 Figure 1: a) Table of first-player winning probabilities based on program choices. 
b) Table of expected net earnings in thousands of dollars. The unique equilibrium is (A,B) which strongly favors the second player. A surprising property is evident in the above game. Everything about the game seems to favor white. Yet due to the (symmetric) costs, at the unique Nash equilibrium (A,B) of Table 1b, black wins 87% of the time and nets $34K more than white. In fact, it is a dominant strategy for white to play A and for black to play B. To see this, note that playing B increases white’s probability of winning by 38%, independent of what black chooses. Since the pot is $100K, this is worth $38K in expectation, but B costs $40K more than A. On the other hand, black enjoys a 42% increase in probability of winning due to B, independent of what white does, and hence is willing to pay the extra $40K. Before formulating the general model, we comment on some important aspects of the chess example. First, traditional game theory states that chess can be solved in “only” two rounds of elimination of dominated strategies [10], and the outcome with optimal play should always be the same: either a win for white or a win for black. This theoretical prediction fails in practice: in top play, the outcome is very nondeterministic with white winning roughly twice as often as black. The game is too large and complex to be solved by brute force. Second, we have been able to analyze the above chess program selection example exactly because we formulated as a game with a small number of available strategies per player. Another formulation that would fit into our model would be to include all strategies of chess, with some reasonable computational costs. However, it is beyond our means to analyze such a large game. Third, in the example above we used monetary software cost to illustrate a type of strategy cost. But the same analysis could accommodate many other types of costs that can be measured numerically and subtracted from the payoffs, such as time or effort involved in the development or execution of a strategy, and other resource costs. Additional examples in this paper include the number of states in a finite automaton, the number of gates in a circuit, and the number of turns on a commuter’s route. Our analysis is limited, however, to cost functions that depend only on the strategy of the player and not the strategy chosen by its opponent. For example, if our players above were renting computers A or B and paying for the time of actual usage, then the cost of using A would depend on the choice of computer made by the opponent. Generalizing the example above, we consider a normal form game with the addition of strategy costs, a player-dependent cost for playing each available strategy. Our main results regard two important classes of games: constant-sum and potential games. Potential games with strategy costs remain potential games. While two-person constant-sum games are no longer constant, we give a basic structural description of optimal play in these games. Lastly, we show that known learning dynamics converge in both classes of games. 2 Definition of strategy costs We first define an N -person normal-form game G = (N, S, p) consisting of finite sets of (available) pure strategies S = (S1 , . . . , SN ) for the N players, and a payoff function p : S1 × . . . × SN → RN . Players simultaneously choose strategies si ∈ Si after which player i is rewarded with pi (s1 , . . . , sN ). 
A randomized or mixed strategy σi for player i is a probability distribution over its pure strategies Si , σi ∈ ∆i = x ∈ R|Si | : xj = 1, xj ≥ 0 . We extend p to ∆1 × . . . × ∆N in the natural way, i.e., pi (σ1 , . . . , σN ) = E[pi (s1 , . . . , sN )] where each si is drawn from σi , independently. Denote by s−i = (s1 , s2 , . . . , si−1 , si+1 , . . . , sN ) and similarly for σ−i . A best response by player i to σ−i is σi ∈ ∆i such that pi (σi , σ−i ) = maxσi ∈∆i pi (σi , σ−i ). A (mixed strategy) Nash equilibrium of G is a vector of strategies (σ1 , . . . , σN ) ∈ ∆1 × . . . × ∆N such that each σi is a best response to σ−i . We now define G−c , the game G with strategy costs c = (c1 , . . . , cN ), where ci : Si → R. It is simply an N -person normal-form game G−c = (N, S, p−c ) with the same sets of pure strategies as G, but with a new payoff function p−c : S1 × . . . × SN → RN where, p−c (s1 , . . . , sN ) = pi (s1 , . . . , sN ) − ci (si ), for i = 1, . . . , N. i We similarly extend ci to ∆i in the natural way. 3 Two-person constant-sum games with strategy costs Recall that a game is constant-sum (k-sum for short) if at every combination of individual strategies, the players’ payoffs sum to some constant k. Two-person k-sum games have some important properties, not shared by general sum games, which result in more effective game-theoretic analysis. In particular, every k-sum game has a unique value v ∈ R. A mixed strategy for player 1 is called optimal if it guarantees payoff ≥ v against any strategy of player 2. A mixed strategy for player 2 is optimal if it guarantees ≥ k − v against any strategy of player 1. The term optimal is used because optimal strategies guarantee as much as possible (v + k − v = k) and playing anything that is not optimal can result in a lesser payoff, if the opponent responds appropriately. (This fact is easily illustrated in the game rock-paper-scissors – randomizing uniformly among the strategies guarantees each player 50% of the pot, while playing anything other than uniformly random enables the opponent to win strictly more often.) The existence of optimal strategies for both players follows from the min-max theorem. An easy corollary is that the Nash equilibria of a k-sum game are exchangeable: they are simply the cross-product of the sets of optimal mixed strategies for both players. Lastly, it is well-known that equilibria in two-person k-sum games can be learned in repeated play by simple dynamics that are guaranteed to converge [17]. With the addition of strategy costs, a k-sum game is no longer k-sum and hence it is not clear, at first, what optimal strategies there are, if any. (Many examples of general-sum games do not have optimal strategies.) We show the following generalization of the above properties for zero-sum games with strategies costs. Theorem 1. Let G be a finite two-person k-sum game and G−c be the game with strategy costs c = (c1 , c2 ). 1. There is a value v ∈ R for G−c and nonempty sets OPT1 and OPT2 of optimal mixed strategies for the two players. OPT1 is the set of strategies that guarantee player 1 payoff ≥ v − c2 (σ2 ), against any strategy σ2 chosen by player 2. Similarly, OPT2 is the set of strategies that guarantee player 2 payoff ≥ k − v − c1 (σ1 ) against any σ1 . 2. The Nash equilibria of G−c are exchangeable: the set of Nash equilibria is OPT1 ×OPT2 . 3. The set of net payoffs possible at equilibrium is an axis-parallel rectangle in R2 . 
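Theorem 1 also suggests a direct computational recipe: transfer each player's strategy cost to its opponent to form the auxiliary k-sum game H, then solve H as an ordinary zero-sum game. The sketch below does this with a standard linear program; scipy is assumed to be available, and the chess payoff numbers in the usage lines are reconstructed from the worked example in the introduction, so treat them as illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def optimal_strategy_with_costs(P1, c1, c2):
    """Row player's optimal mixed strategy in the game with strategy costs G_{-c}.

    P1: (n x m) payoff matrix of the row player in the underlying k-sum game G.
    c1: length-n vector of row-strategy costs; c2: length-m vector of column costs.
    """
    # Row player's payoffs in the auxiliary game H: q1(i, j) = p1(i, j) - c1(i) + c2(j).
    Q1 = P1 - c1[:, None] + c2[None, :]
    n, m = Q1.shape
    # Maximize v subject to (x^T Q1)_j >= v for all j, sum(x) = 1, x >= 0.
    # Variables are (x_1, ..., x_n, v); linprog minimizes, so the objective is -v.
    obj = np.zeros(n + 1); obj[-1] = -1.0
    A_ub = np.hstack([-Q1.T, np.ones((m, 1))])        # encodes v - (x^T Q1)_j <= 0
    b_ub = np.zeros(m)
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    x, v = res.x[:n], res.x[-1]
    return x, v   # x is optimal for player 1 in G_{-c}; v is the value of H

# The chess-program example from the introduction (payoffs in $1K, pot of 100):
# white's winning probabilities are 55%/13% with program A and 93%/51% with B.
P1 = 100.0 * np.array([[0.55, 0.13], [0.93, 0.51]])
costs = np.array([10.0, 50.0])
x_white, v = optimal_strategy_with_costs(P1, c1=costs, c2=costs)
```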
For zero-sum games, the term optimal strategy was natural: the players could guarantee v and k − v, respectively, and this is all that there was to share. Moreover, it is easy to see that only pairs of optimal strategies can have the Nash equilibria property, being best responses to each other. In the case of zero-sum games with strategy costs, the optimal structure is somewhat counterintuitive. First, it is strange that the amount guaranteed by either player depends on the cost of the other player’s action, when in reality each player pays the cost of its own action. Second, it is not even clear why we call these optimal strategies. To get a feel for this latter issue, notice that the sum of the net payoffs to the two players is always k − c1 (σ1 ) − c2 (σ2 ), which is exactly the total of what optimal strategies guarantee, v − c2 (σ2 ) + k − v − c1 (σ1 ). Hence, if both players play what we call optimal strategies, then neither player can improve and they are at Nash equilibrium. On the other hand, suppose player 1 selects a strategy σ1 that does not guarantee him payoff at least v − c2 (σ2 ). This means that there is some response σ2 by player 2 for which player 1’s payoff is < v − c2 (σ2 ) and hence player 2’s payoff is > k − v − c1 (σ1 ). Thus player 2’s best response to σ1 must give player 2 payoff > k − v − c1 (σ1 ) and leave player 1 with < v − c2 (σ2 ). The proof of the theorem (the above reasoning only implies part 2 from part 1) is based on the following simple observation. Consider the k-sum game H = (N, S, q) with the following payoffs: q1 (s1 , s2 ) = p1 (s1 , s2 ) − c1 (s1 ) + c2 (s2 ) = p−c (s1 , s2 ) + c2 (s2 ) 1 q2 (s1 , s2 ) = p2 (s1 , s2 ) − c2 (s1 ) + c1 (s1 ) = p−c (s1 , s2 ) + c1 (s1 ) 2 That is to say, Player 1 pays its strategy cost to Player 2 and vice versa. It is easy to verify that, ∀σ1 , σ1 ∈ ∆1 , σ2 ∈ ∆2 q1 (σ1 , σ2 ) − q1 (σ1 , σ2 ) = p−c (σ1 , σ2 ) − p−c (σ1 , σ2 ) 1 1 (1) This means that the relative advantage in switching strategies in games G−c and H are the same. In particular, σ1 is a best response to σ2 in G−c if and only if it is in H. A similar equality holds for player 2’s payoffs. Note that these conditions imply that the games G−c and H are strategically equivalent in the sense defined by Moulin and Vial [16]. Proof of Theorem 1. Let v be the value of the game H. For any strategy σ1 that guarantees player 1 payoff ≥ v in H, σ1 guarantees player 1 ≥ v − c2 (σ2 ) in G−c . This follows from the definition of H. Similarly, any strategy σ2 that guarantees player 2 payoff ≥ k − v in H will guarantee ≥ k − v − c1 (σ1 ) in G−c . Thus the sets OPT1 and OPT2 are non-empty. Since v − c2 (σ2 ) + k − v − c1 (σ1 ) = k − c1 (σ1 ) − c2 (σ2 ) is the sum of the payoffs in G−c , nothing greater can be guaranteed by either player. Since the best responses of G−c and H are the same, the Nash equilibria of the two games are the same. Since H is a k-sum game, its Nash equilibria are exchangeable, and thus we have part 2. (This holds for any game that is strategically equivalent to k-sum.) Finally, the optimal mixed strategies OPT1 , OPT2 of any k-sum game are convex sets. If we look at the achievable costs of the mixed strategies in OPTi , by the definition of the cost of a mixed strategy, this will be a convex subset of R, i.e., an interval. By parts 1 and 2, the set of achievable net payoffs at equilibria of G−c are therefore the cross-product of intervals. To illustrate Theorem 1 graphically, Figure 2 gives a 4 × 4 example with costs of 1, 2, 3, and 4, respectively. 
It illustrates a situation with multiple optimal strategies. Notice that player 1 is completely indifferent between its optimal choices A and B, and player 2 is completely indifferent between C and D. Thus the only question is how kind they would like to be to their opponent. The (A,C) equilibrium is perhaps most natural as it is yields the highest payoffs for both parties. Note that the proof of the above theorem actually shows that zero-sum games with costs share additional appealing properties of zero-sum games. For example, computing optimal strategies is a polynomial time-computation in an n × n game, as it amounts to computing the equilibria of H. We next show that they also have appealing learning properties, though they do not share all properties of zero-sum games.1 3.1 Learning in repeated two-person k-sum games with strategy costs Another desirable property of k-sum games is that, in repeated play, natural learning dynamics converge to the set of Nash equilibria. Before we state the analogous conditions for k-sum games with costs, we briefly give a few definitions. A repeated game is one in which players chooses a sequence of strategies vectors s1 , s2 , . . ., where each st = (st , . . . , st ) is a strategy vector of some 1 N fixed stage game G = (N, S, p). Under perfect monitoring, when selecting an action in any period the players know all the previous selected actions.As we shall discuss, it is possible to learn to play without perfect monitoring as well. 1 One property that is violated by the chess example is the “advantage of an advantage” property. Say Player 1 has the advantage over Player 2 in a square game if p1 (s1 , s2 ) ≥ p2 (s2 , s1 ) for all strategies s1 , s2 . At equilibrium of a k-sum game, a player with the advantage must have a payoff at least as large as its opponent. This is no longer the case after incorporating strategy costs, as seen in the chess example, where Player 1 has the advantage (even including strategy costs), yet his equilibrium payoff is smaller than 2’s. a) A B C D A 6, 4 7, 3 7.5, 2.5 8.5, 1.5 B 5, 5 6, 4 6.5, 3.5 7, 3 C 3, 7 4, 6 4.5, 5.5 5.5, 4.5 D 2, 8 3, 7 3.5, 6.5 4.5, 5.5 b) A (-1) B (-2) C (-3) D (-4) A (-1) 5, 3 5, 2 4.5, 1.5 4.5, 0.5 B (-2) 4, 3 4, 2 3.5, 1.5 3, 1 C (-3) 2, 4 2, 3 1.5, 2.5 1.5, 1.5 D (-4) 1, 4 1, 3 0.5, 2.5 0.5, 1.5 PLAYER 2 NET PAYOFF Nash Eq. A,D value A,C B,D B,C C,D C,C D,D D,C A,B A,A B,B B,A C,B C,A D,B D,A PLAYER 1 NET PAYOFF Figure 2: a) Payoffs in 10-sum game G. b) Expected net earnings in G−c . OPT1 is any mixture of A and B, and OPT2 is any mixture of C and D. Each player’s choice of equilibrium strategy affects only the opponent’s net payoff. c) A graphical display of the payoff pairs. The shaded region shows the rectangular set of payoffs achievable at mixed strategy Nash equilibria. Perhaps the most intuitive dynamics are best-response: at each stage, each player selects a best response to the opponent’s previous stage play. Unfortunately, these naive dynamics fails to converge to equilibrium in very simple examples. The fictitious play dynamics prescribe, at stage t, selecting any strategy that is a best response to the empirical distribution of opponent’s play during the first t − 1 stages. It has been shown that fictitious play converges to equilibrium (of the stage game G) in k-sum games [17]. However, fictitious play requires perfect monitoring. One can learn to play a two-person k-sum game with no knowledge of the payoff table or anything about the other players actions. 
Using experimentation, the only observations required by each player are its own payoffs in each period (in addition to the number of available actions). So-called bandit algorithms [7] must manage the exploration-exploitation tradeoff. The proof of their convergence follows from the fact that they are no-regret algorithms. (No-regret algorithms date back to Hannan in the 1950’s [12], but his required perfect monitoring). The regret of a player i at stage T is defined to be, T regret of i at T = 1 max pi (si , st ) − pi (st , st ) , −i i −i T si ∈Si t=1 that is, how much better in hindsight player i could have done on the first T stages had it used one fixed strategy the whole time (and had the opponents not changed their strategies). Note that regret can be positive or negative. A no-regret algorithm is one in which each player’s asymptotic regret converges to (−∞, 0], i.e., is guaranteed to approach 0 or less. It is well-known that noregret condition in two-person k-sum games implies convergence to equilibrium (see, e.g., [13]). In particular, the pair of mixed strategies which are the empirical distributions of play over time approaches the set of Nash equilibrium of the stage game. Inverse-polynomial rates of convergence (that are polynomial also in the size of the game) can be given for such algorithms. Hence no-regret algorithms provide arguably reasonable ways to play a k-sum game of moderate size. Note that in general-sum games, no such dynamics are known. Fortunately, the same algorithm that works for learning in k-sum games seem to work for learning in such games with strategy costs. Theorem 2. Fictitious play converges to the set of Nash equilibria of the stage game in a two-person k-sum game with strategy costs, as do no-regret learning dynamics. Proof. The proof again follows from equation (1) regarding the game H. Fictitious play dynamics are defined only in terms of best response play. Since G−c and H share the same best responses, fictitious play dynamics are identical for the two games. Since they share the same equilibria and fictitious play converges to equilibria in H, it must converge in G−c as well. For no-regret algorithms, equation (1) again implies that for any play sequence, the regret of each player i with respect to game G−c is the same as its regret with respect to the game H. Hence, no regret in G−c implies no regret in H. Since no-regret algorithms converge to the set of equilibria in k-sum games, they converge to the set of equilibria in H and therefore in G−c as well. 4 Potential games with strategic costs Let us begin with an example of a potential game, called a routing game [18]. There is a fixed directed graph with n nodes and m edges. Commuters i = 1, 2, . . . , N each decide on a route πi , to take from their home si to their work ti , where si and ti are nodes in the graph. For each edge, uv, let nuv be the number of commuters whose path πi contains edge uv. Let fuv : Z → R be a nonnegative monotonically increasing congestion function. Player i’s payoff is − uv∈πi fuv (nuv ), i.e., the negative sum of the congestions on the edges in its path. An N -person normal form game G is said to be a potential game [15] if there is some potential function Φ : S1 × . . . SN → R such that changing a single player’s action changes its payoff by the change in the potential function. That is, there exists a single function Φ, such that for all players i and all pure strategy vectors s, s ∈ S1 × . . . 
× SN that differ only in the ith coordinate, pi (s) − pi (s ) = Φ(s) − Φ(s ). (2) Potential games have appealing learning properties: simple better-reply dynamics converge to purestrategy Nash equilibria, as do the more sophisticated fictitious-play dynamics described earlier [15]. In our example, this means that if players change their individual paths so as to selfishly reduce the sum of congestions on their path, this will eventually lead to an equilibrium where no one can improve. (This is easy to see because Φ keeps increasing.) The absence of similar learning properties for general games presents a frustrating hole in learning and game theory. It is clear that the theoretically clean commuting example above misses some realistic considerations. One issue regarding complexity is that most commuters would not be willing to take a very complicated route just to save a short amount of time. To model this, we consider potential games with strategy costs. In our example, this would be a cost associated with every path. For example, suppose the graph represented streets in a given city. We consider a natural strategy complexity cost associated with a route π, say λ(#turns(π))2 , where there is a parameter λ ∈ R and #turns(π) is defined as the number of times that a commuter has to turn on a route. (To be more precise, say each edge in the graph is annotated with a street name, and a turn is defined to be a pair of consecutive edges in the graph with different street names.) Hence, a best response for player i would minimize: π min (total congestion of π) + λ(#turns(π))2 . from si to ti While adding strategy costs to potential games allows for much more flexibility in model design, one might worry that appealing properties of potential games, such as having pure strategy equilibria and easy learning dynamics, no longer hold. This is not the case. We show that strategic costs fit easily into the potential game framework: Theorem 3. For any potential game G and any cost functions c, G−c is also a potential game. Proof. Let Φ be a potential function for G. It is straightforward to verify that the G−c admits the following potential function Φ : Φ (s1 , . . . , sN ) = Φ(s1 , . . . , sN ) − c1 (s1 ) − . . . − cN (sN ). 5 Additional remarks Part of the reason that the notion of bounded rationality is so difficult to formalize is that understanding enormous games like chess is a daunting proposition. That is why we have narrowed it down to choosing among a small number of available programs. A game theorist might begin by examining the complete payoff table of Figure 1a, which is prohibitively large. Instead of considering only the choices of programs A and B, each player considers all possible chess strategies. In that sense, our payoff table in 1a would be viewed as a reduction of the “real” normal form game. A computer scientist, on the other hand, may consider it reasonable to begin with the existing strategies that one has access to. Regardless of how you view the process, it is clear that for practical purposes players in real life do simplify and analyze “smaller” sets of strategies. Even if the players consider the option of engineering new chess-playing software, this can be viewed as a third strategy in the game, with its own cost and expected payoffs. Again, when considering small number of available strategies, like the two programs above, it may still be difficult to assess the expected payoffs that result when (possibly randomized) strategies play against each other. 
An additional assumption made throughout the paper is that the players share the same assessments about these expected payoffs. Like other common-knowledge assumptions made in game theory, it would be desirable to weaken this assumption. In the special families of games studied in this paper, and perhaps in additional cases, learning algorithms may be employed to reach equilibrium without knowledge of payoffs. 5.1 Finite automata playing repeated games There has been a large body of interesting work on repeated games played by finite automata (see [14] for a survey). Much of this work is on achieving cooperation in the classic prisoner’s dilemma game (e.g., [2, 3, 4, 5]). Many of these models can be incorporated into the general model outlined in this paper. For example, to view the Abreu and Rubinstein model [6] as such, consider the normal form of an infinitely repeated game with discounting, but restricted to strategies that can be described by finite automata (the payoffs in every cell of the payoff table are the discounted sums of the infinite streams of payoffs obtained in the repeated game). Let the cost of a strategy be an increasing function of the number of states it employs. For Neyman’s model [3], consider the normal form of a finitely repeated game with a known number of repetitions. You may consider strategies in this normal form to be only ones with a bounded number of states, as required by Neyman, and assign zero cost to all strategies. Alternatively, you may allow all strategies but assign zero cost to ones that employ number of states below Neyman’s bounds, and an infinite cost to strategies that employ a number of states that exceeds Neyman’s bounds. The structure of equilibria proven in Theorem 1 applies to all the above models when dealing with repeated k-sum games, as in [2]. 6 Future work There are very interesting questions to answer about bounded rationality in truly large games that we did not touch upon. For example, consider the factoring game from the introduction. A pure strategy for Player 1 would be outputting a single n-bit number. A pure strategy for Player 2 would be any factoring program, described by a circuit that takes as input an n-bit number and attempts to output a representation of its prime factorization. The complexity of such a strategy would be an increasing function of the number of gates in the circuit. It would be interesting to make connections between asymptotic algorithm complexity and games. Another direction regards an elegant line of work on learning to play correlated equilibria by repeated play [11]. It would be natural to consider how strategy costs affect correlated equilibria. Finally, it would also be interesting to see how strategy costs affect the so-called “price of anarchy” [19] in congestion games. Acknowledgments This work was funded in part by a U.S. NSF grant SES-0527656, a Landau Fellowship supported by the Taub and Shalom Foundations, a European Community International Reintegration Grant, an Alon Fellowship, ISF grant 679/06, and BSF grant 2004092. Part of this work was done while the first and second authors were at the Toyota Technological Institute at Chicago. References [1] H. Simon. The sciences of the artificial. MIT Press, Cambridge, MA, 1969. [2] E. Ben-Porath. Repeated games with finite automata, Journal of Economic Theory 59: 17–32, 1993. [3] A. Neyman. Bounded Complexity Justifies Cooperation in the Finitely Repeated Prisoner’s Dilemma. Economic Letters, 19: 227–229, 1985. [4] A. Rubenstein. 
Finite automata play the repeated prisoner's dilemma. Journal of Economic Theory, 39:83–96, 1986. [5] C. Papadimitriou and M. Yannakakis. On complexity as bounded rationality. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, pp. 726–733, 1994. [6] D. Abreu and A. Rubinstein. The Structure of Nash Equilibrium in Repeated Games with Finite Automata. Econometrica, 56:1259–1281, 1988. [7] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing, 32(1):48–77, 2002. [8] X. Chen, X. Deng, and S. Teng. Computing Nash Equilibria: Approximation and smoothed complexity. Electronic Colloquium on Computational Complexity, Report TR06-023, 2006. [9] K. Daskalakis, P. Goldberg, and C. Papadimitriou. The complexity of computing a Nash equilibrium. Electronic Colloquium on Computational Complexity, Report TR05-115, 2005. [10] C. Ewerhart. Chess-like Games Are Dominance Solvable in at Most Two Steps. Games and Economic Behavior, 33:41–47, 2000. [11] D. Foster and R. Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 21:40–55, 1997. [12] J. Hannan. Approximation to Bayes risk in repeated play. In M. Dresher, A. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume 3, pp. 97–139. Princeton University Press, 1957. [13] S. Hart and A. Mas-Colell. A General Class of Adaptive Strategies. Journal of Economic Theory, 98(1):26–54, 2001. [14] E. Kalai. Bounded rationality and strategic complexity in repeated games. In T. Ichiishi, A. Neyman, and Y. Tauman, editors, Game Theory and Applications, pp. 131–157. Academic Press, San Diego, 1990. [15] D. Monderer and L. Shapley. Potential games. Games and Economic Behavior, 14:124–143, 1996. [16] H. Moulin and P. Vial. Strategically Zero Sum Games: The Class of Games Whose Completely Mixed Equilibria Cannot Be Improved Upon. International Journal of Game Theory, 7:201–221, 1978. [17] J. Robinson. An iterative method of solving a game. Annals of Mathematics, 54:296–301, 1951. [18] R. Rosenthal. A Class of Games Possessing Pure-Strategy Nash Equilibria. International Journal of Game Theory, 2:65–67, 1973. [19] E. Koutsoupias and C. Papadimitriou. Worst-case equilibria. In Proceedings of the 16th Annual Symposium on Theoretical Aspects of Computer Science, pp. 404–413, 1999.
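As a complement to Section 5.1, the following sketch (our own illustration, not code from the paper) evaluates finite-automaton strategies in a discounted repeated prisoner's dilemma and charges each machine a cost that grows with its number of states, in the spirit of the Abreu–Rubinstein construction described above. The payoff matrix, discount factor, horizon truncation, and per-state cost are all assumptions made for the example.

# Prisoner's dilemma payoffs for the row player (an illustrative choice).
PD = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
DELTA = 0.9          # discount factor (assumed)
STATE_COST = 0.05    # cost per automaton state (assumed)

class Automaton:
    def __init__(self, states, start, output, trans):
        self.states, self.start = states, start
        # output[q] -> action played in state q; trans[(q, opponent_action)] -> next state
        self.output, self.trans = output, trans

TIT_FOR_TAT = Automaton(2, 0, {0: "C", 1: "D"},
                        {(0, "C"): 0, (0, "D"): 1, (1, "C"): 0, (1, "D"): 1})
ALWAYS_DEFECT = Automaton(1, 0, {0: "D"}, {(0, "C"): 0, (0, "D"): 0})

def discounted_payoffs(m1, m2, horizon=2000):
    """(1 - DELTA)-normalized discounted payoffs of the repeated game,
    truncated at `horizon` as an approximation of the infinite stream."""
    q1, q2, u1, u2 = m1.start, m2.start, 0.0, 0.0
    for t in range(horizon):
        a1, a2 = m1.output[q1], m2.output[q2]
        u1 += (1 - DELTA) * DELTA ** t * PD[(a1, a2)]
        u2 += (1 - DELTA) * DELTA ** t * PD[(a2, a1)]
        q1, q2 = m1.trans[(q1, a2)], m2.trans[(q2, a1)]
    return u1, u2

def net_payoffs(m1, m2):
    """Game payoff minus a state-count strategy cost, as in Section 5.1."""
    u1, u2 = discounted_payoffs(m1, m2)
    return u1 - STATE_COST * m1.states, u2 - STATE_COST * m2.states

print(net_payoffs(TIT_FOR_TAT, ALWAYS_DEFECT))
print(net_payoffs(TIT_FOR_TAT, TIT_FOR_TAT))

With these numbers, tit-for-tat against itself nets roughly 3 minus its state cost, while against always-defect it is dragged down to mutual defection after the first round; varying STATE_COST shows how the cost term can break ties between otherwise payoff-equivalent machines.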

4 0.44523203 61 nips-2006-Convex Repeated Games and Fenchel Duality

Author: Shai Shalev-shwartz, Yoram Singer

Abstract: We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can be all derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization. 1 Introduction and Problem Setting Several problems arising in machine learning can be modeled as a convex repeated game. Convex repeated games are closely related to online convex programming (see [19, 9] and the discussion in the last section). A convex repeated game is a two players game that is performed in a sequence of consecutive rounds. On round t of the repeated game, the first player chooses a vector wt from a convex set S. Next, the second player responds with a convex function gt : S → R. Finally, the first player suffers an instantaneous loss gt (wt ). We study the game from the viewpoint of the first player. The goal of the first player is to minimize its cumulative loss, t gt (wt ). To motivate this rather abstract setting let us first cast the more familiar setting of online learning as a convex repeated game. Online learning is performed in a sequence of consecutive rounds. On round t, the learner first receives a question, cast as a vector xt , and is required to provide an answer for this question. For example, xt can be an encoding of an email message and the question is whether the email is spam or not. The prediction of the learner is performed based on an hypothesis, ht : X → Y, where X is the set of questions and Y is the set of possible answers. In the aforementioned example, Y would be {+1, −1} where +1 stands for a spam email and −1 stands for a benign one. After predicting an answer, the learner receives the correct answer for the question, denoted yt , and suffers loss according to a loss function (ht , (xt , yt )). In most cases, the hypotheses used for prediction come from a parameterized set of hypotheses, H = {hw : w ∈ S}. For example, the set of linear classifiers, which is used for answering yes/no questions, is defined as H = {hw (x) = sign( w, x ) : w ∈ Rn }. Thus, rather than saying that on round t the learner chooses a hypothesis, we can say that the learner chooses a vector wt and its hypothesis is hwt . Next, we note that once the environment chooses a question-answer pair (xt , yt ), the loss function becomes a function over the hypotheses space or equivalently over the set of parameter vectors S. We can therefore redefine the online learning process as follows. On round t, the learner chooses a vector wt ∈ S, which defines a hypothesis hwt to be used for prediction. Then, the environment chooses a questionanswer pair (xt , yt ), which induces the following loss function over the set of parameter vectors, gt (w) = (hw , (xt , yt )). Finally, the learner suffers the loss gt (wt ) = (hwt , (xt , yt )). We have therefore described the process of online learning as a convex repeated game. In this paper we assess the performance of the first player using the notion of regret. Given a number of rounds T and a fixed vector u ∈ S, we define the regret of the first player as the excess loss for not consistently playing the vector u, 1 T T gt (wt ) − t=1 1 T T gt (u) . 
t=1 Our main result is an algorithmic framework for the first player which guarantees low regret with respect to any vector u ∈ S. Specifically, we derive regret bounds that take the following form ∀u ∈ S, 1 T T gt (wt ) − t=1 1 T T gt (u) ≤ t=1 f (u) + L √ , T (1) where f : S → R and L ∈ R+ . Informally, the function f measures the “complexity” of vectors in S and the scalar L is related to some generalized Lipschitz property of the functions g1 , . . . , gT . We defer the exact requirements we impose on f and L to later sections. Our algorithmic framework emerges from a representation of the regret bound given in Eq. (1) using an optimization problem. Specifically, we rewrite Eq. (1) as follows 1 T T gt (wt ) ≤ inf t=1 u∈S 1 T T gt (u) + t=1 f (u) + L √ . T (2) That is, the average loss of the first player should be bounded above by the minimum value of an optimization problem in which we jointly minimize the average loss of u and the “complexity” of u as measured by the function f . Note that the optimization problem on the right-hand side of Eq. (2) can only be solved in hindsight after observing the entire sequence of loss functions. Nevertheless, writing the regret bound as in Eq. (2) implies that the average loss of the first player forms a lower bound for a minimization problem. The notion of duality, commonly used in convex optimization theory, plays an important role in obtaining lower bounds for the minimal value of a minimization problem (see for example [14]). By generalizing the notion of Fenchel duality, we are able to derive a dual optimization problem, which can be optimized incrementally, as the game progresses. In order to derive explicit quantitative regret bounds we make an immediate use of the fact that dual objective lower bounds the primal objective. We therefore reduce the process of playing convex repeated games to the task of incrementally increasing the dual objective function. The amount by which the dual increases serves as a new and natural notion of progress. By doing so we are able to tie the primal objective value, the average loss of the first player, and the increase in the dual. The rest of this paper is organized as follows. In Sec. 2 we establish our notation and point to a few mathematical tools that we use throughout the paper. Our main tool for deriving algorithms for playing convex repeated games is a generalization of Fenchel duality, described in Sec. 3. Our algorithmic framework is given in Sec. 4 and analyzed in Sec. 5. The generality of our framework allows us to utilize it in different problems arising in machine learning. Specifically, in Sec. 6 we underscore the applicability of our framework for online learning and in Sec. 7 we outline and analyze boosting algorithms based on our framework. We conclude with a discussion and point to related work in Sec. 8. Due to the lack of space, some of the details are omitted from the paper and can be found in [16]. 2 Mathematical Background We denote scalars with lower case letters (e.g. x and w), and vectors with bold face letters (e.g. x and w). The inner product between vectors x and w is denoted by x, w . Sets are designated by upper case letters (e.g. S). The set of non-negative real numbers is denoted by R+ . For any k ≥ 1, the set of integers {1, . . . , k} is denoted by [k]. A norm of a vector x is denoted by x . The dual norm is defined as λ = sup{ x, λ : x ≤ 1}. 
For example, the Euclidean norm, x 2 = ( x, x )1/2 is dual to itself and the 1 norm, x 1 = i |xi |, is dual to the ∞ norm, x ∞ = maxi |xi |. We next recall a few definitions from convex analysis. The reader familiar with convex analysis may proceed to Lemma 1 while for a more thorough introduction see for example [1]. A set S is convex if for any two vectors w1 , w2 in S, all the line between w1 and w2 is also within S. That is, for any α ∈ [0, 1] we have that αw1 + (1 − α)w2 ∈ S. A set S is open if every point in S has a neighborhood lying in S. A set S is closed if its complement is an open set. A function f : S → R is closed and convex if for any scalar α ∈ R, the level set {w : f (w) ≤ α} is closed and convex. The Fenchel conjugate of a function f : S → R is defined as f (θ) = supw∈S w, θ − f (w) . If f is closed and convex then the Fenchel conjugate of f is f itself. The Fenchel-Young inequality states that for any w and θ we have that f (w) + f (θ) ≥ w, θ . A vector λ is a sub-gradient of a function f at w if for all w ∈ S we have that f (w ) − f (w) ≥ w − w, λ . The differential set of f at w, denoted ∂f (w), is the set of all sub-gradients of f at w. If f is differentiable at w then ∂f (w) consists of a single vector which amounts to the gradient of f at w and is denoted by f (w). Sub-gradients play an important role in the definition of Fenchel conjugate. In particular, the following lemma states that if λ ∈ ∂f (w) then Fenchel-Young inequality holds with equality. Lemma 1 Let f be a closed and convex function and let ∂f (w ) be its differential set at w . Then, for all λ ∈ ∂f (w ) we have, f (w ) + f (λ ) = λ , w . A continuous function f is σ-strongly convex over a convex set S with respect to a norm · if S is contained in the domain of f and for all v, u ∈ S and α ∈ [0, 1] we have 1 (3) f (α v + (1 − α) u) ≤ α f (v) + (1 − α) f (u) − σ α (1 − α) v − u 2 . 2 Strongly convex functions play an important role in our analysis primarily due to the following lemma. Lemma 2 Let · be a norm over Rn and let · be its dual norm. Let f be a σ-strongly convex function on S and let f be its Fenchel conjugate. Then, f is differentiable with f (θ) = arg maxx∈S θ, x − f (x). Furthermore, for any θ, λ ∈ Rn we have 1 f (θ + λ) − f (θ) ≤ f (θ), λ + λ 2 . 2σ Two notable examples of strongly convex functions which we use are as follows. 1 Example 1 The function f (w) = 2 w norm. Its conjugate function is f (θ) = 2 2 1 2 is 1-strongly convex over S = Rn with respect to the θ 2. 2 2 n 1 Example 2 The function f (w) = i=1 wi log(wi / n ) is 1-strongly convex over the probabilistic n simplex, S = {w ∈ R+ : w 1 = 1}, with respect to the 1 norm. Its conjugate function is n 1 f (θ) = log( n i=1 exp(θi )). 3 Generalized Fenchel Duality In this section we derive our main analysis tool. We start by considering the following optimization problem, T inf c f (w) + t=1 gt (w) , w∈S where c is a non-negative scalar. An equivalent problem is inf w0 ,w1 ,...,wT c f (w0 ) + T t=1 gt (wt ) s.t. w0 ∈ S and ∀t ∈ [T ], wt = w0 . Introducing T vectors λ1 , . . . , λT , each λt ∈ Rn is a vector of Lagrange multipliers for the equality constraint wt = w0 , we obtain the following Lagrangian T T L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) = c f (w0 ) + t=1 gt (wt ) + t=1 λt , w0 − wt . The dual problem is the task of maximizing the following dual objective value, D(λ1 , . . . , λT ) = inf L(w0 , w1 , . . . , wT , λ1 , . . . 
, λT ) w0 ∈S,w1 ,...,wT = − c sup w0 ∈S = −c f −1 c w0 , − 1 c T t=1 T t=1 λt − λt − f (w0 ) − T t=1 gt (λt ) , T t=1 sup ( wt , λt − gt (wt )) wt where, following the exposition of Sec. 2, f , g1 , . . . , gT are the Fenchel conjugate functions of f, g1 , . . . , gT . Therefore, the generalized Fenchel dual problem is sup − cf λ1 ,...,λT −1 c T t=1 λt − T t=1 gt (λt ) . (4) Note that when T = 1 and c = 1, the above duality is the so called Fenchel duality. 4 A Template Learning Algorithm for Convex Repeated Games In this section we describe a template learning algorithm for playing convex repeated games. As mentioned before, we study convex repeated games from the viewpoint of the first player which we shortly denote as P1. Recall that we would like our learning algorithm to achieve a regret bound of the form given in Eq. (2). We start by rewriting Eq. (2) as follows T m gt (wt ) − c L ≤ inf u∈S t=1 c f (u) + gt (u) , (5) t=1 √ where c = T . Thus, up to the sublinear term c L, the cumulative loss of P1 lower bounds the optimum of the minimization problem on the right-hand side of Eq. (5). In the previous section we derived the generalized Fenchel dual of the right-hand side of Eq. (5). Our construction is based on the weak duality theorem stating that any value of the dual problem is smaller than the optimum value of the primal problem. The algorithmic framework we propose is therefore derived by incrementally ascending the dual objective function. Intuitively, by ascending the dual objective we move closer to the optimal primal value and therefore our performance becomes similar to the performance of the best fixed weight vector which minimizes the right-hand side of Eq. (5). Initially, we use the elementary dual solution λ1 = 0 for all t. We assume that inf w f (w) = 0 and t for all t inf w gt (w) = 0 which imply that D(λ1 , . . . , λ1 ) = 0. We assume in addition that f is 1 T σ-strongly convex. Therefore, based on Lemma 2, the function f is differentiable. At trial t, P1 uses for prediction the vector wt = f −1 c T i=1 λt i . (6) After predicting wt , P1 receives the function gt and suffers the loss gt (wt ). Then, P1 updates the dual variables as follows. Denote by ∂t the differential set of gt at wt , that is, ∂t = {λ : ∀w ∈ S, gt (w) − gt (wt ) ≥ λ, w − wt } . (7) The new dual variables (λt+1 , . . . , λt+1 ) are set to be any set of vectors which satisfy the following 1 T two conditions: (i). ∃λ ∈ ∂t s.t. D(λt+1 , . . . , λt+1 ) ≥ D(λt , . . . , λt , λ , λt , . . . , λt ) 1 1 t−1 t+1 T T (ii). ∀i > t, λt+1 = 0 i . (8) In the next section we show that condition (i) ensures that the increase of the dual at trial t is proportional to the loss gt (wt ). The second condition ensures that we can actually calculate the dual at trial t without any knowledge on the yet to be seen loss functions gt+1 , . . . , gT . We conclude this section with two update rules that trivially satisfy the above two conditions. The first update scheme simply finds λ ∈ ∂t and set λt+1 = i λ λt i if i = t if i = t . (9) The second update defines (λt+1 , . . . , λt+1 ) = argmax D(λ1 , . . . , λT ) 1 T λ1 ,...,λT s.t. ∀i = t, λi = λt . i (10) 5 Analysis In this section we analyze the performance of the template algorithm given in the previous section. Our proof technique is based on monitoring the value of the dual objective function. The main result is the following lemma which gives upper and lower bounds for the final value of the dual objective function. 
Lemma 3 Let f be a σ-strongly convex function with respect to a norm · over a set S and assume that minw∈S f (w) = 0. Let g1 , . . . , gT be a sequence of convex and closed functions such that inf w gt (w) = 0 for all t ∈ [T ]. Suppose that a dual-incrementing algorithm which satisfies the conditions of Eq. (8) is run with f as a complexity function on the sequence g1 , . . . , gT . Let w1 , . . . , wT be the sequence of primal vectors that the algorithm generates and λT +1 , . . . , λT +1 1 T be its final sequence of dual variables. Then, there exists a sequence of sub-gradients λ1 , . . . , λT , where λt ∈ ∂t for all t, such that T 1 gt (wt ) − 2σc t=1 T T λt 2 ≤ D(λT +1 , . . . , λT +1 ) 1 T t=1 ≤ inf c f (w) + w∈S gt (w) . t=1 Proof The second inequality follows directly from the weak duality theorem. Turning to the left most inequality, denote ∆t = D(λt+1 , . . . , λt+1 ) − D(λt , . . . , λt ) and note that 1 1 T T T D(λ1 +1 , . . . , λT +1 ) can be rewritten as T T t=1 D(λT +1 , . . . , λT +1 ) = 1 T T t=1 ∆t − D(λ1 , . . . , λ1 ) = 1 T ∆t , (11) where the last equality follows from the fact that f (0) = g1 (0) = . . . = gT (0) = 0. The definition of the update implies that ∆t ≥ D(λt , . . . , λt , λt , 0, . . . , 0) − D(λt , . . . , λt , 0, 0, . . . , 0) for 1 t−1 1 t−1 t−1 some subgradient λt ∈ ∂t . Denoting θ t = − 1 j=1 λj , we now rewrite the lower bound on ∆t as, c ∆t ≥ −c (f (θ t − λt /c) − f (θ t )) − gt (λt ) . Using Lemma 2 and the definition of wt we get that 1 (12) ∆t ≥ wt , λt − gt (λt ) − 2 σ c λt 2 . Since λt ∈ ∂t and since we assume that gt is closed and convex, we can apply Lemma 1 to get that wt , λt − gt (λt ) = gt (wt ). Plugging this equality into Eq. (12) and summing over t we obtain that T T T 1 2 . t=1 ∆t ≥ t=1 gt (wt ) − 2 σ c t=1 λt Combining the above inequality with Eq. (11) concludes our proof. The following regret bound follows as a direct corollary of Lemma 3. T 1 Theorem 1 Under the same conditions of Lemma 3. Denote L = T t=1 λt w ∈ S we have, T T c f (w) 1 1 + 2L c . t=1 gt (wt ) − T t=1 gt (w) ≤ T T σ √ In particular, if c = T , we obtain the bound, 1 T 6 T t=1 gt (wt ) − 1 T T t=1 gt (w) ≤ f (w)+L/(2 σ) √ T 2 . Then, for all . Application to Online learning In Sec. 1 we cast the task of online learning as a convex repeated game. We now demonstrate the applicability of our algorithmic framework for the problem of instance ranking. We analyze this setting since several prediction problems, including binary classification, multiclass prediction, multilabel prediction, and label ranking, can be cast as special cases of the instance ranking problem. Recall that on each online round, the learner receives a question-answer pair. In instance ranking, the question is encoded by a matrix Xt of dimension kt × n and the answer is a vector yt ∈ Rkt . The semantic of yt is as follows. For any pair (i, j), if yt,i > yt,j then we say that yt ranks the i’th row of Xt ahead of the j’th row of Xt . We also interpret yt,i − yt,j as the confidence in which the i’th row should be ranked ahead of the j’th row. For example, each row of Xt encompasses a representation of a movie while yt,i is the movie’s rating, expressed as the number of stars this movie has received by a movie reviewer. The predictions of the learner are determined ˆ based on a weight vector wt ∈ Rn and are defined to be yt = Xt wt . Finally, let us define two loss functions for ranking, both generalize the hinge-loss used in binary classification problems. Denote by Et the set {(i, j) : yt,i > yt,j }. 
For all (i, j) ∈ Et we define a pair-based hinge-loss i,j (w; (Xt , yt )) = [(yt,i − yt,j ) − w, xt,i − xt,j ]+ , where [a]+ = max{a, 0} and xt,i , xt,j are respectively the i’th and j’th rows of Xt . Note that i,j is zero if w ranks xt,i higher than xt,j with a sufficient confidence. Ideally, we would like i,j (wt ; (Xt , yt )) to be zero for all (i, j) ∈ Et . If this is not the case, we are being penalized according to some combination of the pair-based losses i,j . For example, we can set (w; (Xt , yt )) to be the average over the pair losses, 1 avg (w; (Xt , yt )) = |Et | (i,j)∈Et i,j (w; (Xt , yt )) . This loss was suggested by several authors (see for example [18]). Another popular approach (see for example [5]) penalizes according to the maximal loss over the individual pairs, max (w; (Xt , yt )) = max(i,j)∈Et i,j (w; (Xt , yt )) . We can apply our algorithmic framework given in Sec. 4 for ranking, using for gt (w) either avg (w; (Xt , yt )) or max (w; (Xt , yt )). The following theorem provides us with a sufficient condition under which the regret bound from Thm. 1 holds for ranking as well. Theorem 2 Let f be a σ-strongly convex function over S with respect to a norm · . Denote by Lt the maximum over (i, j) ∈ Et of xt,i − xt,j 2 . Then, for both gt (w) = avg (w; (Xt , yt )) and ∗ gt (w) = max (w; (Xt , yt )), the following regret bound holds ∀u ∈ S, 7 1 T T t=1 gt (wt ) − 1 T T t=1 gt (u) ≤ 1 f (u)+ T PT t=1 Lt /(2 σ) √ T . The Boosting Game In this section we describe the applicability of our algorithmic framework to the analysis of boosting algorithms. A boosting algorithm uses a weak learning algorithm that generates weak-hypotheses whose performances are just slightly better than random guessing to build a strong-hypothesis which can attain an arbitrarily low error. The AdaBoost algorithm, proposed by Freund and Schapire [6], receives as input a training set of examples {(x1 , y1 ), . . . , (xm , ym )} where for all i ∈ [m], xi is taken from an instance domain X , and yi is a binary label, yi ∈ {+1, −1}. The boosting process proceeds in a sequence of consecutive trials. At trial t, the booster first defines a distribution, denoted wt , over the set of examples. Then, the booster passes the training set along with the distribution wt to the weak learner. The weak learner is assumed to return a hypothesis ht : X → {+1, −1} whose average error is slightly smaller than 1 . That is, there exists a constant γ > 0 such that, 2 def m 1−yi ht (xi ) = ≤ 1 −γ . (13) i=1 wt,i 2 2 The goal of the boosting algorithm is to invoke the weak learner several times with different distributions, and to combine the hypotheses returned by the weak learner into a final, so called strong, hypothesis whose error is small. The final hypothesis combines linearly the T hypotheses returned by the weak learner with coefficients α1 , . . . , αT , and is defined to be the sign of hf (x) where T hf (x) = t=1 αt ht (x) . The coefficients α1 , . . . , αT are determined by the booster. In Ad1 1 aBoost, the initial distribution is set to be the uniform distribution, w1 = ( m , . . . , m ). At iter1 ation t, the value of αt is set to be 2 log((1 − t )/ t ). The distribution is updated by the rule wt+1,i = wt,i exp(−αt yi ht (xi ))/Zt , where Zt is a normalization factor. Freund and Schapire [6] have shown that under the assumption given in Eq. (13), the error of the final strong hypothesis is at most exp(−2 γ 2 T ). 
t Several authors [15, 13, 8, 4] have proposed to view boosting as a coordinate-wise greedy optimization process. To do so, note first that hf errs on an example (x, y) iff y hf (x) ≤ 0. Therefore, the exp-loss function, defined as exp(−y hf (x)), is a smooth upper bound of the zero-one error, which equals to 1 if y hf (x) ≤ 0 and to 0 otherwise. Thus, we can restate the goal of boosting as minimizing the average exp-loss of hf over the training set with respect to the variables α1 , . . . , αT . To simplify our derivation in the sequel, we prefer to say that boosting maximizes the negation of the loss, that is, T m 1 (14) max − m i=1 exp −yi t=1 αt ht (xi ) . α1 ,...,αT In this view, boosting is an optimization procedure which iteratively maximizes Eq. (14) with respect to the variables α1 , . . . , αT . This view of boosting, enables the hypotheses returned by the weak learner to be general functions into the reals, ht : X → R (see for instance [15]). In this paper we view boosting as a convex repeated game between a booster and a weak learner. To motivate our construction, we would like to note that boosting algorithms define weights in two different domains: the vectors wt ∈ Rm which assign weights to examples and the weights {αt : t ∈ [T ]} over weak-hypotheses. In the terminology used throughout this paper, the weights wt ∈ Rm are primal vectors while (as we show in the sequel) each weight αt of the hypothesis ht is related to a dual vector λt . In particular, we show that Eq. (14) is exactly the Fenchel dual of a primal problem for a convex repeated game, thus the algorithmic framework described thus far for playing games naturally fits the problem of iteratively solving Eq. (14). To derive the primal problem whose Fenchel dual is the problem given in Eq. (14) let us first denote by vt the vector in Rm whose ith element is vt,i = yi ht (xi ). For all t, we set gt to be the function gt (w) = [ w, vt ]+ . Intuitively, gt penalizes vectors w which assign large weights to examples which are predicted accurately, that is yi ht (xi ) > 0. In particular, if ht (xi ) ∈ {+1, −1} and wt is a distribution over the m examples (as is the case in AdaBoost), gt (wt ) reduces to 1 − 2 t (see Eq. (13)). In this case, minimizing gt is equivalent to maximizing the error of the individual T hypothesis ht over the examples. Consider the problem of minimizing c f (w) + t=1 gt (w) where f (w) is the relative entropy given in Example 2 and c = 1/(2 γ) (see Eq. (13)). To derive its Fenchel dual, we note that gt (λt ) = 0 if there exists βt ∈ [0, 1] such that λt = βt vt and otherwise gt (λt ) = ∞ (see [16]). In addition, let us define αt = 2 γ βt . Since our goal is to maximize the αt dual, we can restrict λt to take the form λt = βt vt = 2 γ vt , and get that D(λ1 , . . . , λT ) = −c f − 1 c T βt vt t=1 =− 1 log 2γ 1 m m e− PT t=1 αt yi ht (xi ) . (15) i=1 Minimizing the exp-loss of the strong hypothesis is therefore the dual problem of the following primal minimization problem: find a distribution over the examples, whose relative entropy to the uniform distribution is as small as possible while the correlation of the distribution with each vt is as small as possible. Since the correlation of w with vt is inversely proportional to the error of ht with respect to w, we obtain that in the primal problem we are trying to maximize the error of each individual hypothesis, while in the dual problem we minimize the global error of the strong hypothesis. 
The intuition of finding distributions which in retrospect result in large error rates of individual hypotheses was also alluded in [15, 8]. We can now apply our algorithmic framework from Sec. 4 to boosting. We describe the game αt with the parameters αt , where αt ∈ [0, 2 γ], and underscore that in our case, λt = 2 γ vt . At the beginning of the game the booster sets all dual variables to be zero, ∀t αt = 0. At trial t of the boosting game, the booster first constructs a primal weight vector wt ∈ Rm , which assigns importance weights to the examples in the training set. The primal vector wt is constructed as in Eq. (6), that is, wt = f (θ t ), where θ t = − i αi vi . Then, the weak learner responds by presenting the loss function gt (w) = [ w, vt ]+ . Finally, the booster updates the dual variables so as to increase the dual objective function. It is possible to show that if the range of ht is {+1, −1} 1 then the update given in Eq. (10) is equivalent to the update αt = min{2 γ, 2 log((1 − t )/ t )}. We have thus obtained a variant of AdaBoost in which the weights αt are capped above by 2 γ. A disadvantage of this variant is that we need to know the parameter γ. We would like to note in passing that this limitation can be lifted by a different definition of the functions gt . We omit the details due to the lack of space. To analyze our game of boosting, we note that the conditions given in Lemma 3 holds T and therefore the left-hand side inequality given in Lemma 3 tells us that t=1 gt (wt ) − T T +1 T +1 1 2 , . . . , λT ) . The definition of gt and the weak learnability ast=1 λt ∞ ≤ D(λ1 2c sumption given in Eq. (13) imply that wt , vt ≥ 2 γ for all t. Thus, gt (wt ) = wt , vt ≥ 2 γ which also implies that λt = vt . Recall that vt,i = yi ht (xi ). Assuming that the range of ht is [+1, −1] we get that λt ∞ ≤ 1. Combining all the above with the left-hand side inequality T given in Lemma 3 we get that 2 T γ − 2 c ≤ D(λT +1 , . . . , λT +1 ). Using the definition of D (see 1 T Eq. (15)), the value c = 1/(2 γ), and rearranging terms we recover the original bound for AdaBoost PT 2 m 1 −yi t=1 αt ht (xi ) ≤ e−2 γ T . i=1 e m 8 Related Work and Discussion We presented a new framework for designing and analyzing algorithms for playing convex repeated games. Our framework was used for the analysis of known algorithms for both online learning and boosting settings. The framework also paves the way to new algorithms. In a previous paper [17], we suggested the use of duality for the design of online algorithms in the context of mistake bound analysis. The contribution of this paper over [17] is three fold as we now briefly discuss. First, we generalize the applicability of the framework beyond the specific setting of online learning with the hinge-loss to the general setting of convex repeated games. The setting of convex repeated games was formally termed “online convex programming” by Zinkevich [19] and was first presented by Gordon in [9]. There is voluminous amount of work on unifying approaches for deriving online learning algorithms. We refer the reader to [11, 12, 3] for work closely related to the content of this paper. By generalizing our previously studied algorithmic framework [17] beyond online learning, we can automatically utilize well known online learning algorithms, such as the EG and p-norm algorithms [12, 11], to the setting of online convex programming. 
We would like to note that the algorithms presented in [19] can be derived as special cases of our algorithmic framework 1 by setting f (w) = 2 w 2 . Parallel and independently to this work, Gordon [10] described another algorithmic framework for online convex programming that is closely related to the potential based algorithms described by Cesa-Bianchi and Lugosi [3]. Gordon also considered the problem of defining appropriate potential functions. Our work generalizes some of the theorems in [10] while providing a somewhat simpler analysis. Second, the usage of generalized Fenchel duality rather than the Lagrange duality given in [17] enables us to analyze boosting algorithms based on the framework. Many authors derived unifying frameworks for boosting algorithms [13, 8, 4]. Nonetheless, our general framework and the connection between game playing and Fenchel duality underscores an interesting perspective of both online learning and boosting. We believe that this viewpoint has the potential of yielding new algorithms in both domains. Last, despite the generality of the framework introduced in this paper, the resulting analysis is more distilled than the earlier analysis given in [17] for two reasons. (i) The usage of Lagrange duality in [17] is somehow restricted while the notion of generalized Fenchel duality is more appropriate to the general and broader problems we consider in this paper. (ii) The strongly convex property we employ both simplifies the analysis and enables more intuitive conditions in our theorems. There are various possible extensions of the work that we did not pursue here due to the lack of space. For instanc, our framework can naturally be used for the analysis of other settings such as repeated games (see [7, 19]). The applicability of our framework to online learning can also be extended to other prediction problems such as regression and sequence prediction. Last, we conjecture that our primal-dual view of boosting will lead to new methods for regularizing boosting algorithms, thus improving their generalization capabilities. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. M. Collins, R.E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 2002. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. JMLR, 7, Mar 2006. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995. Y. Freund and R.E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996. J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 2000. G. Gordon. Regret bounds for prediction problems. In COLT, 1999. G. Gordon. No-regret algorithms for online convex programs. In NIPS, 2006. A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 2001. J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Journal of Machine Learning, 45(3),2001. L. Mason, J. Baxter, P. Bartlett, and M. Frean. 
Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999. Y. Nesterov. Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2005. R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1–40, 1999. S. Shalev-Shwartz and Y. Singer. Convex repeated games and fenchel duality. Technical report, The Hebrew University, 2006. S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In COLT, 2006. J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In ESANN, April 1999. M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.
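As a rough illustration of how the framework above specializes to online learning, the sketch below (our own, with synthetic data and parameter choices that are assumptions rather than the authors' settings) takes f(w) = ½‖w‖², for which the conjugate gradient satisfies ∇f*(θ) = θ, and applies the update of Eq. (9) with the hinge loss g_t(w) = [1 − y_t⟨w, x_t⟩]_+; the result is a scaled subgradient, perceptron-style learner of the kind covered by the regret bound in Eq. (1).

import numpy as np

rng = np.random.default_rng(0)

# Synthetic online binary classification stream (assumed for illustration).
T, n = 500, 5
w_star = rng.normal(size=n)
X = rng.normal(size=(T, n))
y = np.sign(X @ w_star)

# With f(w) = 0.5 * ||w||^2, the dual-incrementing update of Eq. (9)
# reduces to a scaled subgradient step on the hinge loss; c plays sqrt(T).
c = np.sqrt(T)
theta = np.zeros(n)            # theta_t = -(1/c) * sum of dual vectors lambda_i
losses = []
for t in range(T):
    w = theta                  # w_t = grad f*(theta_t) = theta_t for this f
    margin = y[t] * (X[t] @ w)
    losses.append(max(0.0, 1.0 - margin))
    if margin < 1.0:           # a subgradient of g_t at w_t is -y_t * x_t
        theta = theta + y[t] * X[t] / c
print("average hinge loss:", np.mean(losses))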

5 0.43078825 81 nips-2006-Game Theoretic Algorithms for Protein-DNA binding

Author: Luis Pérez-breva, Luis E. Ortiz, Chen-hsiang Yeang, Tommi S. Jaakkola

Abstract: We develop and analyze game-theoretic algorithms for predicting coordinate binding of multiple DNA binding regulators. The allocation of proteins to local neighborhoods and to sites is carried out with resource constraints while explicating competing and coordinate binding relations among proteins with affinity to the site or region. The focus of this paper is on mathematical foundations of the approach. We also briefly demonstrate the approach in the context of the λ-phage switch. 1

6 0.41482937 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

7 0.41452432 141 nips-2006-Multiple timescales and uncertainty in motor adaptation

8 0.38016504 146 nips-2006-No-regret Algorithms for Online Convex Programs

9 0.36826822 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions

10 0.36149415 109 nips-2006-Learnability and the doubling dimension

11 0.35114807 193 nips-2006-Tighter PAC-Bayes Bounds

12 0.33067438 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games

13 0.33006564 186 nips-2006-Support Vector Machines on a Budget

14 0.32495582 174 nips-2006-Similarity by Composition

15 0.31540653 116 nips-2006-Learning from Multiple Sources

16 0.31502733 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints

17 0.30211365 5 nips-2006-A Kernel Method for the Two-Sample-Problem

18 0.30160528 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models

19 0.29966989 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation

20 0.29521385 96 nips-2006-In-Network PCA and Anomaly Detection


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.114), (3, 0.015), (7, 0.055), (9, 0.063), (20, 0.01), (22, 0.083), (44, 0.078), (57, 0.062), (63, 0.328), (65, 0.036), (69, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77123922 155 nips-2006-Optimal Single-Class Classification Strategies

Author: Ran El-Yaniv, Mordechai Nisenson

Abstract: We consider single-class classification (SCC) as a two-person game between the learner and an adversary. In this game the target distribution is completely known to the learner and the learner’s goal is to construct a classifier capable of guaranteeing a given tolerance for the false-positive error while minimizing the false negative error. We identify both “hard” and “soft” optimal classification strategies for different types of games and demonstrate that soft classification can provide a significant advantage. Our optimal strategies and bounds provide worst-case lower bounds for standard, finite-sample SCC and also motivate new approaches to solving SCC.
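For readers who want something executable, here is a minimal sketch (ours, not the authors' construction) of the hard low-density rejection strategy discussed in the paper, for the idealized case where the target distribution P is known exactly and the domain is finite: reject the lowest-probability elements whose total mass under P stays within the tolerance δ, so that an element drawn from the target class is rejected with probability at most δ. The alphabet and probabilities are made up for the example.

def low_density_rejection(P, delta):
    """P: dict mapping symbol -> probability; returns the set of accepted symbols.
    Rejects lowest-probability symbols whose total P-mass does not exceed delta."""
    rejected, mass = set(), 0.0
    for sym, p in sorted(P.items(), key=lambda kv: kv[1]):
        if mass + p <= delta:
            rejected.add(sym)
            mass += p
        else:
            break
    return set(P) - rejected

P = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(low_density_rejection(P, delta=0.1))   # rejects {'d'}, accepts the rest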

2 0.50479752 65 nips-2006-Denoising and Dimension Reduction in Feature Space

Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
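The abstract above builds on projecting data onto the leading kernel PCA components; the sketch below shows only that projection step with an RBF kernel. The kernel choice, its width, and the number of components are assumptions for illustration, and the paper's procedure for selecting the relevant dimensionality is not implemented here.

import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kpca_projections(X, n_components, gamma=1.0):
    """Project the training points onto the leading kernel PCA components."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    H = np.eye(n) - np.ones((n, n)) / n          # centering in feature space
    Kc = H @ K @ H
    vals, vecs = np.linalg.eigh(Kc)              # eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]       # reorder to descending
    vals = np.clip(vals, 1e-12, None)            # guard against tiny negatives
    alphas = vecs[:, :n_components] / np.sqrt(vals[:n_components])
    return Kc @ alphas                           # coordinates on leading components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Z = kpca_projections(X, n_components=5, gamma=0.5)
print(Z.shape)   # (100, 5)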

3 0.50320351 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

Author: Hamed Valizadegan, Rong Jin

Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1

4 0.49976355 20 nips-2006-Active learning for misspecified generalized linear models

Author: Francis R. Bach

Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1

5 0.49889752 175 nips-2006-Simplifying Mixture Models through Function Approximation

Author: Kai Zhang, James T. Kwok

Abstract: Finite mixture model is a powerful tool in many statistical learning problems. In this paper, we propose a general, structure-preserving approach to reduce its model complexity, which can bring significant computational benefits in many applications. The basic idea is to group the original mixture components into compact clusters, and then minimize an upper bound on the approximation error between the original and simplified models. By adopting the L2 norm as the distance measure between mixture models, we can derive closed-form solutions that are more robust and reliable than using the KL-based distance measure. Moreover, the complexity of our algorithm is only linear in the sample size and dimensionality. Experiments on density estimation and clustering-based image segmentation demonstrate its outstanding performance in terms of both speed and accuracy.
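The closed-form L2 computations mentioned in the abstract rest on the Gaussian product identity ∫ N(x; m1, v1) N(x; m2, v2) dx = N(m1 − m2; 0, v1 + v2); the sketch below uses it to evaluate the squared L2 distance between a one-dimensional mixture and a simplified version of it. The mixtures themselves are illustrative, and the paper's component-grouping algorithm is not reproduced here.

import numpy as np

def gauss_cross_term(m1, v1, m2, v2):
    """Closed form of the integral of N(x; m1, v1) * N(x; m2, v2) over x (1-D)."""
    s = v1 + v2
    return np.exp(-0.5 * (m1 - m2) ** 2 / s) / np.sqrt(2 * np.pi * s)

def l2_distance_sq(mix_a, mix_b):
    """Squared L2 distance between two 1-D Gaussian mixtures.
    Each mixture is a list of (weight, mean, variance) triples."""
    def inner(p, q):
        return sum(wp * wq * gauss_cross_term(mp, vp, mq, vq)
                   for wp, mp, vp in p for wq, mq, vq in q)
    return inner(mix_a, mix_a) + inner(mix_b, mix_b) - 2 * inner(mix_a, mix_b)

original   = [(0.5, -1.0, 0.4), (0.3, 0.5, 0.2), (0.2, 0.8, 0.3)]
simplified = [(0.5, -1.0, 0.4), (0.5, 0.62, 0.27)]   # two components merged (illustrative)
print(l2_distance_sq(original, simplified))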

6 0.49719164 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

7 0.49421021 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

8 0.49340042 194 nips-2006-Towards a general independent subspace analysis

9 0.49267539 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

10 0.49160346 61 nips-2006-Convex Repeated Games and Fenchel Duality

11 0.49082702 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

12 0.49065006 158 nips-2006-PG-means: learning the number of clusters in data

13 0.48818117 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights

14 0.48737264 21 nips-2006-AdaBoost is Consistent

15 0.48668963 117 nips-2006-Learning on Graph with Laplacian Regularization

16 0.48662293 167 nips-2006-Recursive ICA

17 0.48662078 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

18 0.48623618 131 nips-2006-Mixture Regression for Covariate Shift

19 0.48621243 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

20 0.48501605 169 nips-2006-Relational Learning with Gaussian Processes