nips nips2006 nips2006-13 knowledge-graph by maker-knowledge-mining

13 nips-2006-A Scalable Machine Learning Approach to Go


Source: pdf

Author: Lin Wu, Pierre F. Baldi

Abstract: Go is an ancient board game that poses unique opportunities and challenges for AI and machine learning. Here we develop a machine learning approach to Go, and related board games, focusing primarily on the problem of learning a good evaluation function in a scalable way. Scalability is essential at multiple levels, from the library of local tactical patterns, to the integration of patterns across the board, to the size of the board itself. The system we propose is capable of automatically learning the propensity of local patterns from a library of games. Propensity and other local tactical information are fed into a recursive neural network, derived from a Bayesian network architecture. The network integrates local information across the board and produces local outputs that represent local territory ownership probabilities. The aggregation of these probabilities provides an effective strategic evaluation function that is an estimate of the expected area at the end (or at other stages) of the game. Local area targets for training can be derived from datasets of human games. A system trained using only 9 × 9 amateur game data performs surprisingly well on a test set derived from 19 × 19 professional game data. Possible directions for further improvements are briefly discussed. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Go is an ancient board game that poses unique opportunities and challenges for AI and machine learning. [sent-3, score-0.883]

2 Here we develop a machine learning approach to Go, and related board games, focusing primarily on the problem of learning a good evaluation function in a scalable way. [sent-4, score-0.534]

3 Scalability is essential at multiple levels, from the library of local tactical patterns, to the integration of patterns across the board, to the size of the board itself. [sent-5, score-0.661]

4 The system we propose is capable of automatically learning the propensity of local patterns from a library of games. [sent-6, score-0.313]

5 Propensity and other local tactical information are fed into a recursive neural network, derived from a Bayesian network architecture. [sent-7, score-0.194]

6 The network integrates local information across the board and produces local outputs that represent local territory ownership probabilities. [sent-8, score-0.866]

7 The aggregation of these probabilities provides an effective strategic evaluation function that is an estimate of the expected area at the end (or at other stages) of the game. [sent-9, score-0.227]

8 A system trained using only 9 × 9 amateur game data performs surprisingly well on a test set derived from 19 × 19 professional game data. [sent-11, score-0.972]

9 1 Introduction. Go is an ancient board game, over 3,000 years old [6, 5], that poses unique opportunities and challenges for artificial intelligence and machine learning. [sent-13, score-0.522]

10 The rules of Go are deceptively simple: two opponents alternately place black and white stones on the empty intersections of an odd-sized square board, traditionally of size 19 × 19. [sent-14, score-0.285]

11 The goal of the game, in simple terms, is for each player to capture as much territory as possible across the board by encircling the opponent’s stones. [sent-15, score-0.697]

12 The game admits an astronomically large number of possible board configurations and, on average, on the order of 200-300 possible moves at each step of the game, preventing any form of semi-exhaustive search. [sent-18, score-0.566]

13 For comparison purposes, the game of chess has a much smaller branching factor, on the order of 35-40 [10, 7]. [sent-19, score-0.407]

14 The system ought to be able to transfer information from one board size (e.g., 9 × 9) to another (e.g., 19 × 19). [sent-34, score-0.479]

15 We take inspiration from three ingredients that seem essential to the human evaluation process in Go: the understanding of local patterns, the ability to combine patterns, and the ability to relate tactical and strategic goals. [sent-39, score-0.242]

16 The system is capable of automatically learning the propensity of local patterns from a library of games. [sent-41, score-0.313]

17 Propensity and other local tactical information are fed into a recursive neural network, derived from a Bayesian network architecture. [sent-42, score-0.194]

18 The network integrates local information across the board and produces local outputs that represent local territory ownership probabilities. [sent-43, score-0.866]

19 The aggregation of these probabilities provides an effective strategic evaluation function that is an estimate of the expected area at the end (or at other stages) of the game. [sent-44, score-0.227]

20 The main results we present here are derived on a 19 × 19 board using a player trained using only 9 × 9 game data. [sent-46, score-0.963]

21 2 Data. Because the approach described here emphasizes scalability and learning, we are able to train our system at a given board size and use it to play at different sizes, both larger and smaller. [sent-47, score-0.51]

22 Pure bootstrap approaches to Go, such as evolutionary methods or reinforcement learning, in which computer players are initialized randomly and play large numbers of games, have been tried [11]. [sent-48, score-0.199]

23 We have implemented these approaches and used them for small board sizes 5 × 5 and 7 × 7. [sent-49, score-0.492]

24 However, in our experience, these approaches do not scale up well to larger board sizes. [sent-50, score-0.447]

25 For larger board sizes, better results are obtained using training data derived from records of games played by humans. [sent-51, score-0.79]

26 We used available data at board sizes 9 × 9, 13 × 13, and 19 × 19. [sent-52, score-0.492]

27 Most of the games in this data set are played by amateurs. [sent-57, score-0.306]

28 13%) have at least one player with an olf ranking of 29, corresponding to a very good amateur player. [sent-59, score-0.263]

29 Most of the games, however, are played by rather weak players and therefore cannot be used for training. [sent-61, score-0.296]

30 For validation purposes, however, we retained a subset of 91 games where both players have an olf ranking greater than or equal to 25, the equivalent of a good amateur player. [sent-62, score-0.559]

31 Data for 19 × 19 Boards: This high-quality data set consists of 1835 games played by professional players (at least 1 dan). [sent-63, score-0.552]

32 6%) are played by 9 dan players (the highest possible ranking). [sent-65, score-0.358]

33 System Architecture: Evaluation Function, Outputs, and Targets. Because Go is a game about territory, it is sensible to have “expected territory” be the evaluation function, and to decompose this expectation as a sum of local probabilities. [sent-68, score-0.451]

34 More specifically, let A_ij(t) denote the ownership of intersection ij on the board at time t during the game. [sent-69, score-0.739]

35 At the end of a game, each intersection can be black, white, or both. [sent-70, score-0.204]

36 Let O_ij(t) be the output of the learning system at intersection ij at time t in the game. [sent-75, score-0.263]

37 In this case, the output O_ij(t) can be interpreted as the probability P_ij(t), estimated at time t, of owning intersection ij at the end of the game. [sent-78, score-0.311]
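
To make the aggregation step concrete, here is a minimal sketch of the strategic evaluation described above: the expected territory is simply the sum of the local ownership probabilities produced by the network. The function name and the NumPy usage are illustrative choices, not part of the original system.

```python
import numpy as np

def expected_territory(ownership_probs: np.ndarray) -> float:
    """Aggregate local ownership probabilities into an expected-territory score.

    ownership_probs[i, j] is the network output O_ij, interpreted as the
    probability that intersection (i, j) ends up owned by the player being
    evaluated.  Summing these probabilities gives the expected number of
    intersections owned at the end of the game.
    """
    return float(ownership_probs.sum())

# Illustrative call on a 9 x 9 board of uniform 0.5 estimates: returns 40.5.
# expected_territory(np.full((9, 9), 0.5))
```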

38 Propagation of information provided by targets/rewards computed at the end of the game only, however, can be problematic. [sent-80, score-0.409]

39 In the simulations presented here, we use a simple scheme: T_ij(t) = (1 − w) A_ij(T) + w A_ij(t + k)  (Eq. 1), where w ≥ 0 is a parameter that controls the convex combination between the area at the end of the game and the area at some step t + k in the nearer future. [sent-82, score-0.517]

40 w = 0 corresponds to the simple case described above where only the area at the end of the game is used in the target function. [sent-83, score-0.463]
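
A small sketch of how the targets of Eq. (1) could be assembled from a recorded game, assuming the per-move ownership maps A(t) are available as an array indexed by move number; the helper name and the clamping of t + k to the end of the game are assumptions of this sketch.

```python
import numpy as np

def local_targets(ownership_by_move: np.ndarray, t: int, k: int, w: float) -> np.ndarray:
    """Build the local training targets T_ij(t) of Eq. (1).

    ownership_by_move[s] is the N x N ownership map A(s) after move s, with the
    final map A(T) stored last.  w = 0 reproduces the end-of-game-only target;
    w > 0 mixes in the ownership k moves ahead of the current position t.
    """
    T = len(ownership_by_move) - 1
    near_future = ownership_by_move[min(t + k, T)]  # clamp t + k to the end of the game
    return (1.0 - w) * ownership_by_move[T] + w * near_future
```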

41 Other ways of incorporating target information from intermediate game positions are discussed briefly at the end. [sent-84, score-0.397]

42 DAG-RNN Architectures. The architecture is closely related to an architecture originally proposed for a problem in a completely different area: the prediction of protein contact maps [8, 1]. [sent-87, score-0.178]

43 As a Bayesian network, the architecture can be described in terms of the DAG in Figure 1, where the nodes are arranged in six lattice planes reflecting the spatial organization of the Go board. [sent-88, score-0.697]

44 Each plane contains N × N nodes arranged on the vertices of a square lattice. [sent-89, score-0.17]

45 In addition to the input and output planes, there are four hidden planes for the lateral propagation and integration of information across the Go board. [sent-90, score-0.32]

46 Within each hidden plane, the edges of the square lattice are oriented towards one of the four cardinal corners (NE, NW, SE, and SW). [sent-91, score-0.203]

47 Thus each intersection ij in an N × N board is associated with six units. [sent-93, score-0.668]

48 In addition, since Go is “isotropic” we use a single network shared across the four hidden planes. [sent-97, score-0.188]

49 Go, however, involves strong boundary effects, and therefore we add one neural network NC for corner positions, shared across all four corners, and one neural network NS for side positions, shared across all four sides. [sent-98, score-0.226]

50 Seki is a situation where two live groups share liberties and neither group can fill them without dying. [sent-101, score-0.284]

51 The DAG-RNN in the main simulation results uses 16 hidden nodes and 8 output nodes for the lateral propagation networks, and 16 hidden nodes and one output node for the output network. [sent-103, score-0.495]
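
The following sketch shows how the four oriented hidden planes could be unrolled in code: each plane is processed in the order induced by its corner, so a hidden vector depends only on already-computed neighbors, and the output network combines the input with the four hidden vectors at each intersection. The callables f_lateral and g_output stand in for the shared lateral and output networks, the boundary networks NC and NS are omitted for brevity, and the array shapes are assumptions of this sketch rather than the paper's exact implementation.

```python
import numpy as np

def dag_rnn_forward(inputs, f_lateral, g_output, hidden_dim=8):
    """Schematic forward pass of the DAG-RNN sketched in the text.

    inputs[i, j] holds the feature vector I_ij.  Each of the four hidden planes
    propagates information toward one corner (SE, SW, NE, NW): the hidden
    vector at (i, j) is computed from I_ij and the two already-processed
    lattice neighbors behind it in that plane's orientation.  The output unit
    combines the input vector with the four hidden vectors.
    """
    N = inputs.shape[0]
    directions = {"SE": (1, 1), "SW": (1, -1), "NE": (-1, 1), "NW": (-1, -1)}
    hidden = {d: np.zeros((N, N, hidden_dim)) for d in directions}

    for d, (di, dj) in directions.items():
        rows = range(N) if di > 0 else range(N - 1, -1, -1)
        cols = range(N) if dj > 0 else range(N - 1, -1, -1)
        for i in rows:
            for j in cols:
                prev_row = hidden[d][i - di, j] if 0 <= i - di < N else np.zeros(hidden_dim)
                prev_col = hidden[d][i, j - dj] if 0 <= j - dj < N else np.zeros(hidden_dim)
                hidden[d][i, j] = f_lateral(inputs[i, j], prev_row, prev_col)

    outputs = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            outputs[i, j] = g_output(inputs[i, j], *(hidden[d][i, j] for d in directions))
    return outputs
```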

52 Networks trained at one board size can be reused at any other board size, providing a simple mechanism for reusing and extending acquired knowledge. [sent-107, score-0.941]

53 For a board of size N × N, the training procedure scales like O(W M N^4), where W is the number of adjustable weights and M is the number of training games. [sent-108, score-0.521]

54 There are roughly N^2 board positions in a game and, for each position, N^2 outputs O_ij to be trained, hence the O(N^4) scaling. [sent-109, score-0.839]

55 Both game records and the positions within each selected game record are randomly selected during training. [sent-110, score-0.722]

56 Weights are updated essentially online, once every 10 game positions. [sent-111, score-0.361]
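
A schematic training loop consistent with the description above (random games, random positions, weight updates roughly every 10 positions); the rnn object and its forward/accumulate_gradient/apply_update methods, as well as the game-record interface, are hypothetical placeholders.

```python
import random

def train(game_records, rnn, epochs=1, update_every=10, w=0.0, k=10):
    """Schematic training loop matching the description in the text."""
    positions_seen = 0
    for _ in range(epochs):
        for game in random.sample(game_records, len(game_records)):  # shuffled pass over games
            position = random.choice(game.positions)      # random position within the game
            inputs = position.input_features()            # I_ij feature vectors (Table 1)
            targets = position.targets(w=w, k=k)          # T_ij(t) from Eq. (1)
            outputs = rnn.forward(inputs)                  # O_ij ownership probabilities
            rnn.accumulate_gradient(outputs, targets)
            positions_seen += 1
            if positions_seen % update_every == 0:
                rnn.apply_update()
```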

57 Figure 1: (a) The nodes of a DAG-RNN are regularly arranged in one input plane, one output plane, and four hidden planes. [sent-117, score-0.227]

58 All the edges of the square lattice in each hidden plane are oriented in the direction of one of the four possible cardinal corners: NE, NW, SW, and SE. [sent-120, score-0.298]

59 Additional directed edges run vertically in column from the input plane to each hidden plane and from each hidden plane to the output plane. [sent-121, score-0.477]

60 The input node is connected to four corresponding hidden nodes, one for each hidden plane. [sent-123, score-0.231]

61 The input node and the hidden nodes are connected to the output node. [sent-124, score-0.202]

62 Connections of each hidden node to its lattice neighbors within the same plane are also shown. [sent-127, score-0.217]

63 Inputs. At a given board intersection, the input vector I_ij has multiple components, listed in Table 1. [sent-129, score-0.447]

64 The group Gij associated with a given intersection is the maximal set of stones of the same color that are connected to it. [sent-133, score-0.326]

65 The idea of using higher order liberties is from Werf [13]. [sent-135, score-0.284]

66 O1st and O2nd provide the number of true eyes and the number of liberties of the weakest and the second weakest neighboring opponent groups. [sent-136, score-0.585]

67 Table 1 (input features): b, w, e is the stone type (black, white, or empty); influence is the influence from the stones of the same color and of the opposing color; propensity is a local statistic computed from 3 × 3 patterns in the training data (section 3.3). [sent-141, score-0.625]

68 Neye is the number of true eyes; N1st is the number of liberties, i.e., the number of empty intersections connected to a group of stones; N2nd, N3rd, and N4th are the corresponding higher-order liberties; O1st and O2nd summarize the weakest and second-weakest neighboring opponent groups. [sent-142, score-0.18]
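
Several Table 1 features reduce to identifying the group G_ij and counting its liberties; a minimal flood-fill sketch is given below, using a plain list-of-lists board with 'b', 'w', 'e' markers as in the feature description (the function name and representation are illustrative).

```python
def group_and_liberties(board, i, j):
    """Return the group G_ij containing (i, j) and its first-order liberties.

    board[i][j] is 'b', 'w', or 'e' (empty).  The group is the maximal set of
    same-colored stones connected to (i, j); its liberties are the empty
    intersections adjacent to any stone in the group.
    """
    color = board[i][j]
    N = len(board)
    group, liberties, stack = set(), set(), [(i, j)]
    while stack:
        x, y = stack.pop()
        if (x, y) in group:
            continue
        group.add((x, y))
        for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            if 0 <= nx < N and 0 <= ny < N:
                if board[nx][ny] == color:
                    stack.append((nx, ny))
                elif board[nx][ny] == "e":
                    liberties.add((nx, ny))
    return group, liberties
```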

69 In Chen’s exact method, any opponent stone can block the propagation of influence. [sent-147, score-0.278]

70 With a stringent influence propagation rule, an opponent stone can block the propagation of influence if and only if it is stronger than the stone emitting the influence. [sent-148, score-0.457]

71 Considering rotation and mirror symmetries, there are 10 unique locations for a 3 × 3 window on a 9 × 9 board (see also [9]). [sent-152, score-0.482]

72 Given any 3 × 3 pattern of stones on the board and a set of games, we then compute nine numbers, one for each intersection. [sent-153, score-0.571]

73 Thus S^w_ij(p) is an empirical normalized estimate of the local differential propensity towards conquering the corresponding intersection in the local context provided by the corresponding pattern and window. [sent-158, score-0.41]

74 In general, a given intersection ij on the board is covered by several 3 × 3 windows. [sent-159, score-0.668]

75 Thus, for a given intersection ij on a given board, we can compute a value S^w_ij(p) for each different window w that contains the intersection. [sent-160, score-0.256]
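
A sketch of how the pattern statistics S^w_ij(p) could be accumulated from a game collection. The extracted text specifies only that the statistic is a normalized empirical estimate of differential propensity, so the particular tally used here (black minus white ownership frequency at the end of the game) and the helper names (position.pattern, window.intersections, game.final_owner) are assumptions.

```python
from collections import defaultdict

def pattern_propensity(games, windows):
    """Accumulate a differential propensity estimate per (window, 3 x 3 pattern)."""
    counts = defaultdict(int)                      # (window, pattern) -> occurrences
    tally = defaultdict(lambda: [0.0] * 9)         # (window, pattern) -> per-intersection sum
    for game in games:
        for position in game.positions:
            for window in windows:                 # e.g., the 10 unique 3 x 3 locations on 9 x 9
                p = position.pattern(window)
                counts[(window, p)] += 1
                for idx, (i, j) in enumerate(window.intersections()):
                    owner = game.final_owner(i, j)          # 'b', 'w', or shared
                    tally[(window, p)][idx] += {"b": 1.0, "w": -1.0}.get(owner, 0.0)
    return {key: [s / counts[key] for s in tally[key]] for key in counts}
```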

76 Alternatively, Gibbs sampling can be used to choose the next move among all the legal moves with a probability proportional to e^{E/Temp}, where Temp is a temperature parameter [3, 11, 12]. [sent-166, score-0.307]
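
A minimal sketch of the Gibbs/Boltzmann move selection just described, sampling a legal move with probability proportional to exp(E/Temp); the evaluate helper (e.g., the expected-territory score after a 1-ply lookahead) and the move representation are assumptions.

```python
import math
import random

def sample_move(legal_moves, evaluate, temperature):
    """Sample a move with probability proportional to exp(E / Temp)."""
    scaled = [evaluate(m) / temperature for m in legal_moves]
    max_s = max(scaled)                                   # subtract the max for numerical stability
    weights = [math.exp(s - max_s) for s in scaled]
    r = random.uniform(0.0, sum(weights))
    for move, weight in zip(legal_moves, weights):
        r -= weight
        if r <= 0.0:
            return move
    return legal_moves[-1]
```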

77 7 We trained a large number of players using the methods described above. [sent-169, score-0.214]

78 Not surprisingly, when used to play at larger board sizes, the RNNs trained at these small board sizes yield rather weak players. [sent-173, score-1.018]

79 The quality of most 13 × 13 games available to us is too poor for proper training, although a small subset can be used for validation purposes. [sent-174, score-0.237]

80 Thus the most interesting results we report are derived by training the RNNs using the 9×9 game data, and using them to play at 9 × 9 and, more importantly, at larger board sizes. [sent-177, score-0.877]

81 (b) Percentage of moves made by professional human players on boards of size 19 × 19 that are contained in the m top-ranked moves according to the DAG-RNN trained on 9 × 9 amateur data, for various values of m. [sent-194, score-0.807]

82 Note also how the error is smaller towards the end of the game due both to the reduction in the number of possible moves and the strong end-of-game training signal. [sent-200, score-0.565]

83 Thus we can compute the average probability of moves played by good human players according to the DAG-RNN or other probabilistic systems such as [12]. [sent-203, score-0.459]
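
One straightforward way to compute the kind of numbers reported in Table 2 is to average the log-probability that a system assigns to each human move in the test set; the move_distribution interface below is an assumed placeholder for whatever probabilistic move model is being evaluated.

```python
import math

def mean_log_probability(test_positions, move_distribution):
    """Average log-probability assigned to the human moves in a test set.

    move_distribution(position) is assumed to return a dict mapping each legal
    move to the probability the system assigns to it, and position.human_move
    to be the move actually played; exponentiating the result gives the kind
    of per-move probability quoted in the text (e.g., 1/281 for random play).
    """
    total = 0.0
    for position in test_positions:
        probs = move_distribution(position)
        total += math.log(probs[position.human_move])
    return total / len(test_positions)
```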

84 In Table 2, we report such probabilities for several systems and at different board sizes. [sent-204, score-0.487]

85 At this size, the probabilities in Table 2: Probabilities assigned by different systems to moves played by human players in test data. [sent-207, score-0.499]

86 Board Size 9×9 9×9 13 × 13 13 × 13 19 × 19 19 × 19 19 × 19 19 × 19 System Random player RNN(1-ply search) Random player RNN(1-ply search) Random player Boltzmann5 BoltzmannLiberties RNN(1-ply search) Log Probability -4. [sent-208, score-0.324]

87 For boards of size 19 × 19, a random player that selects moves uniformly at random among legal moves assigns a probability of 1/281 to the moves played by professional players in the data set. [sent-217, score-0.993]

88 Our best DAG-RNNs trained using amateur data at 9 × 9 are capable of raising this probability to 1/15 (also a considerable improvement over our previous 1/42 performance presented in April 2006 at the Snowbird Learning Conference). [sent-219, score-0.171]

89 A remarkable example where the top ranked move according to the DAG-RNN coincides with the move actually played in a game between two very highly-ranked players is given in Figure 3, illustrating also the underlying probabilistic territory calculations. [sent-220, score-0.941]

90 For each intersection, the height of the green bar represents the estimated probability that the intersection will be owned by black at the end of the game. [sent-222, score-0.256]

91 Indeed this is the move selected in the game played by Zhou, Heyang (black, 8 dan) and Chang, Hao (white, 9 dan) on 10/22/2000. [sent-226, score-0.561]

92 Figure 2b provides a kind of ROC curve by displaying the percentage of moves made by professional human players on boards of size 19 × 19 that are contained in the m top-ranked moves according to the DAG-RNN trained on 9 × 9 amateur data, for various values of m, across all phases of the game. [sent-227, score-0.748]

93 For instance, when there are 80 stones on the board, and hence on the order of 300 legal moves available, there is a 50% chance that a move selected by a very highly ranked human player (dan 9) is found among the top 30 choices produced by the DAG-RNN. [sent-228, score-0.511]
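
The curve of Figure 2b can be reproduced, for a given m, by counting how often the human move appears among the model's m top-ranked moves; rank_moves is an assumed helper returning the legal moves sorted by the evaluation function.

```python
def top_m_hit_rate(test_positions, rank_moves, m):
    """Fraction of human moves found among the model's m top-ranked moves.

    Sweeping m and plotting the result traces out the kind of curve shown
    in Figure 2b.
    """
    hits = sum(1 for pos in test_positions if pos.human_move in rank_moves(pos)[:m])
    return hits / len(test_positions)
```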

94 5 Conclusion. We have designed a DAG-RNN for the game of Go and demonstrated that it can learn territory predictions fairly well. [sent-229, score-0.503]

95 Systems trained using only a set of 9 × 9 amateur games achieve surprisingly good performance on a 19 × 19 test set that contains 1835 professionally played games. [sent-230, score-0.556]

96 Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. [sent-275, score-0.183]

97 SVM and Pattern-Enriched Common Fate Graphs for the game of Go. [sent-278, score-0.361]

98 Temporal difference learning of position evaluation in the game of Go. [sent-290, score-0.409]

99 Learning to score final positions in the game of Go. [sent-304, score-0.361]

100 A new hashing method with application for game playing. [sent-309, score-0.361]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('board', 0.447), ('game', 0.361), ('liberties', 0.284), ('games', 0.177), ('propensity', 0.17), ('players', 0.167), ('intersection', 0.156), ('oij', 0.142), ('territory', 0.142), ('go', 0.137), ('se', 0.131), ('played', 0.129), ('nw', 0.127), ('amateur', 0.124), ('stones', 0.124), ('stone', 0.124), ('sw', 0.119), ('moves', 0.119), ('boards', 0.108), ('player', 0.108), ('opponent', 0.099), ('plane', 0.095), ('sij', 0.087), ('professional', 0.079), ('hidden', 0.075), ('ownership', 0.071), ('tactical', 0.071), ('weakest', 0.071), ('move', 0.071), ('planes', 0.066), ('ij', 0.065), ('uence', 0.065), ('architecture', 0.062), ('dan', 0.062), ('bij', 0.062), ('hij', 0.062), ('rnn', 0.062), ('eyes', 0.06), ('validation', 0.06), ('patterns', 0.058), ('propagation', 0.055), ('area', 0.054), ('boltzmannliberties', 0.053), ('iij', 0.053), ('black', 0.052), ('network', 0.049), ('end', 0.048), ('evaluation', 0.048), ('lattice', 0.047), ('lateral', 0.047), ('trained', 0.047), ('cardinal', 0.046), ('chess', 0.046), ('connected', 0.046), ('legal', 0.045), ('sizes', 0.045), ('human', 0.044), ('library', 0.043), ('symmetries', 0.042), ('local', 0.042), ('output', 0.042), ('search', 0.04), ('probabilities', 0.04), ('intersections', 0.039), ('tij', 0.039), ('challenges', 0.039), ('scalable', 0.039), ('nodes', 0.039), ('aij', 0.038), ('training', 0.037), ('gij', 0.037), ('strategic', 0.037), ('hi', 0.037), ('wij', 0.037), ('arranged', 0.036), ('alphabetical', 0.036), ('ancient', 0.036), ('emp', 0.036), ('intermediary', 0.036), ('rnns', 0.036), ('seki', 0.036), ('werf', 0.036), ('four', 0.035), ('window', 0.035), ('chen', 0.035), ('empty', 0.035), ('white', 0.035), ('targets', 0.033), ('programs', 0.032), ('play', 0.032), ('transfer', 0.032), ('recursive', 0.032), ('outputs', 0.031), ('scalability', 0.031), ('brute', 0.031), ('olf', 0.031), ('stern', 0.031), ('corners', 0.03), ('architectures', 0.03), ('shared', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 13 nips-2006-A Scalable Machine Learning Approach to Go

Author: Lin Wu, Pierre F. Baldi

Abstract: Go is an ancient board game that poses unique opportunities and challenges for AI and machine learning. Here we develop a machine learning approach to Go, and related board games, focusing primarily on the problem of learning a good evaluation function in a scalable way. Scalability is essential at multiple levels, from the library of local tactical patterns, to the integration of patterns across the board, to the size of the board itself. The system we propose is capable of automatically learning the propensity of local patterns from a library of games. Propensity and other local tactical information are fed into a recursive neural network, derived from a Bayesian network architecture. The network integrates local information across the board and produces local outputs that represent local territory ownership probabilities. The aggregation of these probabilities provides an effective strategic evaluation function that is an estimate of the expected area at the end (or at other stages) of the game. Local area targets for training can be derived from datasets of human games. A system trained using only 9 × 9 amateur game data performs surprisingly well on a test set derived from 19 × 19 professional game data. Possible directions for further improvements are briefly discussed. 1

2 0.33321494 26 nips-2006-An Approach to Bounded Rationality

Author: Eli Ben-sasson, Ehud Kalai, Adam Kalai

Abstract: A central question in game theory and artificial intelligence is how a rational agent should behave in a complex environment, given that it cannot perform unbounded computations. We study strategic aspects of this question by formulating a simple model of a game with additional costs (computational or otherwise) for each strategy. First we connect this to zero-sum games, proving a counter-intuitive generalization of the classic min-max theorem to zero-sum games with the addition of strategy costs. We then show that potential games with strategy costs remain potential games. Both zero-sum and potential games with strategy costs maintain a very appealing property: simple learning dynamics converge to equilibrium. 1

3 0.18262476 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games

Author: Chris Murray, Geoffrey J. Gordon

Abstract: In real-world planning problems, we must reason not only about our own goals, but about the goals of other agents with which we may interact. Often these agents’ goals are neither completely aligned with our own nor directly opposed to them. Instead there are opportunities for cooperation: by joining forces, the agents can all achieve higher utility than they could separately. But, in order to cooperate, the agents must negotiate a mutually acceptable plan from among the many possible ones, and each agent must trust that the others will follow their parts of the deal. Research in multi-agent planning has often avoided the problem of making sure that all agents have an incentive to follow a proposed joint plan. On the other hand, while game theoretic algorithms handle incentives correctly, they often don’t scale to large planning problems. In this paper we attempt to bridge the gap between these two lines of research: we present an efficient game-theoretic approximate planning algorithm, along with a negotiation protocol which encourages agents to compute and agree on joint plans that are fair and optimal in a sense defined below. We demonstrate our algorithm and protocol on two simple robotic planning problems.1 1 INTRODUCTION We model the multi-agent planning problem as a general-sum stochastic game with cheap talk: the agents observe the state of the world, discuss their plans with each other, and then simultaneously select their actions. The state and actions determine a one-step reward for each player and a distribution over the world’s next state, and the process repeats. While talking allows the agents to coordinate their actions, it cannot by itself solve the problem of trust: the agents might lie or make false promises. So, we are interested in planning algorithms that find subgame-perfect Nash equilibria. In a subgame-perfect equilibrium, every deviation from the plan is deterred by the threat of a suitable punishment, and every threatened punishment is believable. To find these equilibria, planners must reason about their own and other agents’ incentives to deviate: if other agents have incentives to deviate then I can’t trust them, while if I have an incentive to deviate, they can’t trust me. In a given game there may be many subgame-perfect equilibria with widely differing payoffs: some will be better for some agents, and others will be better for other agents. It is generally not feasible to compute all equilibria [1], and even if it were, there would be no obvious way 1 We gratefully acknowledge help and comments from Ron Parr on this research. This work was supported in part by DARPA contracts HR0011-06-0023 (the CS2P program) and 55-00069 (the RADAR program). All opinions, conclusions, and errors are our own. to select one to implement. It does not make sense for the agents to select an equilibrium without consulting one another: there is no reason that agent A’s part of one joint plan would be compatible with agent B’s part of another joint plan. Instead the agents must negotiate, computing and proposing equilibria until they find one which is acceptable to all parties. 
This paper describes a planning algorithm and a negotiation protocol which work together to ensure that the agents compute and select a subgame-perfect Nash equilibrium which is both approximately Pareto-optimal (that is, its value to any single agent cannot be improved very much without lowering the value to another another agent) and approximately fair (that is, near the so-called Nash bargaining point). Neither the algorithm nor the protocol is guaranteed to work in all games; however, they are guaranteed correct when they are applicable, and applicability is easy to check. In addition, our experiments show that they work well in some realistic situations. Together, these properties of fairness, enforceability, and Pareto optimality form a strong solution concept for a stochastic game. The use of this definition is one characteristic that distinguishes our work from previous research: ours is the first efficient algorithm that we know of to use such a strong solution concept for stochastic games. Our planning algorithm performs dynamic programming on a set-based value function: for P players, at a state s, V ∈ V(s) ⊂ RP is an estimate of the value the players can achieve. We represent V(s) by sampling points on its convex hull. This representation is conservative, i.e., guarantees that we find a subset of the true V∗ (s). Based on the sampled points we can efficiently compute one-step backups by checking which joint actions are enforceable in an equilibrium. Our negotiation protocol is based on a multi-player version of Rubinstein’s bargaining game. Players together enumerate a set of equilibria, and then take turns proposing an equilibrium from the set. Until the players agree, the protocol ends with a small probability after each step and defaults to a low-payoff equilibrium; the fear of this outcome forces players to make reasonable offers. 2 2.1 BACKGROUND STOCHASTIC GAMES A stochastic game represents a multi-agent planning problem in the same way that a Markov Decision Process [2] represents a single-agent planning problem. As in an MDP, transitions in a stochastic game depend on the current state and action. Unlike MDPs, the current (joint) action is a vector of individual actions, one for each player. More formally, a generalsum stochastic game G is a tuple (S, sstart , P, A, T, R, γ). S is a set of states, and sstart ∈ S is the start state. P is the number of players. A = A1 ×A2 ×. . .×AP is the finite set of joint actions. We deal with fully observable stochastic games with perfect monitoring, where all players can observe previous joint actions. T : S × A → P (S) is the transition function, where P (S) is the set of probability distributions over S. R : S × A → RP is the reward function. We will write Rp (s, a) for the pth component of R(s, a). γ ∈ [0, 1) is the discount factor. Player p wants to maximize her discounted total value for the observed sequence ∞ of states and joint actions s1 , a1 , s2 , a2 , . . ., Vp = t=1 γ t−1 Rp (st , at ). A stationary policy for player p is a function πp : S → P (Ap ). A stationary joint policy is a vector of policies π = (π1 , . . . , πP ), one for each player. A nonstationary policy for player p is a function πp : (∪∞ (S × A)t × S) → P (Ap ) which takes a history of states and joint actions and t=0 produces a distribution over player p’s actions; we can define a nonstationary joint policy analogously. For any nonstationary joint policy, there is a stationary policy that achieves the same value at every state [3]. 
The value function Vpπ : S → R gives expected values for player p under joint policy π. The value vector at state s, Vπ (s), is the vector with components Vpπ (s). (For a nonstationary policy π we will define Vpπ (s) to be the value if s were the start state, and Vpπ (h) to be the value after observing history h.) A vector V is feasible at state s if there is a π for which Vπ (s) = V, and we will say that π achieves V. We will assume public randomization: the agents can sample from a desired joint action distribution in such a way that everyone can verify the outcome. If public randomization is not directly available, there are cryptographic protocols which can simulate it [4]. This assumption means that the set of feasible value vectors is convex, since we can roll a die at the first time step to choose from a set of feasible policies. 2.2 EQUILIBRIA While optimal policies for MDPs can be determined exactly via various algorithms such as linear programming [2], it isn’t clear what it means to find an optimal policy for a general sum stochastic game. So, rather than trying to determine a unique optimal policy, we will define a set of reasonable policies: the Pareto-dominant subgame-perfect Nash equilibria. A (possibly nonstationary) joint policy π is a Nash equilibrium if, for each individual player, no unilateral deviation from the policy would increase that player’s expected value for playing the game. Nash equilibria can contain incredible threats, that is, threats which the agents have no intention of following through on. To remove this possibility, we can define the subgame-perfect Nash equilibria. A policy π is a subgame-perfect Nash equilibrium if it is a Nash equilibrium in every possible subgame: that is, if there is no incentive for any player to deviate after observing any history of joint actions. Finally, consider two policies π and φ. If Vpπ (sstart ) ≥ Vpφ (sstart ) for all players p, and if Vpπ (sstart ) > Vpφ (sstart ) for at least one p, then we will say that π Pareto dominates φ. A policy which is not Pareto dominated by any other policy is Pareto optimal. 2.3 RELATED WORK Littman and Stone [5] give an algorithm for finding Nash equilibria in two-player repeated games. Hansen et al. [6] show how to eliminate very-weakly-dominated strategies in partially observable stochastic games. Doraszelski and Judd [7] show how to compute Markov perfect equilibria in continuous-time stochastic games. The above papers use solution concepts much weaker than Pareto-dominant subgame-perfect equilibrium, and do not address negotiation and coordination. Perhaps the closest work to the current paper is by Brafman and Tennenholtz [8]: they present learning algorithms which, in repeated self-play, find Pareto-dominant (but not subgame-perfect) Nash equilibria in matrix and stochastic games. By contrast, we consider a single play of our game, but allow “cheap talk” beforehand. And, our protocol encourages arbitrary algorithms to agree on Pareto-dominant equilibria, while their result depends strongly on the self-play assumption. 2.3.1 FOLK THEOREMS In any game, each player can guarantee herself an expected discounted value regardless of what actions the other players takes. We call this value the safety value. Suppose that there is a stationary subgame-perfect equilibrium which achieves the safety value for both players; call this the safety equilibrium policy. Suppose that, in a repeated game, some stationary policy π is better for both players than the safety equilibrium policy. 
Then we can build a subgame-perfect equilibrium with the same payoff as π: start playing π, and if someone deviates, switch to the safety equilibrium policy. So long as γ is sufficiently large, no rational player will want to deviate. This is the folk theorem for repeated games: any feasible value vector which is strictly better than the safety values corresponds to a subgame-perfect Nash equilibrium [9]. (The proof is slightly more complicated if there is no safety equilibrium policy, but the theorem holds for any repeated game.) There is also a folk theorem for general stochastic games [3]. This theorem, while useful, is not strong enough for our purposes: it only covers discount factors γ which are so close to 1 that the players don't care which state they wind up in after a possible deviation. In most practical stochastic games, discount factors this high are unreasonably patient. When γ is significantly less than 1, the set of equilibrium vectors can change in strange ways as we change γ [10]. Figure 1 (axes: value to player 1 vs. value to player 2): Equilibria of a Rubinstein game with γ = 0.8. Shaded area shows feasible value vectors (U1(x), U2(x)) for outcomes x. Right-hand circle corresponds to equilibrium when player 1 moves first, left-hand circle when player 2 moves first. The Nash point is also marked. 2.3.2 RUBINSTEIN'S GAME Rubinstein [11] considered a game where two players divide a slice of pie. The first player offers a division x, 1 − x to the second; the second player either accepts the division, or refuses and offers her own division 1 − y, y. The game repeats until some player accepts an offer or until either player gives up. In the latter case neither player gets any pie. Rubinstein showed that if player p's utility for receiving a fraction x at time t is Up(x, t) = γ^t Up(x) for a discount factor 0 ≤ γ < 1 and an appropriate time-independent utility function Up(x) ≥ 0, then rational players will agree on a division near the so-called Nash bargaining point. This is the point which maximizes the product of the utilities that the players gain by cooperating, U1(x)U2(1 − x). As γ ↑ 1, the equilibrium will approach the Nash point. See Fig. 1 for an illustration. For three or more players, a similar result holds where agents take turns proposing multi-way divisions of the pie [12]. See the technical report [13] for more detail on the multi-player version of Rubinstein's game and the Nash bargaining point. 3 NEGOTIATION PROTOCOL The Rubinstein game implicitly assumes that the result of a failure to cooperate is known to all players: nobody gets any pie. The multi-player version of the game assumes in addition that giving one player a share of the pie doesn't force us to give a share to any other player. Neither of these properties holds for general stochastic games. They are, however, easy to check, and often hold or can be made to hold for planning domains of interest. So, we will assume that the players have agreed beforehand on a subgame-perfect equilibrium π^dis, called the disagreement policy, that they will follow in the event of a negotiation failure. In addition, for games with three or more players, we will assume that each player can unilaterally reduce her own utility by any desired amount without affecting other players' utilities. Given these assumptions, our protocol proceeds in two phases (pseudocode is given in the technical report [13]). In the first phase agents compute subgame-perfect equilibria and take turns revealing them.
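Before the protocol details continue, here is a minimal numeric sketch of the Rubinstein/Nash-bargaining machinery just described, assuming linear utilities U1(x) = x and U2(x) = x (an illustrative choice made for this sketch, not one taken from the paper): it grid-searches the Nash bargaining point and shows that the first proposer's subgame-perfect share in the alternating-offers game approaches it as γ ↑ 1.

```python
import numpy as np

def nash_bargaining_point(U1, U2, grid=10001):
    """Grid-search the division x that maximizes the Nash product U1(x) * U2(1 - x)."""
    xs = np.linspace(0.0, 1.0, grid)
    products = U1(xs) * U2(1.0 - xs)
    return xs[np.argmax(products)]

def rubinstein_split(gamma):
    """First proposer's subgame-perfect share with linear utilities:
    she keeps 1/(1+gamma) and offers the responder gamma/(1+gamma)."""
    return 1.0 / (1.0 + gamma)

U1 = lambda x: x
U2 = lambda x: x
print(nash_bargaining_point(U1, U2))        # 0.5, the Nash point for linear utilities
for gamma in (0.8, 0.95, 0.999):
    print(gamma, rubinstein_split(gamma))   # approaches 0.5 as gamma -> 1
```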
On an agent’s turn she either reveals an equilibrium or passes; if all agents pass consecutively, the protocol proceeds to the second phase. When an agent states a policy π, the other agents verify that π is a subgame-perfect equilibrium and calculate its payoff vector Vπ (sstart ); players who state non-equilibrium policies miss their turn. At the end of the first phase, suppose the players have revealed a set Π of policies. Define Xp (π) = Vpπ (sstart ) − Vpdis (sstart ) U = convhull {X(π) | π ∈ Π} U = {u ≥ 0 | (∃v ∈ U | u ≤ v)} where Vdis is the value function of π dis , Xp (π) is the excess of policy π for player p, and U is the set of feasible excess vectors. In the second phase, players take turns proposing points u ∈ U along with policies or mixtures of policies in Π that achieve them. After each proposal, all agents except the pro- poser decide whether to accept or reject. If everyone accepts, the proposal is implemented: everyone starts executing the agreed equilibrium. Otherwise, the players who accepted are removed from future negotiation and have their utilities fixed at the proposed levels. Fixing player p’s utility at up means that all future proposals must give p exactly up . Invalid proposals cause the proposer to lose her turn. To achieve this, the proposal may require p to voluntarily lower her own utility; this requirement is enforced by the threat that all players will revert to π dis if p fails to act as required. If at some point we hit the chance of having the current round of communication end, all remaining players are assigned their disagreement values. The players execute the last proposed policy π (or π dis if there has been no valid proposal), and any player p for whom Vpπ (sstart ) is greater than her assigned utility up voluntarily lowers her utility to the correct level. (Again, failure to do so results in all players reverting to π dis .) Under the above protocol, player’s preferences are the same as in a Rubinstein game with utility set U: because we have assumed that negotiation ends with probability after each message, agreeing on u after t additional steps is exactly as good as agreeing on u(1− )t now. So with sufficiently small, the Rubinstein or Krishna-Serrano results show that rational players will agree on a vector u ∈ U which is close to the Nash point argmaxu∈U Πp up . 4 COMPUTING EQUILIBRIA In order to use the protocol of Sec. 3 for bargaining in a stochastic game, the players must be able to compute some subgame-perfect equilibria. Computing equilibria is a hard problem, so we cannot expect real agents to find the entire set of equilibria. Fortunately, each player will want to find the equilibria which are most advantageous to herself to influence the negotiation process in her favor. But equilibria which offer other players reasonably high reward have a higher chance of being accepted in negotiation. So, self interest will naturally distribute the computational burden among all the players. In this section we describe an efficient dynamic-programming algorithm for computing equilibria. The algorithm takes some low-payoff equilibria as input and (usually) outputs higherpayoff equilibria. It is based on the intuition that we can use low-payoff equilibria as enforcement tools: by threatening to switch to an equilibrium that has low value to player p, we can deter p from deviating from a cooperative policy. pun pun In more detail, we will assume that we are given P different equilibria π1 , . . . 
, π_P^pun; we will use π_p^pun to punish player p if she deviates. We can set π_p^pun = π^dis for all p if π^dis is the only equilibrium we know; or, we can use any other equilibrium policies that we happen to have discovered. The algorithm will be most effective when the value of π_p^pun to player p is as low as possible in all states. We will then search for cooperative policies that we can enforce with the given threats π_p^pun. We will first present an algorithm which pretends that we can efficiently take direct sums and convex hulls of arbitrary sets. This algorithm is impractical, but finds all enforceable value vectors. We will then turn it into an approximate algorithm which uses finite data structures to represent the set-valued variables. As we allow more and more storage for each set, the approximate algorithm will approach the exact one; and in any case the result will be a set of equilibria which the agents can execute. 4.1 THE EXACT ALGORITHM Our algorithm maintains a set of value vectors V(s) for each state s. It initializes V(s) to a set which we know contains the value vectors for all equilibrium policies. It then refines V by dynamic programming: it repeatedly attempts to improve the set of values at each state by backing up all of the joint actions, excluding joint actions from which some agent has an incentive to deviate. In more detail, we will compute V_p^dis(s) ≡ V_p^{π^dis}(s) for all s and p and use the vector V^dis(s) in our initialization. (Recall that we have defined V_p^π(s) for a nonstationary policy π as the value of π if s were the start state.) We also need the values of the punishment policies for their corresponding players, V_p^pun(s) ≡ V_p^{π_p^pun}(s) for all p and s. Given these values, define
Q_p^dev(s, a) = R_p(s, a) + γ Σ_{s′∈S} T(s, a)(s′) V_p^pun(s′)   (1)
to be the value to player p of playing joint action a from state s and then following π_p^pun forever after. From the above Q^dev values we can compute player p's value for deviating from an equilibrium which recommends action a in state s: it is Q_p^dev(s, a′) for the best possible deviation a′, since p will get the one-step payoff for a′ but be punished by the rest of the players starting on the following time step. That is,
V_p^dev(s, a) = max_{a′_p ∈ A_p} Q_p^dev(s, a_1 × . . . × a′_p × . . . × a_P)   (2)
V_p^dev(s, a) is the value we must achieve for player p in state s if we are planning to recommend action a and punish deviations with π_p^pun: if we do not achieve this value, player p would rather deviate and be punished. Our algorithm is shown in Fig. 2:
Figure 2: Dynamic programming using exact operations on sets of value vectors.
  Initialization: for s ∈ S: V(s) ← {V | V_p^dis(s) ≤ V_p ≤ R_max/(1 − γ)}
  Repeat until converged (for iteration ← 1, 2, . . .): for s ∈ S:
    (compute the value vector set for each joint action, then throw away unenforceable vectors)
    for a ∈ A:
      Q(s, a) ← {R(s, a)} + γ Σ_{s′∈S} T(s, a)(s′) V(s′)
      Q(s, a) ← {Q ∈ Q(s, a) | Q ≥ V^dev(s, a)}
    (we can now randomize among joint actions)
    V(s) ← convhull ∪_a Q(s, a)
After k iterations, each vector in V(s) corresponds to a k-step policy in which no agent ever has an incentive to deviate. In the (k + 1)st iteration, the first assignment to Q(s, a) computes the value of performing action a followed by any k-step policy. The second assignment throws out the pairs (a, π) for which some agent would want to deviate from a given that the agents plan to follow π in the future.
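The following sketch spells out Eqs. (1)-(2) in code for a tiny hypothetical game; the transition, reward, and punishment-value arrays are random stand-ins, not numbers from the paper. Q_p^dev is the value of a one-step deviation followed by permanent punishment, and V_p^dev takes the best such deviation for player p.

```python
import numpy as np
from itertools import product

n_states, n_act, P, gamma = 3, 2, 2, 0.9
joint_actions = list(product(range(n_act), repeat=P))

rng = np.random.default_rng(1)
T = rng.dirichlet(np.ones(n_states), size=(n_states, len(joint_actions)))  # T[s, a, s']
R = rng.uniform(0, 1, size=(n_states, len(joint_actions), P))              # R[s, a, p]
V_pun = rng.uniform(0, 5, size=(P, n_states))                              # V_p^pun(s), a stand-in

def Q_dev(p, s, a_idx):
    """Eq. (1): value to p of playing joint action a at s, then being punished forever."""
    return R[s, a_idx, p] + gamma * T[s, a_idx] @ V_pun[p]

def V_dev(p, s, a):
    """Eq. (2): best deviation value for p when joint action a is recommended at s."""
    best = -np.inf
    for ap in range(n_act):
        a_dev = tuple(ap if q == p else a[q] for q in range(P))
        best = max(best, Q_dev(p, s, joint_actions.index(a_dev)))
    return best

print(V_dev(0, 0, joint_actions[0]))
```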
And the convex hull accounts for the fact that, on reaching state s, we can select an action a and future policy π at random from the feasible pairs. (Footnote 2: It is important for this randomization to occur after reaching state s to avoid introducing incentives to deviate, and it is also important for the randomization to be public.) Proofs of convergence and correctness of the exact algorithm are in the technical report [13]. Of course, we cannot actually implement the algorithm of Fig. 2, since it requires variables whose values are convex sets of vectors. But, we can approximate V(s) by choosing a finite set of witness vectors W ⊂ R^P and storing V(s, w) = arg max_{v∈V(s)} (v · w) for each w ∈ W. V(s) is then approximated by the convex hull of {V(s, w) | w ∈ W}. If W samples the P-dimensional unit hypersphere densely enough, the maximum possible approximation error will be small. (In practice, each agent will probably want to pick W differently, to focus her computation on policies in the portion of the Pareto frontier where her own utility is relatively high.) As |W| increases, the error introduced at each step will go to zero. The approximate algorithm is given in more detail in the technical report [13]. Figure 3: Execution traces for our motion planning example. Left and center: with 2 witness vectors, the agents randomize between two selfish paths. Right: with 4–32 witnesses, the agents find a cooperative path. Steps where either player gets a goal are marked with ×. Figure 4: Supply chain management problem. In the left figure, Player 1 is about to deliver part D to the shop, while player 2 is at the warehouse which sells B. The right figure (value to player 1 vs. value to player 2) shows the tradeoff between accuracy and computation time. The solid curve is the Pareto frontier for sstart, as computed using 8 witnesses per state. The dashed and dotted lines were computed using 2 and 4 witnesses, respectively. Dots indicate computed value vectors; × marks indicate the Nash points. 5 EXPERIMENTS We tested our value iteration algorithm and negotiation procedure on two robotic planning domains: a joint motion planning problem and a supply-chain management problem. In our motion planning problem (Fig. 3), two players together control a two-wheeled robot, with each player picking the rotational velocity for one wheel. Each player has a list of goal landmarks which she wants to cycle through, but the two players can have different lists of goals. We discretized states based on X, Y, θ and the current goals, and discretized actions into stop, slow (0.45 m/s), and fast (0.9 m/s), for 9 joint actions and about 25,000 states. We discretized time at ∆t = 1 s, and set γ = 0.99. For both the disagreement policy and all punishment policies, we used "always stop," since by keeping her wheel stopped either player can prevent the robot from moving. Planning took a few hours of wall clock time on a desktop workstation for 32 witnesses per state. Based on the planner's output, we ran our negotiation protocol to select an equilibrium. Fig. 3 shows the results: with limited computation the players pick two selfish paths and randomize equally between them, while with more computation they find the cooperative path.
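A rough sketch of the witness-vector approximation just described, combined with the backup of Fig. 2, might look as follows. Everything here is a hypothetical stand-in: the game arrays are random, V_dev is a random placeholder for Eq. (2), and the single-point optimistic initialization simplifies the box used in Fig. 2; a real implementation would also handle the case where no joint action is enforceable.

```python
import numpy as np
from itertools import product

n_states, n_act, P, gamma = 3, 2, 2, 0.9
joint = list(product(range(n_act), repeat=P))
rng = np.random.default_rng(2)
T = rng.dirichlet(np.ones(n_states), size=(n_states, len(joint)))   # T[s, a, s']
R = rng.uniform(0, 1, size=(n_states, len(joint), P))                # R[s, a, p]
V_dev = rng.uniform(0, 2, size=(n_states, len(joint), P))            # stand-in for Eq. (2)

angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)                # 8 witness directions
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)
V_hull = np.full((n_states, len(W), P), 10.0)                        # optimistic initialization
# V_hull[s, k] approximates V(s, w) = argmax_{v in V(s)} v . w for witness w = W[k].

def backup(V_hull):
    new = V_hull.copy()
    for s in range(n_states):
        for k, w in enumerate(W):
            # for each next state, the stored vector best aligned with this witness
            idx = np.argmax(V_hull @ w, axis=1)                  # (n_states,)
            cont = V_hull[np.arange(n_states), idx]              # (n_states, P)
            best, best_score = None, -np.inf
            for a in range(len(joint)):
                Q = R[s, a] + gamma * T[s, a] @ cont
                if np.all(Q >= V_dev[s, a]) and Q @ w > best_score:
                    best, best_score = Q, Q @ w                  # enforceable and best for w
            if best is not None:
                new[s, k] = best
    return new

for _ in range(50):
    V_hull = backup(V_hull)
print(V_hull[0])
```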
Our experiments also showed that limiting the computation available to one player allows the unrestricted player to reveal only some of the equilibria she knows about, tilting the outcome of the negotiation in her favor (see the technical report [13] for details). For our second experiment we examined a more realistic supply-chain problem. Here each player is a parts supplier competing for the business of an engine manufacturer. The manufacturer doesn’t store items and will only pay for parts which can be used immediately. Each player controls a truck which moves parts from warehouses to the assembly shop; she pays for parts when she picks them up, and receives payment on delivery. Each player gets parts from different locations at different prices and no one player can provide all of the parts the manufacturer needs. Each player’s truck can be at six locations along a line: four warehouse locations (each of which provides a different type of part), one empty location, and the assembly shop. Building an engine requires five parts, delivered in the order A, {B, C}, D, E (parts B and C can arrive in either order). After E, the manufacturer needs A again. Players can move left or right along the line at a small cost, or wait for free. They can also buy parts at a warehouse (dropping any previous cargo), or sell their cargo if they are at the shop and the manufacturer wants it. Each player can only carry one part at a time and only one player can make a delivery at a time. Finally, any player can retire and sell her truck; in this case the game ends and all players get the value of their truck plus any cargo. The disagreement policy is for all players to retire at all states. Fig. 4 shows the computed sets V(sstart ) for various numbers of witnesses. The more witnesses we use, the more accurately we represent the frontier, and the closer our final policy is to the true Nash point. All of the policies computed are “intelligent” and “cooperative”: a human observer would not see obvious ways to improve them, and in fact would say that they look similar despite their differing payoffs. Players coordinate their motions, so that one player will drive out to buy part E while the other delivers part D. They sit idle only in order to delay the purchase of a part which would otherwise be delivered too soon. 6 CONCLUSION Real-world planning problems involve negotiation among multiple agents with varying goals. To take all agents incentives into account, the agents should find and agree on Paretodominant subgame-perfect Nash equilibria. For this purpose, we presented efficient planning and negotiation algorithms for general-sum stochastic games, and tested them on two robotic planning problems. References [1] V. Conitzer and T. Sandholm. Complexity results about Nash equilibria. Technical Report CMU-CS-02-135, School of Computer Science, Carnegie-Mellon University, 2002. [2] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Massachusetts, 1995. [3] Prajit K. Dutta. A folk theorem for stochastic games. Journal of Economic Theory, 66:1–32, 1995. [4] Yevgeniy Dodis, Shai Halevi, and Tal Rabin. A cryptographic solution to a game theoretic problem. In Lecture Notes in Computer Science, volume 1880, page 112. Springer, Berlin, 2000. [5] Michael L. Littman and Peter Stone. A polynomial-time Nash equilibrium algorithm for repeated games. In ACM Conference on Electronic Commerce, pages 48–54. ACM, 2003. [6] E. Hansen, D. Bernstein, and S. Zilberstein. 
Dynamic programming for partially observable stochastic games. In Proceedings of the Nineteenth National Conference on Artificial Intelligence, pages 709–715, 2004. [7] Ulrich Doraszelski and Kenneth L. Judd. Avoiding the curse of dimensionality in dynamic stochastic games. NBER Technical Working Paper No. 304, January 2005. [8] R. Brafman and M. Tennenholtz. Efficient learning equilibrium. Artificial Intelligence, 2004. [9] D Fudenberg and E. Maskin. The folk theorem in repeated games with discounting or with incomplete information. Econometrica, 1986. [10] David Levine. The castle on the hill. Review of Economic Dynamics, 3(2):330–337, 2000. [11] Ariel Rubinstein. Perfect equilibrium in a bargaining model. Econometrica, 50(1):97–109, 1982. [12] V. Krishna and R. Serrano. Multilateral bargaining. Review of Economic Studies, 1996. [13] Chris Murray and Geoffrey J. Gordon. Multi-robot negotiation: approximating the set of subgame perfect equilibria in general-sum stochastic games. Technical Report CMU-ML-06114, Carnegie Mellon University, 2006.

4 0.1184791 81 nips-2006-Game Theoretic Algorithms for Protein-DNA binding

Author: Luis Pérez-breva, Luis E. Ortiz, Chen-hsiang Yeang, Tommi S. Jaakkola

Abstract: We develop and analyze game-theoretic algorithms for predicting coordinate binding of multiple DNA binding regulators. The allocation of proteins to local neighborhoods and to sites is carried out with resource constraints while explicating competing and coordinate binding relations among proteins with affinity to the site or region. The focus of this paper is on mathematical foundations of the approach. We also briefly demonstrate the approach in the context of the λ-phage switch. 1

5 0.094837584 61 nips-2006-Convex Repeated Games and Fenchel Duality

Author: Shai Shalev-shwartz, Yoram Singer

Abstract: We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can be all derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization. 1 Introduction and Problem Setting Several problems arising in machine learning can be modeled as a convex repeated game. Convex repeated games are closely related to online convex programming (see [19, 9] and the discussion in the last section). A convex repeated game is a two players game that is performed in a sequence of consecutive rounds. On round t of the repeated game, the first player chooses a vector wt from a convex set S. Next, the second player responds with a convex function gt : S → R. Finally, the first player suffers an instantaneous loss gt (wt ). We study the game from the viewpoint of the first player. The goal of the first player is to minimize its cumulative loss, t gt (wt ). To motivate this rather abstract setting let us first cast the more familiar setting of online learning as a convex repeated game. Online learning is performed in a sequence of consecutive rounds. On round t, the learner first receives a question, cast as a vector xt , and is required to provide an answer for this question. For example, xt can be an encoding of an email message and the question is whether the email is spam or not. The prediction of the learner is performed based on an hypothesis, ht : X → Y, where X is the set of questions and Y is the set of possible answers. In the aforementioned example, Y would be {+1, −1} where +1 stands for a spam email and −1 stands for a benign one. After predicting an answer, the learner receives the correct answer for the question, denoted yt , and suffers loss according to a loss function (ht , (xt , yt )). In most cases, the hypotheses used for prediction come from a parameterized set of hypotheses, H = {hw : w ∈ S}. For example, the set of linear classifiers, which is used for answering yes/no questions, is defined as H = {hw (x) = sign( w, x ) : w ∈ Rn }. Thus, rather than saying that on round t the learner chooses a hypothesis, we can say that the learner chooses a vector wt and its hypothesis is hwt . Next, we note that once the environment chooses a question-answer pair (xt , yt ), the loss function becomes a function over the hypotheses space or equivalently over the set of parameter vectors S. We can therefore redefine the online learning process as follows. On round t, the learner chooses a vector wt ∈ S, which defines a hypothesis hwt to be used for prediction. Then, the environment chooses a questionanswer pair (xt , yt ), which induces the following loss function over the set of parameter vectors, gt (w) = (hw , (xt , yt )). Finally, the learner suffers the loss gt (wt ) = (hwt , (xt , yt )). We have therefore described the process of online learning as a convex repeated game. In this paper we assess the performance of the first player using the notion of regret. Given a number of rounds T and a fixed vector u ∈ S, we define the regret of the first player as the excess loss for not consistently playing the vector u, 1 T T gt (wt ) − t=1 1 T T gt (u) . 
t=1 Our main result is an algorithmic framework for the first player which guarantees low regret with respect to any vector u ∈ S. Specifically, we derive regret bounds that take the following form ∀u ∈ S, 1 T T gt (wt ) − t=1 1 T T gt (u) ≤ t=1 f (u) + L √ , T (1) where f : S → R and L ∈ R+ . Informally, the function f measures the “complexity” of vectors in S and the scalar L is related to some generalized Lipschitz property of the functions g1 , . . . , gT . We defer the exact requirements we impose on f and L to later sections. Our algorithmic framework emerges from a representation of the regret bound given in Eq. (1) using an optimization problem. Specifically, we rewrite Eq. (1) as follows 1 T T gt (wt ) ≤ inf t=1 u∈S 1 T T gt (u) + t=1 f (u) + L √ . T (2) That is, the average loss of the first player should be bounded above by the minimum value of an optimization problem in which we jointly minimize the average loss of u and the “complexity” of u as measured by the function f . Note that the optimization problem on the right-hand side of Eq. (2) can only be solved in hindsight after observing the entire sequence of loss functions. Nevertheless, writing the regret bound as in Eq. (2) implies that the average loss of the first player forms a lower bound for a minimization problem. The notion of duality, commonly used in convex optimization theory, plays an important role in obtaining lower bounds for the minimal value of a minimization problem (see for example [14]). By generalizing the notion of Fenchel duality, we are able to derive a dual optimization problem, which can be optimized incrementally, as the game progresses. In order to derive explicit quantitative regret bounds we make an immediate use of the fact that dual objective lower bounds the primal objective. We therefore reduce the process of playing convex repeated games to the task of incrementally increasing the dual objective function. The amount by which the dual increases serves as a new and natural notion of progress. By doing so we are able to tie the primal objective value, the average loss of the first player, and the increase in the dual. The rest of this paper is organized as follows. In Sec. 2 we establish our notation and point to a few mathematical tools that we use throughout the paper. Our main tool for deriving algorithms for playing convex repeated games is a generalization of Fenchel duality, described in Sec. 3. Our algorithmic framework is given in Sec. 4 and analyzed in Sec. 5. The generality of our framework allows us to utilize it in different problems arising in machine learning. Specifically, in Sec. 6 we underscore the applicability of our framework for online learning and in Sec. 7 we outline and analyze boosting algorithms based on our framework. We conclude with a discussion and point to related work in Sec. 8. Due to the lack of space, some of the details are omitted from the paper and can be found in [16]. 2 Mathematical Background We denote scalars with lower case letters (e.g. x and w), and vectors with bold face letters (e.g. x and w). The inner product between vectors x and w is denoted by x, w . Sets are designated by upper case letters (e.g. S). The set of non-negative real numbers is denoted by R+ . For any k ≥ 1, the set of integers {1, . . . , k} is denoted by [k]. A norm of a vector x is denoted by x . The dual norm is defined as λ = sup{ x, λ : x ≤ 1}. 
For example, the Euclidean norm, x 2 = ( x, x )1/2 is dual to itself and the 1 norm, x 1 = i |xi |, is dual to the ∞ norm, x ∞ = maxi |xi |. We next recall a few definitions from convex analysis. The reader familiar with convex analysis may proceed to Lemma 1 while for a more thorough introduction see for example [1]. A set S is convex if for any two vectors w1 , w2 in S, all the line between w1 and w2 is also within S. That is, for any α ∈ [0, 1] we have that αw1 + (1 − α)w2 ∈ S. A set S is open if every point in S has a neighborhood lying in S. A set S is closed if its complement is an open set. A function f : S → R is closed and convex if for any scalar α ∈ R, the level set {w : f (w) ≤ α} is closed and convex. The Fenchel conjugate of a function f : S → R is defined as f (θ) = supw∈S w, θ − f (w) . If f is closed and convex then the Fenchel conjugate of f is f itself. The Fenchel-Young inequality states that for any w and θ we have that f (w) + f (θ) ≥ w, θ . A vector λ is a sub-gradient of a function f at w if for all w ∈ S we have that f (w ) − f (w) ≥ w − w, λ . The differential set of f at w, denoted ∂f (w), is the set of all sub-gradients of f at w. If f is differentiable at w then ∂f (w) consists of a single vector which amounts to the gradient of f at w and is denoted by f (w). Sub-gradients play an important role in the definition of Fenchel conjugate. In particular, the following lemma states that if λ ∈ ∂f (w) then Fenchel-Young inequality holds with equality. Lemma 1 Let f be a closed and convex function and let ∂f (w ) be its differential set at w . Then, for all λ ∈ ∂f (w ) we have, f (w ) + f (λ ) = λ , w . A continuous function f is σ-strongly convex over a convex set S with respect to a norm · if S is contained in the domain of f and for all v, u ∈ S and α ∈ [0, 1] we have 1 (3) f (α v + (1 − α) u) ≤ α f (v) + (1 − α) f (u) − σ α (1 − α) v − u 2 . 2 Strongly convex functions play an important role in our analysis primarily due to the following lemma. Lemma 2 Let · be a norm over Rn and let · be its dual norm. Let f be a σ-strongly convex function on S and let f be its Fenchel conjugate. Then, f is differentiable with f (θ) = arg maxx∈S θ, x − f (x). Furthermore, for any θ, λ ∈ Rn we have 1 f (θ + λ) − f (θ) ≤ f (θ), λ + λ 2 . 2σ Two notable examples of strongly convex functions which we use are as follows. 1 Example 1 The function f (w) = 2 w norm. Its conjugate function is f (θ) = 2 2 1 2 is 1-strongly convex over S = Rn with respect to the θ 2. 2 2 n 1 Example 2 The function f (w) = i=1 wi log(wi / n ) is 1-strongly convex over the probabilistic n simplex, S = {w ∈ R+ : w 1 = 1}, with respect to the 1 norm. Its conjugate function is n 1 f (θ) = log( n i=1 exp(θi )). 3 Generalized Fenchel Duality In this section we derive our main analysis tool. We start by considering the following optimization problem, T inf c f (w) + t=1 gt (w) , w∈S where c is a non-negative scalar. An equivalent problem is inf w0 ,w1 ,...,wT c f (w0 ) + T t=1 gt (wt ) s.t. w0 ∈ S and ∀t ∈ [T ], wt = w0 . Introducing T vectors λ1 , . . . , λT , each λt ∈ Rn is a vector of Lagrange multipliers for the equality constraint wt = w0 , we obtain the following Lagrangian T T L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) = c f (w0 ) + t=1 gt (wt ) + t=1 λt , w0 − wt . The dual problem is the task of maximizing the following dual objective value, D(λ1 , . . . , λT ) = inf L(w0 , w1 , . . . , wT , λ1 , . . . 
, λT ) w0 ∈S,w1 ,...,wT = − c sup w0 ∈S = −c f −1 c w0 , − 1 c T t=1 T t=1 λt − λt − f (w0 ) − T t=1 gt (λt ) , T t=1 sup ( wt , λt − gt (wt )) wt where, following the exposition of Sec. 2, f , g1 , . . . , gT are the Fenchel conjugate functions of f, g1 , . . . , gT . Therefore, the generalized Fenchel dual problem is sup − cf λ1 ,...,λT −1 c T t=1 λt − T t=1 gt (λt ) . (4) Note that when T = 1 and c = 1, the above duality is the so called Fenchel duality. 4 A Template Learning Algorithm for Convex Repeated Games In this section we describe a template learning algorithm for playing convex repeated games. As mentioned before, we study convex repeated games from the viewpoint of the first player which we shortly denote as P1. Recall that we would like our learning algorithm to achieve a regret bound of the form given in Eq. (2). We start by rewriting Eq. (2) as follows T m gt (wt ) − c L ≤ inf u∈S t=1 c f (u) + gt (u) , (5) t=1 √ where c = T . Thus, up to the sublinear term c L, the cumulative loss of P1 lower bounds the optimum of the minimization problem on the right-hand side of Eq. (5). In the previous section we derived the generalized Fenchel dual of the right-hand side of Eq. (5). Our construction is based on the weak duality theorem stating that any value of the dual problem is smaller than the optimum value of the primal problem. The algorithmic framework we propose is therefore derived by incrementally ascending the dual objective function. Intuitively, by ascending the dual objective we move closer to the optimal primal value and therefore our performance becomes similar to the performance of the best fixed weight vector which minimizes the right-hand side of Eq. (5). Initially, we use the elementary dual solution λ1 = 0 for all t. We assume that inf w f (w) = 0 and t for all t inf w gt (w) = 0 which imply that D(λ1 , . . . , λ1 ) = 0. We assume in addition that f is 1 T σ-strongly convex. Therefore, based on Lemma 2, the function f is differentiable. At trial t, P1 uses for prediction the vector wt = f −1 c T i=1 λt i . (6) After predicting wt , P1 receives the function gt and suffers the loss gt (wt ). Then, P1 updates the dual variables as follows. Denote by ∂t the differential set of gt at wt , that is, ∂t = {λ : ∀w ∈ S, gt (w) − gt (wt ) ≥ λ, w − wt } . (7) The new dual variables (λt+1 , . . . , λt+1 ) are set to be any set of vectors which satisfy the following 1 T two conditions: (i). ∃λ ∈ ∂t s.t. D(λt+1 , . . . , λt+1 ) ≥ D(λt , . . . , λt , λ , λt , . . . , λt ) 1 1 t−1 t+1 T T (ii). ∀i > t, λt+1 = 0 i . (8) In the next section we show that condition (i) ensures that the increase of the dual at trial t is proportional to the loss gt (wt ). The second condition ensures that we can actually calculate the dual at trial t without any knowledge on the yet to be seen loss functions gt+1 , . . . , gT . We conclude this section with two update rules that trivially satisfy the above two conditions. The first update scheme simply finds λ ∈ ∂t and set λt+1 = i λ λt i if i = t if i = t . (9) The second update defines (λt+1 , . . . , λt+1 ) = argmax D(λ1 , . . . , λT ) 1 T λ1 ,...,λT s.t. ∀i = t, λi = λt . i (10) 5 Analysis In this section we analyze the performance of the template algorithm given in the previous section. Our proof technique is based on monitoring the value of the dual objective function. The main result is the following lemma which gives upper and lower bounds for the final value of the dual objective function. 
Lemma 3 Let f be a σ-strongly convex function with respect to a norm · over a set S and assume that minw∈S f (w) = 0. Let g1 , . . . , gT be a sequence of convex and closed functions such that inf w gt (w) = 0 for all t ∈ [T ]. Suppose that a dual-incrementing algorithm which satisfies the conditions of Eq. (8) is run with f as a complexity function on the sequence g1 , . . . , gT . Let w1 , . . . , wT be the sequence of primal vectors that the algorithm generates and λT +1 , . . . , λT +1 1 T be its final sequence of dual variables. Then, there exists a sequence of sub-gradients λ1 , . . . , λT , where λt ∈ ∂t for all t, such that T 1 gt (wt ) − 2σc t=1 T T λt 2 ≤ D(λT +1 , . . . , λT +1 ) 1 T t=1 ≤ inf c f (w) + w∈S gt (w) . t=1 Proof The second inequality follows directly from the weak duality theorem. Turning to the left most inequality, denote ∆t = D(λt+1 , . . . , λt+1 ) − D(λt , . . . , λt ) and note that 1 1 T T T D(λ1 +1 , . . . , λT +1 ) can be rewritten as T T t=1 D(λT +1 , . . . , λT +1 ) = 1 T T t=1 ∆t − D(λ1 , . . . , λ1 ) = 1 T ∆t , (11) where the last equality follows from the fact that f (0) = g1 (0) = . . . = gT (0) = 0. The definition of the update implies that ∆t ≥ D(λt , . . . , λt , λt , 0, . . . , 0) − D(λt , . . . , λt , 0, 0, . . . , 0) for 1 t−1 1 t−1 t−1 some subgradient λt ∈ ∂t . Denoting θ t = − 1 j=1 λj , we now rewrite the lower bound on ∆t as, c ∆t ≥ −c (f (θ t − λt /c) − f (θ t )) − gt (λt ) . Using Lemma 2 and the definition of wt we get that 1 (12) ∆t ≥ wt , λt − gt (λt ) − 2 σ c λt 2 . Since λt ∈ ∂t and since we assume that gt is closed and convex, we can apply Lemma 1 to get that wt , λt − gt (λt ) = gt (wt ). Plugging this equality into Eq. (12) and summing over t we obtain that T T T 1 2 . t=1 ∆t ≥ t=1 gt (wt ) − 2 σ c t=1 λt Combining the above inequality with Eq. (11) concludes our proof. The following regret bound follows as a direct corollary of Lemma 3. T 1 Theorem 1 Under the same conditions of Lemma 3. Denote L = T t=1 λt w ∈ S we have, T T c f (w) 1 1 + 2L c . t=1 gt (wt ) − T t=1 gt (w) ≤ T T σ √ In particular, if c = T , we obtain the bound, 1 T 6 T t=1 gt (wt ) − 1 T T t=1 gt (w) ≤ f (w)+L/(2 σ) √ T 2 . Then, for all . Application to Online learning In Sec. 1 we cast the task of online learning as a convex repeated game. We now demonstrate the applicability of our algorithmic framework for the problem of instance ranking. We analyze this setting since several prediction problems, including binary classification, multiclass prediction, multilabel prediction, and label ranking, can be cast as special cases of the instance ranking problem. Recall that on each online round, the learner receives a question-answer pair. In instance ranking, the question is encoded by a matrix Xt of dimension kt × n and the answer is a vector yt ∈ Rkt . The semantic of yt is as follows. For any pair (i, j), if yt,i > yt,j then we say that yt ranks the i’th row of Xt ahead of the j’th row of Xt . We also interpret yt,i − yt,j as the confidence in which the i’th row should be ranked ahead of the j’th row. For example, each row of Xt encompasses a representation of a movie while yt,i is the movie’s rating, expressed as the number of stars this movie has received by a movie reviewer. The predictions of the learner are determined ˆ based on a weight vector wt ∈ Rn and are defined to be yt = Xt wt . Finally, let us define two loss functions for ranking, both generalize the hinge-loss used in binary classification problems. Denote by Et the set {(i, j) : yt,i > yt,j }. 
For all (i, j) ∈ Et we define a pair-based hinge-loss i,j (w; (Xt , yt )) = [(yt,i − yt,j ) − w, xt,i − xt,j ]+ , where [a]+ = max{a, 0} and xt,i , xt,j are respectively the i’th and j’th rows of Xt . Note that i,j is zero if w ranks xt,i higher than xt,j with a sufficient confidence. Ideally, we would like i,j (wt ; (Xt , yt )) to be zero for all (i, j) ∈ Et . If this is not the case, we are being penalized according to some combination of the pair-based losses i,j . For example, we can set (w; (Xt , yt )) to be the average over the pair losses, 1 avg (w; (Xt , yt )) = |Et | (i,j)∈Et i,j (w; (Xt , yt )) . This loss was suggested by several authors (see for example [18]). Another popular approach (see for example [5]) penalizes according to the maximal loss over the individual pairs, max (w; (Xt , yt )) = max(i,j)∈Et i,j (w; (Xt , yt )) . We can apply our algorithmic framework given in Sec. 4 for ranking, using for gt (w) either avg (w; (Xt , yt )) or max (w; (Xt , yt )). The following theorem provides us with a sufficient condition under which the regret bound from Thm. 1 holds for ranking as well. Theorem 2 Let f be a σ-strongly convex function over S with respect to a norm · . Denote by Lt the maximum over (i, j) ∈ Et of xt,i − xt,j 2 . Then, for both gt (w) = avg (w; (Xt , yt )) and ∗ gt (w) = max (w; (Xt , yt )), the following regret bound holds ∀u ∈ S, 7 1 T T t=1 gt (wt ) − 1 T T t=1 gt (u) ≤ 1 f (u)+ T PT t=1 Lt /(2 σ) √ T . The Boosting Game In this section we describe the applicability of our algorithmic framework to the analysis of boosting algorithms. A boosting algorithm uses a weak learning algorithm that generates weak-hypotheses whose performances are just slightly better than random guessing to build a strong-hypothesis which can attain an arbitrarily low error. The AdaBoost algorithm, proposed by Freund and Schapire [6], receives as input a training set of examples {(x1 , y1 ), . . . , (xm , ym )} where for all i ∈ [m], xi is taken from an instance domain X , and yi is a binary label, yi ∈ {+1, −1}. The boosting process proceeds in a sequence of consecutive trials. At trial t, the booster first defines a distribution, denoted wt , over the set of examples. Then, the booster passes the training set along with the distribution wt to the weak learner. The weak learner is assumed to return a hypothesis ht : X → {+1, −1} whose average error is slightly smaller than 1 . That is, there exists a constant γ > 0 such that, 2 def m 1−yi ht (xi ) = ≤ 1 −γ . (13) i=1 wt,i 2 2 The goal of the boosting algorithm is to invoke the weak learner several times with different distributions, and to combine the hypotheses returned by the weak learner into a final, so called strong, hypothesis whose error is small. The final hypothesis combines linearly the T hypotheses returned by the weak learner with coefficients α1 , . . . , αT , and is defined to be the sign of hf (x) where T hf (x) = t=1 αt ht (x) . The coefficients α1 , . . . , αT are determined by the booster. In Ad1 1 aBoost, the initial distribution is set to be the uniform distribution, w1 = ( m , . . . , m ). At iter1 ation t, the value of αt is set to be 2 log((1 − t )/ t ). The distribution is updated by the rule wt+1,i = wt,i exp(−αt yi ht (xi ))/Zt , where Zt is a normalization factor. Freund and Schapire [6] have shown that under the assumption given in Eq. (13), the error of the final strong hypothesis is at most exp(−2 γ 2 T ). 
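Before the boosting game continues, here is a small sketch of the two ranking losses defined in Sec. 6 above; the instance matrix X_t, the target scores y_t, and the weight vector are all hypothetical stand-ins for this illustration.

```python
import numpy as np

def pair_losses(w, X, y):
    """Return (l_avg, l_max) over the ordered pairs E_t = {(i, j) : y_i > y_j}."""
    E = [(i, j) for i in range(len(y)) for j in range(len(y)) if y[i] > y[j]]
    losses = [max(0.0, (y[i] - y[j]) - w @ (X[i] - X[j])) for (i, j) in E]
    return np.mean(losses), np.max(losses)

rng = np.random.default_rng(4)
X = rng.normal(size=(4, 6))            # k_t = 4 rows, n = 6 features
y = np.array([3.0, 1.0, 2.0, 1.0])     # e.g. star ratings
w = rng.normal(size=6)
print(pair_losses(w, X, y))
```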
t Several authors [15, 13, 8, 4] have proposed to view boosting as a coordinate-wise greedy optimization process. To do so, note first that hf errs on an example (x, y) iff y hf (x) ≤ 0. Therefore, the exp-loss function, defined as exp(−y hf (x)), is a smooth upper bound of the zero-one error, which equals to 1 if y hf (x) ≤ 0 and to 0 otherwise. Thus, we can restate the goal of boosting as minimizing the average exp-loss of hf over the training set with respect to the variables α1 , . . . , αT . To simplify our derivation in the sequel, we prefer to say that boosting maximizes the negation of the loss, that is, T m 1 (14) max − m i=1 exp −yi t=1 αt ht (xi ) . α1 ,...,αT In this view, boosting is an optimization procedure which iteratively maximizes Eq. (14) with respect to the variables α1 , . . . , αT . This view of boosting, enables the hypotheses returned by the weak learner to be general functions into the reals, ht : X → R (see for instance [15]). In this paper we view boosting as a convex repeated game between a booster and a weak learner. To motivate our construction, we would like to note that boosting algorithms define weights in two different domains: the vectors wt ∈ Rm which assign weights to examples and the weights {αt : t ∈ [T ]} over weak-hypotheses. In the terminology used throughout this paper, the weights wt ∈ Rm are primal vectors while (as we show in the sequel) each weight αt of the hypothesis ht is related to a dual vector λt . In particular, we show that Eq. (14) is exactly the Fenchel dual of a primal problem for a convex repeated game, thus the algorithmic framework described thus far for playing games naturally fits the problem of iteratively solving Eq. (14). To derive the primal problem whose Fenchel dual is the problem given in Eq. (14) let us first denote by vt the vector in Rm whose ith element is vt,i = yi ht (xi ). For all t, we set gt to be the function gt (w) = [ w, vt ]+ . Intuitively, gt penalizes vectors w which assign large weights to examples which are predicted accurately, that is yi ht (xi ) > 0. In particular, if ht (xi ) ∈ {+1, −1} and wt is a distribution over the m examples (as is the case in AdaBoost), gt (wt ) reduces to 1 − 2 t (see Eq. (13)). In this case, minimizing gt is equivalent to maximizing the error of the individual T hypothesis ht over the examples. Consider the problem of minimizing c f (w) + t=1 gt (w) where f (w) is the relative entropy given in Example 2 and c = 1/(2 γ) (see Eq. (13)). To derive its Fenchel dual, we note that gt (λt ) = 0 if there exists βt ∈ [0, 1] such that λt = βt vt and otherwise gt (λt ) = ∞ (see [16]). In addition, let us define αt = 2 γ βt . Since our goal is to maximize the αt dual, we can restrict λt to take the form λt = βt vt = 2 γ vt , and get that D(λ1 , . . . , λT ) = −c f − 1 c T βt vt t=1 =− 1 log 2γ 1 m m e− PT t=1 αt yi ht (xi ) . (15) i=1 Minimizing the exp-loss of the strong hypothesis is therefore the dual problem of the following primal minimization problem: find a distribution over the examples, whose relative entropy to the uniform distribution is as small as possible while the correlation of the distribution with each vt is as small as possible. Since the correlation of w with vt is inversely proportional to the error of ht with respect to w, we obtain that in the primal problem we are trying to maximize the error of each individual hypothesis, while in the dual problem we minimize the global error of the strong hypothesis. 
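A minimal sketch of the AdaBoost update recalled above (Sec. 7), using axis-aligned decision stumps as a stand-in weak learner on synthetic data; the data, the stump learner, and the number of rounds are assumptions of this sketch, not the paper's setup. It also prints the average exp-loss, the quantity that Eq. (14) says boosting drives down.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, T = 200, 5, 30
X = rng.normal(size=(m, n))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])

def best_stump(w):
    """Predictions of the stump h(x) = s * sign(x_j - thr) with lowest weighted error."""
    best = None
    for j in range(n):
        for thr in X[:, j]:
            for s in (1.0, -1.0):
                pred = s * np.where(X[:, j] > thr, 1.0, -1.0)
                err = w @ (pred != y)
                if best is None or err < best[0]:
                    best = (err, pred)
    return best

w = np.full(m, 1.0 / m)        # uniform initial distribution w_1
H = np.zeros(m)                # running scores of the strong hypothesis h_f
for t in range(T):
    err, pred = best_stump(w)
    err = np.clip(err, 1e-12, 1 - 1e-12)
    alpha = 0.5 * np.log((1 - err) / err)
    H += alpha * pred
    w = w * np.exp(-alpha * y * pred)
    w /= w.sum()               # divide by the normalizer Z_t

print("training error:", np.mean(np.sign(H) != y))
print("average exp-loss:", np.mean(np.exp(-y * H)))   # the objective negated in Eq. (14)
```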
The intuition of finding distributions which in retrospect result in large error rates of individual hypotheses was also alluded in [15, 8]. We can now apply our algorithmic framework from Sec. 4 to boosting. We describe the game αt with the parameters αt , where αt ∈ [0, 2 γ], and underscore that in our case, λt = 2 γ vt . At the beginning of the game the booster sets all dual variables to be zero, ∀t αt = 0. At trial t of the boosting game, the booster first constructs a primal weight vector wt ∈ Rm , which assigns importance weights to the examples in the training set. The primal vector wt is constructed as in Eq. (6), that is, wt = f (θ t ), where θ t = − i αi vi . Then, the weak learner responds by presenting the loss function gt (w) = [ w, vt ]+ . Finally, the booster updates the dual variables so as to increase the dual objective function. It is possible to show that if the range of ht is {+1, −1} 1 then the update given in Eq. (10) is equivalent to the update αt = min{2 γ, 2 log((1 − t )/ t )}. We have thus obtained a variant of AdaBoost in which the weights αt are capped above by 2 γ. A disadvantage of this variant is that we need to know the parameter γ. We would like to note in passing that this limitation can be lifted by a different definition of the functions gt . We omit the details due to the lack of space. To analyze our game of boosting, we note that the conditions given in Lemma 3 holds T and therefore the left-hand side inequality given in Lemma 3 tells us that t=1 gt (wt ) − T T +1 T +1 1 2 , . . . , λT ) . The definition of gt and the weak learnability ast=1 λt ∞ ≤ D(λ1 2c sumption given in Eq. (13) imply that wt , vt ≥ 2 γ for all t. Thus, gt (wt ) = wt , vt ≥ 2 γ which also implies that λt = vt . Recall that vt,i = yi ht (xi ). Assuming that the range of ht is [+1, −1] we get that λt ∞ ≤ 1. Combining all the above with the left-hand side inequality T given in Lemma 3 we get that 2 T γ − 2 c ≤ D(λT +1 , . . . , λT +1 ). Using the definition of D (see 1 T Eq. (15)), the value c = 1/(2 γ), and rearranging terms we recover the original bound for AdaBoost PT 2 m 1 −yi t=1 αt ht (xi ) ≤ e−2 γ T . i=1 e m 8 Related Work and Discussion We presented a new framework for designing and analyzing algorithms for playing convex repeated games. Our framework was used for the analysis of known algorithms for both online learning and boosting settings. The framework also paves the way to new algorithms. In a previous paper [17], we suggested the use of duality for the design of online algorithms in the context of mistake bound analysis. The contribution of this paper over [17] is three fold as we now briefly discuss. First, we generalize the applicability of the framework beyond the specific setting of online learning with the hinge-loss to the general setting of convex repeated games. The setting of convex repeated games was formally termed “online convex programming” by Zinkevich [19] and was first presented by Gordon in [9]. There is voluminous amount of work on unifying approaches for deriving online learning algorithms. We refer the reader to [11, 12, 3] for work closely related to the content of this paper. By generalizing our previously studied algorithmic framework [17] beyond online learning, we can automatically utilize well known online learning algorithms, such as the EG and p-norm algorithms [12, 11], to the setting of online convex programming. 
We would like to note that the algorithms presented in [19] can be derived as special cases of our algorithmic framework 1 by setting f (w) = 2 w 2 . Parallel and independently to this work, Gordon [10] described another algorithmic framework for online convex programming that is closely related to the potential based algorithms described by Cesa-Bianchi and Lugosi [3]. Gordon also considered the problem of defining appropriate potential functions. Our work generalizes some of the theorems in [10] while providing a somewhat simpler analysis. Second, the usage of generalized Fenchel duality rather than the Lagrange duality given in [17] enables us to analyze boosting algorithms based on the framework. Many authors derived unifying frameworks for boosting algorithms [13, 8, 4]. Nonetheless, our general framework and the connection between game playing and Fenchel duality underscores an interesting perspective of both online learning and boosting. We believe that this viewpoint has the potential of yielding new algorithms in both domains. Last, despite the generality of the framework introduced in this paper, the resulting analysis is more distilled than the earlier analysis given in [17] for two reasons. (i) The usage of Lagrange duality in [17] is somehow restricted while the notion of generalized Fenchel duality is more appropriate to the general and broader problems we consider in this paper. (ii) The strongly convex property we employ both simplifies the analysis and enables more intuitive conditions in our theorems. There are various possible extensions of the work that we did not pursue here due to the lack of space. For instanc, our framework can naturally be used for the analysis of other settings such as repeated games (see [7, 19]). The applicability of our framework to online learning can also be extended to other prediction problems such as regression and sequence prediction. Last, we conjecture that our primal-dual view of boosting will lead to new methods for regularizing boosting algorithms, thus improving their generalization capabilities. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. M. Collins, R.E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 2002. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. JMLR, 7, Mar 2006. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995. Y. Freund and R.E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996. J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 2000. G. Gordon. Regret bounds for prediction problems. In COLT, 1999. G. Gordon. No-regret algorithms for online convex programs. In NIPS, 2006. A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 2001. J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Journal of Machine Learning, 45(3),2001. L. Mason, J. Baxter, P. Bartlett, and M. Frean. 
Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999. Y. Nesterov. Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2005. R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1–40, 1999. S. Shalev-Shwartz and Y. Singer. Convex repeated games and fenchel duality. Technical report, The Hebrew University, 2006. S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In COLT, 2006. J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In ESANN, April 1999. M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.
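As an end-to-end illustration of the convex-repeated-games excerpt above: Sec. 8 notes that choosing f(w) = ½‖w‖² recovers online gradient descent, so the sketch below runs that special case on a synthetic stream of hinge losses and reports the average regret against a fixed comparator. The data, the comparator, and the step size are all assumptions of the sketch, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 2000, 5
u_comp = rng.normal(size=n)                 # a fixed comparator vector u
X = rng.normal(size=(T, n))
y = np.sign(X @ u_comp)

def hinge(w, x, yi):                        # g_t(w) = [1 - y_t <w, x_t>]_+
    return max(0.0, 1.0 - yi * (w @ x))

w = np.zeros(n)
loss_alg, loss_comp = 0.0, 0.0
eta = 1.0 / np.sqrt(T)
for t in range(T):
    x, yi = X[t], y[t]
    loss_alg += hinge(w, x, yi)
    loss_comp += hinge(u_comp, x, yi)
    grad = -yi * x if hinge(w, x, yi) > 0 else np.zeros(n)   # a subgradient of g_t at w
    w -= eta * grad

# Average regret; the theory bounds it by O(1/sqrt(T)) for this choice of f and eta.
print("average regret vs u_comp:", (loss_alg - loss_comp) / T)
```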

6 0.084708713 155 nips-2006-Optimal Single-Class Classification Strategies

7 0.079723634 146 nips-2006-No-regret Algorithms for Online Convex Programs

8 0.074949041 14 nips-2006-A Small World Threshold for Economic Network Formation

9 0.0647365 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation

10 0.062524691 196 nips-2006-TrueSkill™: A Bayesian Skill Rating System

11 0.05720716 35 nips-2006-Approximate inference using planar graph decomposition

12 0.051324341 134 nips-2006-Modeling Human Motion Using Binary Latent Variables

13 0.04941671 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

14 0.046453986 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models

15 0.044767883 59 nips-2006-Context dependent amplification of both rate and event-correlation in a VLSI network of spiking neurons

16 0.044526301 88 nips-2006-Greedy Layer-Wise Training of Deep Networks

17 0.044019569 57 nips-2006-Conditional mean field

18 0.043438002 133 nips-2006-Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model

19 0.042124167 18 nips-2006-A selective attention multi--chip system with dynamic synapses and spiking neurons

20 0.038126305 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.151), (1, -0.023), (2, -0.113), (3, -0.138), (4, 0.138), (5, -0.194), (6, -0.142), (7, 0.244), (8, -0.218), (9, 0.075), (10, -0.117), (11, 0.124), (12, -0.02), (13, -0.008), (14, -0.026), (15, 0.047), (16, -0.011), (17, -0.05), (18, 0.03), (19, 0.081), (20, -0.009), (21, -0.076), (22, -0.045), (23, 0.054), (24, -0.01), (25, 0.027), (26, 0.051), (27, 0.024), (28, -0.056), (29, -0.1), (30, 0.003), (31, 0.007), (32, 0.019), (33, -0.013), (34, -0.024), (35, 0.023), (36, 0.004), (37, -0.049), (38, -0.064), (39, 0.015), (40, 0.038), (41, -0.06), (42, -0.075), (43, 0.011), (44, -0.005), (45, 0.046), (46, 0.067), (47, 0.013), (48, 0.033), (49, -0.057)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9406653 13 nips-2006-A Scalable Machine Learning Approach to Go

Author: Lin Wu, Pierre F. Baldi

Abstract: Go is an ancient board game that poses unique opportunities and challenges for AI and machine learning. Here we develop a machine learning approach to Go, and related board games, focusing primarily on the problem of learning a good evaluation function in a scalable way. Scalability is essential at multiple levels, from the library of local tactical patterns, to the integration of patterns across the board, to the size of the board itself. The system we propose is capable of automatically learning the propensity of local patterns from a library of games. Propensity and other local tactical information are fed into a recursive neural network, derived from a Bayesian network architecture. The network integrates local information across the board and produces local outputs that represent local territory ownership probabilities. The aggregation of these probabilities provides an effective strategic evaluation function that is an estimate of the expected area at the end (or at other stages) of the game. Local area targets for training can be derived from datasets of human games. A system trained using only 9 × 9 amateur game data performs surprisingly well on a test set derived from 19 × 19 professional game data. Possible directions for further improvements are briefly discussed. 1
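As a small illustration of the aggregation step described in this entry, the sketch below turns per-intersection ownership probabilities into an expected-area evaluation for Black. The board size, the uniform probability values, the komi handling, and the assumption that every point ends up owned by one of the two players are hypothetical choices, not details taken from the paper.

import numpy as np

def expected_area(ownership_probs, komi=0.0):
    # ownership_probs[i, j]: estimated probability that Black owns intersection (i, j)
    # at the end of the game (a local output of the evaluation network).
    # Assumes every intersection is eventually owned by one of the two players.
    p = np.asarray(ownership_probs, dtype=float)
    black = p.sum()            # expected number of points owned by Black
    white = p.size - black     # remaining points are credited to White
    return black - white - komi

# Hypothetical 9 x 9 example with uniform 0.55 Black-ownership estimates.
probs = np.full((9, 9), 0.55)
print(expected_area(probs, komi=6.5))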

2 0.91027462 26 nips-2006-An Approach to Bounded Rationality

Author: Eli Ben-sasson, Ehud Kalai, Adam Kalai

Abstract: A central question in game theory and artificial intelligence is how a rational agent should behave in a complex environment, given that it cannot perform unbounded computations. We study strategic aspects of this question by formulating a simple model of a game with additional costs (computational or otherwise) for each strategy. First we connect this to zero-sum games, proving a counter-intuitive generalization of the classic min-max theorem to zero-sum games with the addition of strategy costs. We then show that potential games with strategy costs remain potential games. Both zero-sum and potential games with strategy costs maintain a very appealing property: simple learning dynamics converge to equilibrium. 1 The Approach and Basic Model How should an intelligent agent play a complicated game like chess, given that it does not have unlimited time to think? This question reflects one fundamental aspect of “bounded rationality,” a term coined by Herbert Simon [1]. However, bounded rationality has proven to be a slippery concept to formalize (prior work has focused largely on finite automata playing simple repeated games such as prisoner’s dilemma, e.g. [2, 3, 4, 5]). This paper focuses on the strategic aspects of decisionmaking in complex multi-agent environments, i.e., on how a player should choose among strategies of varying complexity, given that its opponents are making similar decisions. Our model applies to general strategic games and allows for a variety of complexities that arise in real-world applications. For this reason, it is applicable to one-shot games, to extensive games, and to repeated games, and it generalizes existing models such as repeated games played by finite automata. To easily see that bounded rationality can drastically affect the outcome of a game, consider the following factoring game. Player 1 chooses an n-bit number and sends it to Player 2, who attempts to find its prime factorization. If Player 2 is correct, he is paid 1 by Player 1, otherwise he pays 1 to Player 1. Ignoring complexity costs, the game is a trivial win for Player 2. However, for large n, the game should is essentially a win for Player 1, who can easily output a large random number that Player 2 cannot factor (under appropriate complexity assumptions). In general, the outcome of a game (even a zero-sum game like chess) with bounded rationality is not so clear. To concretely model such games, we consider a set of available strategies along with strategy costs. Consider an example of two players preparing to play a computerized chess game for $100K prize. Suppose the players simultaneously choose among two available options: to use a $10K program A or an advanced program B, which costs $50K. We refer to the row chooser as white and to the column chooser as black, with the corresponding advantages reflected by the win probabilities of white described in Table 1a. For example, when both players use program A, white wins 55% of the time and black wins 45% of the time (we ignore draws). The players naturally want to choose strategies to maximize their expected net payoffs, i.e., their expected payoff minus their cost. Each cell in Table 1b contains a pair of payoffs in units of thousands of dollars; the first is white’s net expected payoff and the second is black’s. a) A B A 55% 93% B 13% 51% b) A (-10) B (-50) A (-10) 45, 35 43,-3 B (-50) 3, 37 1,-1 Figure 1: a) Table of first-player winning probabilities based on program choices. 
b) Table of expected net earnings in thousands of dollars. The unique equilibrium is (A,B) which strongly favors the second player. A surprising property is evident in the above game. Everything about the game seems to favor white. Yet due to the (symmetric) costs, at the unique Nash equilibrium (A,B) of Table 1b, black wins 87% of the time and nets $34K more than white. In fact, it is a dominant strategy for white to play A and for black to play B. To see this, note that playing B increases white’s probability of winning by 38%, independent of what black chooses. Since the pot is $100K, this is worth $38K in expectation, but B costs $40K more than A. On the other hand, black enjoys a 42% increase in probability of winning due to B, independent of what white does, and hence is willing to pay the extra $40K. Before formulating the general model, we comment on some important aspects of the chess example. First, traditional game theory states that chess can be solved in “only” two rounds of elimination of dominated strategies [10], and the outcome with optimal play should always be the same: either a win for white or a win for black. This theoretical prediction fails in practice: in top play, the outcome is very nondeterministic with white winning roughly twice as often as black. The game is too large and complex to be solved by brute force. Second, we have been able to analyze the above chess program selection example exactly because we formulated as a game with a small number of available strategies per player. Another formulation that would fit into our model would be to include all strategies of chess, with some reasonable computational costs. However, it is beyond our means to analyze such a large game. Third, in the example above we used monetary software cost to illustrate a type of strategy cost. But the same analysis could accommodate many other types of costs that can be measured numerically and subtracted from the payoffs, such as time or effort involved in the development or execution of a strategy, and other resource costs. Additional examples in this paper include the number of states in a finite automaton, the number of gates in a circuit, and the number of turns on a commuter’s route. Our analysis is limited, however, to cost functions that depend only on the strategy of the player and not the strategy chosen by its opponent. For example, if our players above were renting computers A or B and paying for the time of actual usage, then the cost of using A would depend on the choice of computer made by the opponent. Generalizing the example above, we consider a normal form game with the addition of strategy costs, a player-dependent cost for playing each available strategy. Our main results regard two important classes of games: constant-sum and potential games. Potential games with strategy costs remain potential games. While two-person constant-sum games are no longer constant, we give a basic structural description of optimal play in these games. Lastly, we show that known learning dynamics converge in both classes of games. 2 Definition of strategy costs We first define an N -person normal-form game G = (N, S, p) consisting of finite sets of (available) pure strategies S = (S1 , . . . , SN ) for the N players, and a payoff function p : S1 × . . . × SN → RN . Players simultaneously choose strategies si ∈ Si after which player i is rewarded with pi (s1 , . . . , sN ). 
A randomized or mixed strategy σi for player i is a probability distribution over its pure strategies Si , σi ∈ ∆i = x ∈ R|Si | : xj = 1, xj ≥ 0 . We extend p to ∆1 × . . . × ∆N in the natural way, i.e., pi (σ1 , . . . , σN ) = E[pi (s1 , . . . , sN )] where each si is drawn from σi , independently. Denote by s−i = (s1 , s2 , . . . , si−1 , si+1 , . . . , sN ) and similarly for σ−i . A best response by player i to σ−i is σi ∈ ∆i such that pi (σi , σ−i ) = maxσi ∈∆i pi (σi , σ−i ). A (mixed strategy) Nash equilibrium of G is a vector of strategies (σ1 , . . . , σN ) ∈ ∆1 × . . . × ∆N such that each σi is a best response to σ−i . We now define G−c , the game G with strategy costs c = (c1 , . . . , cN ), where ci : Si → R. It is simply an N -person normal-form game G−c = (N, S, p−c ) with the same sets of pure strategies as G, but with a new payoff function p−c : S1 × . . . × SN → RN where, p−c (s1 , . . . , sN ) = pi (s1 , . . . , sN ) − ci (si ), for i = 1, . . . , N. i We similarly extend ci to ∆i in the natural way. 3 Two-person constant-sum games with strategy costs Recall that a game is constant-sum (k-sum for short) if at every combination of individual strategies, the players’ payoffs sum to some constant k. Two-person k-sum games have some important properties, not shared by general sum games, which result in more effective game-theoretic analysis. In particular, every k-sum game has a unique value v ∈ R. A mixed strategy for player 1 is called optimal if it guarantees payoff ≥ v against any strategy of player 2. A mixed strategy for player 2 is optimal if it guarantees ≥ k − v against any strategy of player 1. The term optimal is used because optimal strategies guarantee as much as possible (v + k − v = k) and playing anything that is not optimal can result in a lesser payoff, if the opponent responds appropriately. (This fact is easily illustrated in the game rock-paper-scissors – randomizing uniformly among the strategies guarantees each player 50% of the pot, while playing anything other than uniformly random enables the opponent to win strictly more often.) The existence of optimal strategies for both players follows from the min-max theorem. An easy corollary is that the Nash equilibria of a k-sum game are exchangeable: they are simply the cross-product of the sets of optimal mixed strategies for both players. Lastly, it is well-known that equilibria in two-person k-sum games can be learned in repeated play by simple dynamics that are guaranteed to converge [17]. With the addition of strategy costs, a k-sum game is no longer k-sum and hence it is not clear, at first, what optimal strategies there are, if any. (Many examples of general-sum games do not have optimal strategies.) We show the following generalization of the above properties for zero-sum games with strategies costs. Theorem 1. Let G be a finite two-person k-sum game and G−c be the game with strategy costs c = (c1 , c2 ). 1. There is a value v ∈ R for G−c and nonempty sets OPT1 and OPT2 of optimal mixed strategies for the two players. OPT1 is the set of strategies that guarantee player 1 payoff ≥ v − c2 (σ2 ), against any strategy σ2 chosen by player 2. Similarly, OPT2 is the set of strategies that guarantee player 2 payoff ≥ k − v − c1 (σ1 ) against any σ1 . 2. The Nash equilibria of G−c are exchangeable: the set of Nash equilibria is OPT1 ×OPT2 . 3. The set of net payoffs possible at equilibrium is an axis-parallel rectangle in R2 . 
For zero-sum games, the term optimal strategy was natural: the players could guarantee v and k − v, respectively, and this is all that there was to share. Moreover, it is easy to see that only pairs of optimal strategies can have the Nash equilibria property, being best responses to each other. In the case of zero-sum games with strategy costs, the optimal structure is somewhat counterintuitive. First, it is strange that the amount guaranteed by either player depends on the cost of the other player’s action, when in reality each player pays the cost of its own action. Second, it is not even clear why we call these optimal strategies. To get a feel for this latter issue, notice that the sum of the net payoffs to the two players is always k − c1 (σ1 ) − c2 (σ2 ), which is exactly the total of what optimal strategies guarantee, v − c2 (σ2 ) + k − v − c1 (σ1 ). Hence, if both players play what we call optimal strategies, then neither player can improve and they are at Nash equilibrium. On the other hand, suppose player 1 selects a strategy σ1 that does not guarantee him payoff at least v − c2 (σ2 ). This means that there is some response σ2 by player 2 for which player 1’s payoff is < v − c2 (σ2 ) and hence player 2’s payoff is > k − v − c1 (σ1 ). Thus player 2’s best response to σ1 must give player 2 payoff > k − v − c1 (σ1 ) and leave player 1 with < v − c2 (σ2 ). The proof of the theorem (the above reasoning only implies part 2 from part 1) is based on the following simple observation. Consider the k-sum game H = (N, S, q) with the following payoffs: q1 (s1 , s2 ) = p1 (s1 , s2 ) − c1 (s1 ) + c2 (s2 ) = p−c (s1 , s2 ) + c2 (s2 ) 1 q2 (s1 , s2 ) = p2 (s1 , s2 ) − c2 (s1 ) + c1 (s1 ) = p−c (s1 , s2 ) + c1 (s1 ) 2 That is to say, Player 1 pays its strategy cost to Player 2 and vice versa. It is easy to verify that, ∀σ1 , σ1 ∈ ∆1 , σ2 ∈ ∆2 q1 (σ1 , σ2 ) − q1 (σ1 , σ2 ) = p−c (σ1 , σ2 ) − p−c (σ1 , σ2 ) 1 1 (1) This means that the relative advantage in switching strategies in games G−c and H are the same. In particular, σ1 is a best response to σ2 in G−c if and only if it is in H. A similar equality holds for player 2’s payoffs. Note that these conditions imply that the games G−c and H are strategically equivalent in the sense defined by Moulin and Vial [16]. Proof of Theorem 1. Let v be the value of the game H. For any strategy σ1 that guarantees player 1 payoff ≥ v in H, σ1 guarantees player 1 ≥ v − c2 (σ2 ) in G−c . This follows from the definition of H. Similarly, any strategy σ2 that guarantees player 2 payoff ≥ k − v in H will guarantee ≥ k − v − c1 (σ1 ) in G−c . Thus the sets OPT1 and OPT2 are non-empty. Since v − c2 (σ2 ) + k − v − c1 (σ1 ) = k − c1 (σ1 ) − c2 (σ2 ) is the sum of the payoffs in G−c , nothing greater can be guaranteed by either player. Since the best responses of G−c and H are the same, the Nash equilibria of the two games are the same. Since H is a k-sum game, its Nash equilibria are exchangeable, and thus we have part 2. (This holds for any game that is strategically equivalent to k-sum.) Finally, the optimal mixed strategies OPT1 , OPT2 of any k-sum game are convex sets. If we look at the achievable costs of the mixed strategies in OPTi , by the definition of the cost of a mixed strategy, this will be a convex subset of R, i.e., an interval. By parts 1 and 2, the set of achievable net payoffs at equilibria of G−c are therefore the cross-product of intervals. To illustrate Theorem 1 graphically, Figure 2 gives a 4 × 4 example with costs of 1, 2, 3, and 4, respectively. 
It illustrates a situation with multiple optimal strategies. Notice that player 1 is completely indifferent between its optimal choices A and B, and player 2 is completely indifferent between C and D. Thus the only question is how kind they would like to be to their opponent. The (A,C) equilibrium is perhaps most natural as it is yields the highest payoffs for both parties. Note that the proof of the above theorem actually shows that zero-sum games with costs share additional appealing properties of zero-sum games. For example, computing optimal strategies is a polynomial time-computation in an n × n game, as it amounts to computing the equilibria of H. We next show that they also have appealing learning properties, though they do not share all properties of zero-sum games.1 3.1 Learning in repeated two-person k-sum games with strategy costs Another desirable property of k-sum games is that, in repeated play, natural learning dynamics converge to the set of Nash equilibria. Before we state the analogous conditions for k-sum games with costs, we briefly give a few definitions. A repeated game is one in which players chooses a sequence of strategies vectors s1 , s2 , . . ., where each st = (st , . . . , st ) is a strategy vector of some 1 N fixed stage game G = (N, S, p). Under perfect monitoring, when selecting an action in any period the players know all the previous selected actions.As we shall discuss, it is possible to learn to play without perfect monitoring as well. 1 One property that is violated by the chess example is the “advantage of an advantage” property. Say Player 1 has the advantage over Player 2 in a square game if p1 (s1 , s2 ) ≥ p2 (s2 , s1 ) for all strategies s1 , s2 . At equilibrium of a k-sum game, a player with the advantage must have a payoff at least as large as its opponent. This is no longer the case after incorporating strategy costs, as seen in the chess example, where Player 1 has the advantage (even including strategy costs), yet his equilibrium payoff is smaller than 2’s. a) A B C D A 6, 4 7, 3 7.5, 2.5 8.5, 1.5 B 5, 5 6, 4 6.5, 3.5 7, 3 C 3, 7 4, 6 4.5, 5.5 5.5, 4.5 D 2, 8 3, 7 3.5, 6.5 4.5, 5.5 b) A (-1) B (-2) C (-3) D (-4) A (-1) 5, 3 5, 2 4.5, 1.5 4.5, 0.5 B (-2) 4, 3 4, 2 3.5, 1.5 3, 1 C (-3) 2, 4 2, 3 1.5, 2.5 1.5, 1.5 D (-4) 1, 4 1, 3 0.5, 2.5 0.5, 1.5 PLAYER 2 NET PAYOFF Nash Eq. A,D value A,C B,D B,C C,D C,C D,D D,C A,B A,A B,B B,A C,B C,A D,B D,A PLAYER 1 NET PAYOFF Figure 2: a) Payoffs in 10-sum game G. b) Expected net earnings in G−c . OPT1 is any mixture of A and B, and OPT2 is any mixture of C and D. Each player’s choice of equilibrium strategy affects only the opponent’s net payoff. c) A graphical display of the payoff pairs. The shaded region shows the rectangular set of payoffs achievable at mixed strategy Nash equilibria. Perhaps the most intuitive dynamics are best-response: at each stage, each player selects a best response to the opponent’s previous stage play. Unfortunately, these naive dynamics fails to converge to equilibrium in very simple examples. The fictitious play dynamics prescribe, at stage t, selecting any strategy that is a best response to the empirical distribution of opponent’s play during the first t − 1 stages. It has been shown that fictitious play converges to equilibrium (of the stage game G) in k-sum games [17]. However, fictitious play requires perfect monitoring. One can learn to play a two-person k-sum game with no knowledge of the payoff table or anything about the other players actions. 
Using experimentation, the only observations required by each player are its own payoffs in each period (in addition to the number of available actions). So-called bandit algorithms [7] must manage the exploration-exploitation tradeoff. The proof of their convergence follows from the fact that they are no-regret algorithms. (No-regret algorithms date back to Hannan in the 1950’s [12], but his required perfect monitoring). The regret of a player i at stage T is defined to be, T regret of i at T = 1 max pi (si , st ) − pi (st , st ) , −i i −i T si ∈Si t=1 that is, how much better in hindsight player i could have done on the first T stages had it used one fixed strategy the whole time (and had the opponents not changed their strategies). Note that regret can be positive or negative. A no-regret algorithm is one in which each player’s asymptotic regret converges to (−∞, 0], i.e., is guaranteed to approach 0 or less. It is well-known that noregret condition in two-person k-sum games implies convergence to equilibrium (see, e.g., [13]). In particular, the pair of mixed strategies which are the empirical distributions of play over time approaches the set of Nash equilibrium of the stage game. Inverse-polynomial rates of convergence (that are polynomial also in the size of the game) can be given for such algorithms. Hence no-regret algorithms provide arguably reasonable ways to play a k-sum game of moderate size. Note that in general-sum games, no such dynamics are known. Fortunately, the same algorithm that works for learning in k-sum games seem to work for learning in such games with strategy costs. Theorem 2. Fictitious play converges to the set of Nash equilibria of the stage game in a two-person k-sum game with strategy costs, as do no-regret learning dynamics. Proof. The proof again follows from equation (1) regarding the game H. Fictitious play dynamics are defined only in terms of best response play. Since G−c and H share the same best responses, fictitious play dynamics are identical for the two games. Since they share the same equilibria and fictitious play converges to equilibria in H, it must converge in G−c as well. For no-regret algorithms, equation (1) again implies that for any play sequence, the regret of each player i with respect to game G−c is the same as its regret with respect to the game H. Hence, no regret in G−c implies no regret in H. Since no-regret algorithms converge to the set of equilibria in k-sum games, they converge to the set of equilibria in H and therefore in G−c as well. 4 Potential games with strategic costs Let us begin with an example of a potential game, called a routing game [18]. There is a fixed directed graph with n nodes and m edges. Commuters i = 1, 2, . . . , N each decide on a route πi , to take from their home si to their work ti , where si and ti are nodes in the graph. For each edge, uv, let nuv be the number of commuters whose path πi contains edge uv. Let fuv : Z → R be a nonnegative monotonically increasing congestion function. Player i’s payoff is − uv∈πi fuv (nuv ), i.e., the negative sum of the congestions on the edges in its path. An N -person normal form game G is said to be a potential game [15] if there is some potential function Φ : S1 × . . . SN → R such that changing a single player’s action changes its payoff by the change in the potential function. That is, there exists a single function Φ, such that for all players i and all pure strategy vectors s, s ∈ S1 × . . . 
× SN that differ only in the ith coordinate, pi (s) − pi (s ) = Φ(s) − Φ(s ). (2) Potential games have appealing learning properties: simple better-reply dynamics converge to purestrategy Nash equilibria, as do the more sophisticated fictitious-play dynamics described earlier [15]. In our example, this means that if players change their individual paths so as to selfishly reduce the sum of congestions on their path, this will eventually lead to an equilibrium where no one can improve. (This is easy to see because Φ keeps increasing.) The absence of similar learning properties for general games presents a frustrating hole in learning and game theory. It is clear that the theoretically clean commuting example above misses some realistic considerations. One issue regarding complexity is that most commuters would not be willing to take a very complicated route just to save a short amount of time. To model this, we consider potential games with strategy costs. In our example, this would be a cost associated with every path. For example, suppose the graph represented streets in a given city. We consider a natural strategy complexity cost associated with a route π, say λ(#turns(π))2 , where there is a parameter λ ∈ R and #turns(π) is defined as the number of times that a commuter has to turn on a route. (To be more precise, say each edge in the graph is annotated with a street name, and a turn is defined to be a pair of consecutive edges in the graph with different street names.) Hence, a best response for player i would minimize: π min (total congestion of π) + λ(#turns(π))2 . from si to ti While adding strategy costs to potential games allows for much more flexibility in model design, one might worry that appealing properties of potential games, such as having pure strategy equilibria and easy learning dynamics, no longer hold. This is not the case. We show that strategic costs fit easily into the potential game framework: Theorem 3. For any potential game G and any cost functions c, G−c is also a potential game. Proof. Let Φ be a potential function for G. It is straightforward to verify that the G−c admits the following potential function Φ : Φ (s1 , . . . , sN ) = Φ(s1 , . . . , sN ) − c1 (s1 ) − . . . − cN (sN ). 5 Additional remarks Part of the reason that the notion of bounded rationality is so difficult to formalize is that understanding enormous games like chess is a daunting proposition. That is why we have narrowed it down to choosing among a small number of available programs. A game theorist might begin by examining the complete payoff table of Figure 1a, which is prohibitively large. Instead of considering only the choices of programs A and B, each player considers all possible chess strategies. In that sense, our payoff table in 1a would be viewed as a reduction of the “real” normal form game. A computer scientist, on the other hand, may consider it reasonable to begin with the existing strategies that one has access to. Regardless of how you view the process, it is clear that for practical purposes players in real life do simplify and analyze “smaller” sets of strategies. Even if the players consider the option of engineering new chess-playing software, this can be viewed as a third strategy in the game, with its own cost and expected payoffs. Again, when considering small number of available strategies, like the two programs above, it may still be difficult to assess the expected payoffs that result when (possibly randomized) strategies play against each other. 
An additional assumption made throughout the paper is that the players share the same assessments about these expected payoffs. Like other common-knowledge assumptions made in game theory, it would be desirable to weaken this assumption. In the special families of games studied in this paper, and perhaps in additional cases, learning algorithms may be employed to reach equilibrium without knowledge of payoffs. 5.1 Finite automata playing repeated games There has been a large body of interesting work on repeated games played by finite automata (see [14] for a survey). Much of this work is on achieving cooperation in the classic prisoner’s dilemma game (e.g., [2, 3, 4, 5]). Many of these models can be incorporated into the general model outlined in this paper. For example, to view the Abreu and Rubinstein model [6] as such, consider the normal form of an infinitely repeated game with discounting, but restricted to strategies that can be described by finite automata (the payoffs in every cell of the payoff table are the discounted sums of the infinite streams of payoffs obtained in the repeated game). Let the cost of a strategy be an increasing function of the number of states it employs. For Neyman’s model [3], consider the normal form of a finitely repeated game with a known number of repetitions. You may consider strategies in this normal form to be only ones with a bounded number of states, as required by Neyman, and assign zero cost to all strategies. Alternatively, you may allow all strategies but assign zero cost to ones that employ number of states below Neyman’s bounds, and an infinite cost to strategies that employ a number of states that exceeds Neyman’s bounds. The structure of equilibria proven in Theorem 1 applies to all the above models when dealing with repeated k-sum games, as in [2]. 6 Future work There are very interesting questions to answer about bounded rationality in truly large games that we did not touch upon. For example, consider the factoring game from the introduction. A pure strategy for Player 1 would be outputting a single n-bit number. A pure strategy for Player 2 would be any factoring program, described by a circuit that takes as input an n-bit number and attempts to output a representation of its prime factorization. The complexity of such a strategy would be an increasing function of the number of gates in the circuit. It would be interesting to make connections between asymptotic algorithm complexity and games. Another direction regards an elegant line of work on learning to play correlated equilibria by repeated play [11]. It would be natural to consider how strategy costs affect correlated equilibria. Finally, it would also be interesting to see how strategy costs affect the so-called “price of anarchy” [19] in congestion games. Acknowledgments This work was funded in part by a U.S. NSF grant SES-0527656, a Landau Fellowship supported by the Taub and Shalom Foundations, a European Community International Reintegration Grant, an Alon Fellowship, ISF grant 679/06, and BSF grant 2004092. Part of this work was done while the first and second authors were at the Toyota Technological Institute at Chicago. References [1] H. Simon. The sciences of the artificial. MIT Press, Cambridge, MA, 1969. [2] E. Ben-Porath. Repeated games with finite automata, Journal of Economic Theory 59: 17–32, 1993. [3] A. Neyman. Bounded Complexity Justifies Cooperation in the Finitely Repeated Prisoner’s Dilemma. Economic Letters, 19: 227–229, 1985. [4] A. Rubenstein. 
Finite automata play the repeated prisoner’s dilemma. Journal of Economic Theory, 39:83– 96, 1986. [5] C. Papadimitriou, M. Yannakakis: On complexity as bounded rationality. In Proceedings of the TwentySixth Annual ACM Symposium on Theory of Computing, pp. 726–733, 1994. [6] D. Abreu and A. Rubenstein. The Structure of Nash Equilibrium in Repeated Games with Finite Automata. Econometrica 56:1259-1281, 1988. [7] P. Auer, N. Cesa-Bianchi, Y. Freund, R. Schapire. The Nonstochastic Multiarmed Bandit Problem. SIAM J. Comput. 32(1):48-77, 2002. [8] X. Chen, X. Deng, and S. Teng. Computing Nash Equilibria: Approximation and smoothed complexity. Electronic Colloquium on Computational Complexity Report TR06-023, 2006. [9] K. Daskalakis, P. Goldberg, C. Papadimitriou. The complexity of computing a Nash equilibrium. Electronic Colloquium on Computational Complexity Report TR05-115, 2005. [10] C. Ewerhart. Chess-like Games Are Dominance Solvable in at Most Two Steps. Games and Economic Behavior, 33:41-47, 2000. [11] D. Foster and R. Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 21:40-55, 1997. [12] J. Hannan. Approximation to Bayes risk in repeated play. In M. Dresher, A. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume 3, pp. 97–139. Princeton University Press, 1957. [13] S. Hart and A. Mas-Colell. A General Class of Adaptive Strategies. Journal of Economic Theory 98(1):26– 54, 2001. [14] E. Kalai. Bounded rationality and strategic complexity in repeated games. In T. Ichiishi, A. Neyman, and Y. Tauman, editors, Game Theory and Applications, pp. 131–157. Academic Press, San Diego, 1990. [15] D. Monderer, L. Shapley. Potential games. Games and Economic Behavior, 14:124–143, 1996. [16] H. Moulin and P. Vial. Strategically Zero Sum Games: the Class of Games Whose Completely Mixed Equilibria Cannot Be Improved Upon. International Journal of Game Theory, 7:201–221, 1978. [17] J. Robinson, An iterative method of solving a game, Ann. Math. 54:296–301, 1951. [18] R. Rosenthal. A Class of Games Possessing Pure-Strategy Nash Equilibria. International Journal of Game Theory, 2:65–67, 1973. [19] E. Koutsoupias and C. Papadimitriou. Worstcase equilibria. In Proceedings of the 16th Annual Symposium on Theoretical Aspects of Computer Science, pp. 404–413, 1999.
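To make the chess-program example above concrete, the short sketch below recomputes the net-payoff table of Figure 1b from the stated winning probabilities and program costs, and checks that A is dominant for white while B is dominant for black. It only restates numbers already given in the text, with Python used for the arithmetic.

# P(white wins), indexed by (white's program, black's program), from Figure 1a.
p_white_wins = {("A", "A"): 0.55, ("A", "B"): 0.13,
                ("B", "A"): 0.93, ("B", "B"): 0.51}
cost = {"A": 10.0, "B": 50.0}   # program costs in $K
pot = 100.0                     # prize in $K

def net_payoffs(w, b):
    # Expected net earnings (white, black) in $K when white runs w and black runs b.
    p = p_white_wins[(w, b)]
    return pot * p - cost[w], pot * (1.0 - p) - cost[b]

for w in "AB":
    for b in "AB":
        print(w, b, net_payoffs(w, b))   # reproduces Figure 1b, e.g. (A, B) -> (3, 37)

# A is a dominant strategy for white and B is a dominant strategy for black,
# so (A, B) is the unique equilibrium.
assert all(net_payoffs("A", b)[0] > net_payoffs("B", b)[0] for b in "AB")
assert all(net_payoffs(w, "B")[1] > net_payoffs(w, "A")[1] for w in "AB")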

3 0.76281548 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games

Author: Chris Murray, Geoffrey J. Gordon

Abstract: In real-world planning problems, we must reason not only about our own goals, but about the goals of other agents with which we may interact. Often these agents’ goals are neither completely aligned with our own nor directly opposed to them. Instead there are opportunities for cooperation: by joining forces, the agents can all achieve higher utility than they could separately. But, in order to cooperate, the agents must negotiate a mutually acceptable plan from among the many possible ones, and each agent must trust that the others will follow their parts of the deal. Research in multi-agent planning has often avoided the problem of making sure that all agents have an incentive to follow a proposed joint plan. On the other hand, while game theoretic algorithms handle incentives correctly, they often don’t scale to large planning problems. In this paper we attempt to bridge the gap between these two lines of research: we present an efficient game-theoretic approximate planning algorithm, along with a negotiation protocol which encourages agents to compute and agree on joint plans that are fair and optimal in a sense defined below. We demonstrate our algorithm and protocol on two simple robotic planning problems.1 1 INTRODUCTION We model the multi-agent planning problem as a general-sum stochastic game with cheap talk: the agents observe the state of the world, discuss their plans with each other, and then simultaneously select their actions. The state and actions determine a one-step reward for each player and a distribution over the world’s next state, and the process repeats. While talking allows the agents to coordinate their actions, it cannot by itself solve the problem of trust: the agents might lie or make false promises. So, we are interested in planning algorithms that find subgame-perfect Nash equilibria. In a subgame-perfect equilibrium, every deviation from the plan is deterred by the threat of a suitable punishment, and every threatened punishment is believable. To find these equilibria, planners must reason about their own and other agents’ incentives to deviate: if other agents have incentives to deviate then I can’t trust them, while if I have an incentive to deviate, they can’t trust me. In a given game there may be many subgame-perfect equilibria with widely differing payoffs: some will be better for some agents, and others will be better for other agents. It is generally not feasible to compute all equilibria [1], and even if it were, there would be no obvious way 1 We gratefully acknowledge help and comments from Ron Parr on this research. This work was supported in part by DARPA contracts HR0011-06-0023 (the CS2P program) and 55-00069 (the RADAR program). All opinions, conclusions, and errors are our own. to select one to implement. It does not make sense for the agents to select an equilibrium without consulting one another: there is no reason that agent A’s part of one joint plan would be compatible with agent B’s part of another joint plan. Instead the agents must negotiate, computing and proposing equilibria until they find one which is acceptable to all parties. 
This paper describes a planning algorithm and a negotiation protocol which work together to ensure that the agents compute and select a subgame-perfect Nash equilibrium which is both approximately Pareto-optimal (that is, its value to any single agent cannot be improved very much without lowering the value to another another agent) and approximately fair (that is, near the so-called Nash bargaining point). Neither the algorithm nor the protocol is guaranteed to work in all games; however, they are guaranteed correct when they are applicable, and applicability is easy to check. In addition, our experiments show that they work well in some realistic situations. Together, these properties of fairness, enforceability, and Pareto optimality form a strong solution concept for a stochastic game. The use of this definition is one characteristic that distinguishes our work from previous research: ours is the first efficient algorithm that we know of to use such a strong solution concept for stochastic games. Our planning algorithm performs dynamic programming on a set-based value function: for P players, at a state s, V ∈ V(s) ⊂ RP is an estimate of the value the players can achieve. We represent V(s) by sampling points on its convex hull. This representation is conservative, i.e., guarantees that we find a subset of the true V∗ (s). Based on the sampled points we can efficiently compute one-step backups by checking which joint actions are enforceable in an equilibrium. Our negotiation protocol is based on a multi-player version of Rubinstein’s bargaining game. Players together enumerate a set of equilibria, and then take turns proposing an equilibrium from the set. Until the players agree, the protocol ends with a small probability after each step and defaults to a low-payoff equilibrium; the fear of this outcome forces players to make reasonable offers. 2 2.1 BACKGROUND STOCHASTIC GAMES A stochastic game represents a multi-agent planning problem in the same way that a Markov Decision Process [2] represents a single-agent planning problem. As in an MDP, transitions in a stochastic game depend on the current state and action. Unlike MDPs, the current (joint) action is a vector of individual actions, one for each player. More formally, a generalsum stochastic game G is a tuple (S, sstart , P, A, T, R, γ). S is a set of states, and sstart ∈ S is the start state. P is the number of players. A = A1 ×A2 ×. . .×AP is the finite set of joint actions. We deal with fully observable stochastic games with perfect monitoring, where all players can observe previous joint actions. T : S × A → P (S) is the transition function, where P (S) is the set of probability distributions over S. R : S × A → RP is the reward function. We will write Rp (s, a) for the pth component of R(s, a). γ ∈ [0, 1) is the discount factor. Player p wants to maximize her discounted total value for the observed sequence ∞ of states and joint actions s1 , a1 , s2 , a2 , . . ., Vp = t=1 γ t−1 Rp (st , at ). A stationary policy for player p is a function πp : S → P (Ap ). A stationary joint policy is a vector of policies π = (π1 , . . . , πP ), one for each player. A nonstationary policy for player p is a function πp : (∪∞ (S × A)t × S) → P (Ap ) which takes a history of states and joint actions and t=0 produces a distribution over player p’s actions; we can define a nonstationary joint policy analogously. For any nonstationary joint policy, there is a stationary policy that achieves the same value at every state [3]. 
The value function Vpπ : S → R gives expected values for player p under joint policy π. The value vector at state s, Vπ (s), is the vector with components Vpπ (s). (For a nonstationary policy π we will define Vpπ (s) to be the value if s were the start state, and Vpπ (h) to be the value after observing history h.) A vector V is feasible at state s if there is a π for which Vπ (s) = V, and we will say that π achieves V. We will assume public randomization: the agents can sample from a desired joint action distribution in such a way that everyone can verify the outcome. If public randomization is not directly available, there are cryptographic protocols which can simulate it [4]. This assumption means that the set of feasible value vectors is convex, since we can roll a die at the first time step to choose from a set of feasible policies. 2.2 EQUILIBRIA While optimal policies for MDPs can be determined exactly via various algorithms such as linear programming [2], it isn’t clear what it means to find an optimal policy for a general sum stochastic game. So, rather than trying to determine a unique optimal policy, we will define a set of reasonable policies: the Pareto-dominant subgame-perfect Nash equilibria. A (possibly nonstationary) joint policy π is a Nash equilibrium if, for each individual player, no unilateral deviation from the policy would increase that player’s expected value for playing the game. Nash equilibria can contain incredible threats, that is, threats which the agents have no intention of following through on. To remove this possibility, we can define the subgame-perfect Nash equilibria. A policy π is a subgame-perfect Nash equilibrium if it is a Nash equilibrium in every possible subgame: that is, if there is no incentive for any player to deviate after observing any history of joint actions. Finally, consider two policies π and φ. If Vpπ (sstart ) ≥ Vpφ (sstart ) for all players p, and if Vpπ (sstart ) > Vpφ (sstart ) for at least one p, then we will say that π Pareto dominates φ. A policy which is not Pareto dominated by any other policy is Pareto optimal. 2.3 RELATED WORK Littman and Stone [5] give an algorithm for finding Nash equilibria in two-player repeated games. Hansen et al. [6] show how to eliminate very-weakly-dominated strategies in partially observable stochastic games. Doraszelski and Judd [7] show how to compute Markov perfect equilibria in continuous-time stochastic games. The above papers use solution concepts much weaker than Pareto-dominant subgame-perfect equilibrium, and do not address negotiation and coordination. Perhaps the closest work to the current paper is by Brafman and Tennenholtz [8]: they present learning algorithms which, in repeated self-play, find Pareto-dominant (but not subgame-perfect) Nash equilibria in matrix and stochastic games. By contrast, we consider a single play of our game, but allow “cheap talk” beforehand. And, our protocol encourages arbitrary algorithms to agree on Pareto-dominant equilibria, while their result depends strongly on the self-play assumption. 2.3.1 FOLK THEOREMS In any game, each player can guarantee herself an expected discounted value regardless of what actions the other players takes. We call this value the safety value. Suppose that there is a stationary subgame-perfect equilibrium which achieves the safety value for both players; call this the safety equilibrium policy. Suppose that, in a repeated game, some stationary policy π is better for both players than the safety equilibrium policy. 
Then we can build a subgame-perfect equilibrium with the same payoff as π: start playing π, and if someone deviates, switch to the safety equilibrium policy. So long as γ is sufficiently large, no rational player will want to deviate. This is the folk theorem for repeated games: any feasible value vector which is strictly better than the safety values corresponds to a subgame-perfect Nash equilibrium [9]. (The proof is slightly more complicated if there is no safety equilibrium policy, but the theorem holds for any repeated game.) There is also a folk theorem for general stochastic games [3]. This theorem, while useful, is not strong enough for our purposes: it only covers discount factors γ which are so close to 1 that the players don’t care which state they wind up in after a possible deviation. In most practical stochastic games, discount factors this high are unreasonably patient. When γ is significantly less than 1, the set of equilibrium vectors can change in strange ways as we change γ [10]. Value to player 2 1 0.5 0 0 0.5 1 1.5 2 Value to player 1 2.5 3 Figure 1: Equilibria of a Rubinstein game with γ = 0.8. Shaded area shows feasible value vectors (U1 (x), U2 (x)) for outcomes x. Right-hand circle corresponds to equilibrium when player 1 moves first, left-hand circle when player 2 moves first. The Nash point is at 3. 2.3.2 RUBINSTEIN’S GAME Rubinstein [11] considered a game where two players divide a slice of pie. The first player offers a division x, 1 − x to the second; the second player either accepts the division, or refuses and offers her own division 1 − y, y. The game repeats until some player accepts an offer or until either player gives up. In the latter case neither player gets any pie. Rubinstein showed that if player p’s utility for receiving a fraction x at time t is Up (x, t) = γ t Up (x) for a discount factor 0 ≤ γ < 1 and an appropriate time-independent utility function Up (x) ≥ 0, then rational players will agree on a division near the so-called Nash bargaining point. This is the point which maximizes the product of the utilities that the players gain by cooperating, U1 (x)U2 (1 − x). As γ ↑ 1, the equilibrium will approach the Nash point. See Fig. 1 for an illustration. For three or more players, a similar result holds where agents take turns proposing multi-way divisions of the pie [12]. See the technical report [13] for more detail on the multi-player version of Rubinstein’s game and the Nash bargaining point. 3 NEGOTIATION PROTOCOL The Rubinstein game implicitly assumes that the result of a failure to cooperate is known to all players: nobody gets any pie. The multi-player version of the game assumes in addition that giving one player a share of the pie doesn’t force us to give a share to any other player. Neither of these properties holds for general stochastic games. They are, however, easy to check, and often hold or can be made to hold for planning domains of interest. So, we will assume that the players have agreed beforehand on a subgame-perfect equilibrium π dis , called the disagreement policy, that they will follow in the event of a negotiation failure. In addition, for games with three or more players, we will assume that each player can unilaterally reduce her own utility by any desired amount without affecting other players’ utilities. Given these assumptions, our protocol proceeds in two phases (pseudocode is given in the technical report [13]. In the first phase agents compute subgame-perfect equilibria and take turns revealing them. 
On an agent’s turn she either reveals an equilibrium or passes; if all agents pass consecutively, the protocol proceeds to the second phase. When an agent states a policy π, the other agents verify that π is a subgame-perfect equilibrium and calculate its payoff vector Vπ (sstart ); players who state non-equilibrium policies miss their turn. At the end of the first phase, suppose the players have revealed a set Π of policies. Define Xp (π) = Vpπ (sstart ) − Vpdis (sstart ) U = convhull {X(π) | π ∈ Π} U = {u ≥ 0 | (∃v ∈ U | u ≤ v)} where Vdis is the value function of π dis , Xp (π) is the excess of policy π for player p, and U is the set of feasible excess vectors. In the second phase, players take turns proposing points u ∈ U along with policies or mixtures of policies in Π that achieve them. After each proposal, all agents except the pro- poser decide whether to accept or reject. If everyone accepts, the proposal is implemented: everyone starts executing the agreed equilibrium. Otherwise, the players who accepted are removed from future negotiation and have their utilities fixed at the proposed levels. Fixing player p’s utility at up means that all future proposals must give p exactly up . Invalid proposals cause the proposer to lose her turn. To achieve this, the proposal may require p to voluntarily lower her own utility; this requirement is enforced by the threat that all players will revert to π dis if p fails to act as required. If at some point we hit the chance of having the current round of communication end, all remaining players are assigned their disagreement values. The players execute the last proposed policy π (or π dis if there has been no valid proposal), and any player p for whom Vpπ (sstart ) is greater than her assigned utility up voluntarily lowers her utility to the correct level. (Again, failure to do so results in all players reverting to π dis .) Under the above protocol, player’s preferences are the same as in a Rubinstein game with utility set U: because we have assumed that negotiation ends with probability after each message, agreeing on u after t additional steps is exactly as good as agreeing on u(1− )t now. So with sufficiently small, the Rubinstein or Krishna-Serrano results show that rational players will agree on a vector u ∈ U which is close to the Nash point argmaxu∈U Πp up . 4 COMPUTING EQUILIBRIA In order to use the protocol of Sec. 3 for bargaining in a stochastic game, the players must be able to compute some subgame-perfect equilibria. Computing equilibria is a hard problem, so we cannot expect real agents to find the entire set of equilibria. Fortunately, each player will want to find the equilibria which are most advantageous to herself to influence the negotiation process in her favor. But equilibria which offer other players reasonably high reward have a higher chance of being accepted in negotiation. So, self interest will naturally distribute the computational burden among all the players. In this section we describe an efficient dynamic-programming algorithm for computing equilibria. The algorithm takes some low-payoff equilibria as input and (usually) outputs higherpayoff equilibria. It is based on the intuition that we can use low-payoff equilibria as enforcement tools: by threatening to switch to an equilibrium that has low value to player p, we can deter p from deviating from a cooperative policy. pun pun In more detail, we will assume that we are given P different equilibria π1 , . . . 
, πP ; we pun pun dis will use πp to punish player p if she deviates. We can set πp = π for all p if π dis is the only equilibrium we know; or, we can use any other equilibrium policies that we happen pun to have discovered. The algorithm will be most effective when the value of πp to player p is as low as possible in all states. pun We will then search for cooperative policies that we can enforce with the given threats πp . We will first present an algorithm which pretends that we can efficiently take direct sums and convex hulls of arbitrary sets. This algorithm is impractical, but finds all enforceable value vectors. We will then turn it into an approximate algorithm which uses finite data structures to represent the set-valued variables. As we allow more and more storage for each set, the approximate algorithm will approach the exact one; and in any case the result will be a set of equilibria which the agents can execute. 4.1 THE EXACT ALGORITHM Our algorithm maintains a set of value vectors V(s) for each state s. It initializes V(s) to a set which we know contains the value vectors for all equilibrium policies. It then refines V by dynamic programming: it repeatedly attempts to improve the set of values at each state by backing up all of the joint actions, excluding joint actions from which some agent has an incentive to deviate. In more detail, we will compute Vpdis (s) ≡ Vpπdis (s) for all s and p and use the vector Vdis (s) in our initialization. (Recall that we have defined Vpπ (s) for a nonstationary policy π as the value of π if s were the start state.) We also need the values of the punishment policies for Initialization for s ∈ S V(s) ← {V | Vpdis (s) ≤ Vp ≤ Rmax /(1 − γ)} end Repeat until converged for iteration ← 1, 2, . . . for s ∈ S Compute value vector set for each joint action, then throw away unenforceable vectors for a ∈ A Q(s, a) ← {R(s, a)} + γ s ∈S T (s, a)(s )V(s ) Q(s, a) ← {Q ∈ Q(s, a) | Q ≥ Vdev (s, a)} end We can now randomize among joint actions V(s) ← convhull a Q(s, a) end end Figure 2: Dynamic programming using exact operations on sets of value vectors π pun their corresponding players, Vppun (s) ≡ Vp p (s) for all p and s. Given these values, define Qdev (s, a) = Rp (s, a) + γ p T (s, a)(s )Vppun (s ) (1) s ∈S pun to be the value to player p of playing joint action a from state s and then following πp forever after. From the above Qdev values we can compute player p’s value for deviating from an equilibp rium which recommends action a in state s: it is Qdev (s, a ) for the best possible deviation p a , since p will get the one-step payoff for a but be punished by the rest of the players starting on the following time step. That is, Vpdev (s, a) = max Qdev (s, a1 × . . . × ap × . . . × aP ) p ap ∈Ap (2) Vpdev (s, a) is the value we must achieve for player p in state s if we are planning to recommend pun action a and punish deviations with πp : if we do not achieve this value, player p would rather deviate and be punished. Our algorithm is shown in Fig. 2. After k iterations, each vector in V(s) corresponds to a k-step policy in which no agent ever has an incentive to deviate. In the k + 1st iteration, the first assignment to Q(s, a) computes the value of performing action a followed by any k-step policy. The second assignment throws out the pairs (a, π) for which some agent would want to deviate from a given that the agents plan to follow π in the future. 
And the convex hull accounts for the fact that, on reaching state s, we can select an action a and future policy π at random from the feasible pairs.2 Proofs of convergence and correctness of the exact algorithm are in the technical report [13]. Of course, we cannot actually implement the algorithm of Fig. 2, since it requires variables whose values are convex sets of vectors. But, we can approximate V(s) by choosing a finite set of witness vectors W ⊂ RP and storing V(s, w) = arg maxv∈V(s) (v·w) for each w ∈ W. V(s) is then approximated by the convex hull of {V(s, w) | w ∈ W}. If W samples the P dimensional unit hypersphere densely enough, the maximum possible approximation error will be small. (In practice, each agent will probably want to pick W differently, to focus her computation on policies in the portion of the Pareto frontier where her own utility is relatively high.) As |W| increases, the error introduced at each step will go to zero. The approximate algorithm is given in more detail in the technical report [13]. 2 It is important for this randomization to occur after reaching state s to avoid introducing incentives to deviate, and it is also important for the randomization to be public. P1 1 P2 1 P1 1 P2 1 P1 1 P2 1 P2 2 P1 2 P2 2 P1 2 P2 2 P1 2 Figure 3: Execution traces for our motion planning example. Left and Center: with 2 witness vectors , the agents randomize between two selfish paths. Right: with 4–32 witnesses, the agents find a cooperative path. Steps where either player gets a goal are marked with ×. 90 shop C D E D Value to Player 2 85 A 80 75 70 ABC 65 E D A B 60 40 50 60 Value to Player 1 70 80 Figure 4: Supply chain management problem. In the left figure, Player 1 is about to deliver part D to the shop, while player 2 is at the warehouse which sells B. The right figure shows the tradeoff between accuracy and computation time. The solid curve is the Pareto frontier for sstart , as computed using 8 witnesses per state. The dashed and dotted lines were computed using 2 and 4 witnesses, respectively. Dots indicate computed value vectors; × marks indicate the Nash points. 5 EXPERIMENTS We tested our value iteration algorithm and negotiation procedure on two robotic planning domains: a joint motion planning problem and a supply-chain management problem. In our motion planning problem (Fig. 3), two players together control a two-wheeled robot, with each player picking the rotational velocity for one wheel. Each player has a list of goal landmarks which she wants to cycle through, but the two players can have different lists of goals. We discretized states based on X, Y, θ and the current goals, and discretized actions into stop, slow (0.45 m ), and fast (0.9 m ), for 9 joint actions and about 25,000 states. s s We discretized time at ∆t = 1s, and set γ = 0.99. For both the disagreement policy and all punishment policies, we used “always stop,” since by keeping her wheel stopped either player can prevent the robot from moving. Planning took a few hours of wall clock time on a desktop workstation for 32 witnesses per state. Based on the planner’s output, we ran our negotiation protocol to select an equilibrium. Fig. 3 shows the results: with limited computation the players pick two selfish paths and randomize equally between them, while with more computation they find the cooperative path. 
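The witness-vector approximation described above can be sketched in a few lines: for each witness direction w, keep only the value vector attaining max v . w, which conservatively represents the convex hull V(s). The four sample vectors and the witness directions below are hypothetical.

import numpy as np

def witness_hull(vectors, witnesses):
    # Keep, for each witness direction w, the vector attaining max v . w
    # (the stored V(s, w) = argmax over v in V(s) of v . w); duplicates are dropped.
    kept = []
    for w in witnesses:
        best = max(vectors, key=lambda v: float(np.dot(v, w)))
        if not any(np.allclose(best, k) for k in kept):
            kept.append(best)
    return kept

# Hypothetical 2-player example with 4 witness directions in the positive quadrant.
vectors = [np.array(v, dtype=float) for v in [(3.0, 1.0), (2.0, 2.5), (1.0, 3.0), (2.0, 1.0)]]
witnesses = [np.array([np.cos(a), np.sin(a)]) for a in np.linspace(0.0, np.pi / 2, 4)]
print(witness_hull(vectors, witnesses))   # the interior point (2, 1) is discarded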
Our experiments also showed that limiting the computation available to one player allows the unrestricted player to reveal only some of the equilibria she knows about, tilting the outcome of the negotiation in her favor (see the technical report [13] for details). For our second experiment we examined a more realistic supply-chain problem. Here each player is a parts supplier competing for the business of an engine manufacturer. The manufacturer doesn’t store items and will only pay for parts which can be used immediately. Each player controls a truck which moves parts from warehouses to the assembly shop; she pays for parts when she picks them up, and receives payment on delivery. Each player gets parts from different locations at different prices and no one player can provide all of the parts the manufacturer needs. Each player’s truck can be at six locations along a line: four warehouse locations (each of which provides a different type of part), one empty location, and the assembly shop. Building an engine requires five parts, delivered in the order A, {B, C}, D, E (parts B and C can arrive in either order). After E, the manufacturer needs A again. Players can move left or right along the line at a small cost, or wait for free. They can also buy parts at a warehouse (dropping any previous cargo), or sell their cargo if they are at the shop and the manufacturer wants it. Each player can only carry one part at a time and only one player can make a delivery at a time. Finally, any player can retire and sell her truck; in this case the game ends and all players get the value of their truck plus any cargo. The disagreement policy is for all players to retire at all states. Fig. 4 shows the computed sets V(sstart ) for various numbers of witnesses. The more witnesses we use, the more accurately we represent the frontier, and the closer our final policy is to the true Nash point. All of the policies computed are “intelligent” and “cooperative”: a human observer would not see obvious ways to improve them, and in fact would say that they look similar despite their differing payoffs. Players coordinate their motions, so that one player will drive out to buy part E while the other delivers part D. They sit idle only in order to delay the purchase of a part which would otherwise be delivered too soon. 6 CONCLUSION Real-world planning problems involve negotiation among multiple agents with varying goals. To take all agents incentives into account, the agents should find and agree on Paretodominant subgame-perfect Nash equilibria. For this purpose, we presented efficient planning and negotiation algorithms for general-sum stochastic games, and tested them on two robotic planning problems. References [1] V. Conitzer and T. Sandholm. Complexity results about Nash equilibria. Technical Report CMU-CS-02-135, School of Computer Science, Carnegie-Mellon University, 2002. [2] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Massachusetts, 1995. [3] Prajit K. Dutta. A folk theorem for stochastic games. Journal of Economic Theory, 66:1–32, 1995. [4] Yevgeniy Dodis, Shai Halevi, and Tal Rabin. A cryptographic solution to a game theoretic problem. In Lecture Notes in Computer Science, volume 1880, page 112. Springer, Berlin, 2000. [5] Michael L. Littman and Peter Stone. A polynomial-time Nash equilibrium algorithm for repeated games. In ACM Conference on Electronic Commerce, pages 48–54. ACM, 2003. [6] E. Hansen, D. Bernstein, and S. Zilberstein. 
6 CONCLUSION

Real-world planning problems involve negotiation among multiple agents with varying goals. To take all agents' incentives into account, the agents should find and agree on Pareto-dominant subgame-perfect Nash equilibria. For this purpose, we presented efficient planning and negotiation algorithms for general-sum stochastic games, and tested them on two robotic planning problems.

References

[1] V. Conitzer and T. Sandholm. Complexity results about Nash equilibria. Technical Report CMU-CS-02-135, School of Computer Science, Carnegie Mellon University, 2002.
[2] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Massachusetts, 1995.
[3] Prajit K. Dutta. A folk theorem for stochastic games. Journal of Economic Theory, 66:1–32, 1995.
[4] Yevgeniy Dodis, Shai Halevi, and Tal Rabin. A cryptographic solution to a game theoretic problem. In Lecture Notes in Computer Science, volume 1880, page 112. Springer, Berlin, 2000.
[5] Michael L. Littman and Peter Stone. A polynomial-time Nash equilibrium algorithm for repeated games. In ACM Conference on Electronic Commerce, pages 48–54. ACM, 2003.
[6] E. Hansen, D. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. In Proceedings of the Nineteenth National Conference on Artificial Intelligence, pages 709–715, 2004.
[7] Ulrich Doraszelski and Kenneth L. Judd. Avoiding the curse of dimensionality in dynamic stochastic games. NBER Technical Working Paper No. 304, January 2005.
[8] R. Brafman and M. Tennenholtz. Efficient learning equilibrium. Artificial Intelligence, 2004.
[9] D. Fudenberg and E. Maskin. The folk theorem in repeated games with discounting or with incomplete information. Econometrica, 1986.
[10] David Levine. The castle on the hill. Review of Economic Dynamics, 3(2):330–337, 2000.
[11] Ariel Rubinstein. Perfect equilibrium in a bargaining model. Econometrica, 50(1):97–109, 1982.
[12] V. Krishna and R. Serrano. Multilateral bargaining. Review of Economic Studies, 1996.
[13] Chris Murray and Geoffrey J. Gordon. Multi-robot negotiation: approximating the set of subgame perfect equilibria in general-sum stochastic games. Technical Report CMU-ML-06114, Carnegie Mellon University, 2006.

4 0.76055545 196 nips-2006-TrueSkill™: A Bayesian Skill Rating System

Author: Ralf Herbrich, Tom Minka, Thore Graepel

Abstract: unknown-abstract

5 0.59786057 81 nips-2006-Game Theoretic Algorithms for Protein-DNA binding

Author: Luis Pérez-breva, Luis E. Ortiz, Chen-hsiang Yeang, Tommi S. Jaakkola

Abstract: We develop and analyze game-theoretic algorithms for predicting coordinate binding of multiple DNA binding regulators. The allocation of proteins to local neighborhoods and to sites is carried out with resource constraints while explicating competing and coordinate binding relations among proteins with affinity to the site or region. The focus of this paper is on mathematical foundations of the approach. We also briefly demonstrate the approach in the context of the λ-phage switch. 1

6 0.4967519 155 nips-2006-Optimal Single-Class Classification Strategies

7 0.38378981 146 nips-2006-No-regret Algorithms for Online Convex Programs

8 0.35358003 61 nips-2006-Convex Repeated Games and Fenchel Duality

9 0.32876411 14 nips-2006-A Small World Threshold for Economic Network Formation

10 0.22771437 41 nips-2006-Bayesian Ensemble Learning

11 0.20900331 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation

12 0.20866957 113 nips-2006-Learning Structural Equation Models for fMRI

13 0.20568867 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation

14 0.20305221 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis

15 0.20269422 190 nips-2006-The Neurodynamics of Belief Propagation on Binary Markov Random Fields

16 0.20168836 88 nips-2006-Greedy Layer-Wise Training of Deep Networks

17 0.19874886 18 nips-2006-A selective attention multi--chip system with dynamic synapses and spiking neurons

18 0.19656695 129 nips-2006-Map-Reduce for Machine Learning on Multicore

19 0.19083807 35 nips-2006-Approximate inference using planar graph decomposition

20 0.1896493 43 nips-2006-Bayesian Model Scoring in Markov Random Fields


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.061), (3, 0.03), (7, 0.087), (9, 0.053), (22, 0.044), (44, 0.055), (52, 0.397), (57, 0.06), (65, 0.038), (69, 0.038), (81, 0.032), (90, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77241802 13 nips-2006-A Scalable Machine Learning Approach to Go

Author: Lin Wu, Pierre F. Baldi

Abstract: Go is an ancient board game that poses unique opportunities and challenges for AI and machine learning. Here we develop a machine learning approach to Go, and related board games, focusing primarily on the problem of learning a good evaluation function in a scalable way. Scalability is essential at multiple levels, from the library of local tactical patterns, to the integration of patterns across the board, to the size of the board itself. The system we propose is capable of automatically learning the propensity of local patterns from a library of games. Propensity and other local tactical information are fed into a recursive neural network, derived from a Bayesian network architecture. The network integrates local information across the board and produces local outputs that represent local territory ownership probabilities. The aggregation of these probabilities provides an effective strategic evaluation function that is an estimate of the expected area at the end (or at other stages) of the game. Local area targets for training can be derived from datasets of human games. A system trained using only 9 × 9 amateur game data performs surprisingly well on a test set derived from 19 × 19 professional game data. Possible directions for further improvements are briefly discussed. 1

2 0.64180887 105 nips-2006-Large Margin Component Analysis

Author: Lorenzo Torresani, Kuang-chih Lee

Abstract: Metric learning has been shown to significantly improve the accuracy of k-nearest neighbor (kNN) classification. In problems involving thousands of features, distance learning algorithms cannot be used due to overfitting and high computational complexity. In such cases, previous work has relied on a two-step solution: first apply dimensionality reduction methods to the data, and then learn a metric in the resulting low-dimensional subspace. In this paper we show that better classification performance can be achieved by unifying the objectives of dimensionality reduction and metric learning. We propose a method that solves for the low-dimensional projection of the inputs, which minimizes a metric objective aimed at separating points in different classes by a large margin. This projection is defined by a significantly smaller number of parameters than metrics learned in input space, and thus our optimization reduces the risks of overfitting. Theory and results are presented for both a linear as well as a kernelized version of the algorithm. Overall, we achieve classification rates similar, and in several cases superior, to those of support vector machines. 1

3 0.63009244 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

Author: Mark Girolami, Mingjun Zhong

Abstract: By adopting Gaussian process priors a fully Bayesian solution to the problem of integrating possibly heterogeneous data sets within a classification setting is presented. Approximate inference schemes employing Variational & Expectation Propagation based methods are developed and rigorously assessed. We demonstrate our approach to integrating multiple data sets on a large scale protein fold prediction problem where we infer the optimal combinations of covariance functions and achieve state-of-the-art performance without resorting to any ad hoc parameter tuning and classifier combination. 1

4 0.36563608 80 nips-2006-Fundamental Limitations of Spectral Clustering

Author: Boaz Nadler, Meirav Galun

Abstract: Spectral clustering methods are common graph-based approaches to clustering of data. Spectral clustering algorithms typically start from local information encoded in a weighted graph on the data and cluster according to the global eigenvectors of the corresponding (normalized) similarity matrix. One contribution of this paper is to present fundamental limitations of this general local to global approach. We show that based only on local information, the normalized cut functional is not a suitable measure for the quality of clustering. Further, even with a suitable similarity measure, we show that the first few eigenvectors of such adjacency matrices cannot successfully cluster datasets that contain structures at different scales of size and density. Based on these findings, a second contribution of this paper is a novel diffusion based measure to evaluate the coherence of individual clusters. Our measure can be used in conjunction with any bottom-up graph-based clustering method, it is scale-free and can determine coherent clusters at all scales. We present both synthetic examples and real image segmentation problems where various spectral clustering algorithms fail. In contrast, using this coherence measure finds the expected clusters at all scales. Keywords: Clustering, kernels, learning theory. 1

5 0.35635161 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

Author: Gloria Haro, Gregory Randall, Guillermo Sapiro

Abstract: The study of point cloud data sampled from a stratification, a collection of manifolds with possibly different dimensions, is pursued in this paper. We present a technique for simultaneously soft clustering and estimating the mixed dimensionality and density of such structures. The framework is based on a maximum likelihood estimation of a Poisson mixture model. The presentation of the approach is completed with artificial and real examples demonstrating the importance of extending manifold learning to stratification learning. 1

6 0.35517374 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions

7 0.35435706 158 nips-2006-PG-means: learning the number of clusters in data

8 0.35310745 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

9 0.35127911 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

10 0.35127497 167 nips-2006-Recursive ICA

11 0.35040769 117 nips-2006-Learning on Graph with Laplacian Regularization

12 0.34876433 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints

13 0.34867343 190 nips-2006-The Neurodynamics of Belief Propagation on Binary Markov Random Fields

14 0.34805816 34 nips-2006-Approximate Correspondences in High Dimensions

15 0.34762207 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

16 0.34692991 65 nips-2006-Denoising and Dimension Reduction in Feature Space

17 0.3468293 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

18 0.34679514 128 nips-2006-Manifold Denoising

19 0.34667218 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

20 0.34591714 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation