nips nips2006 nips2006-81 knowledge-graph by maker-knowledge-mining

81 nips-2006-Game Theoretic Algorithms for Protein-DNA binding


Source: pdf

Author: Luis Pérez-breva, Luis E. Ortiz, Chen-hsiang Yeang, Tommi S. Jaakkola

Abstract: We develop and analyze game-theoretic algorithms for predicting coordinate binding of multiple DNA binding regulators. The allocation of proteins to local neighborhoods and to sites is carried out with resource constraints while explicating competing and coordinate binding relations among proteins with affinity to the site or region. The focus of this paper is on mathematical foundations of the approach. We also briefly demonstrate the approach in the context of the λ-phage switch.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Game theoretic algorithms for Protein-DNA binding. [sent-1, score-0.498]

2 We develop and analyze game-theoretic algorithms for predicting coordinate binding of multiple DNA binding regulators. [sent-10, score-0.862]

3 The allocation of proteins to local neighborhoods and to sites is carried out with resource constraints while explicating competing and coordinate binding relations among proteins with affinity to the site or region. [sent-11, score-1.578]

4 1 Introduction Transcriptional control relies in part on coordinate operation of DNA binding regulators and their interactions with various co-factors. [sent-14, score-0.472]

5 We believe game theory and economic models provide an appropriate modeling framework for understanding interacting regulatory processes. [sent-15, score-0.304]

6 In particular, the problem of understanding coordinate binding of regulatory proteins has many game theoretic properties. [sent-16, score-1.028]

7 At low nuclear concentrations, regulatory proteins may occupy only high affinity sites, while filling weaker sites with increasing concentration. [sent-18, score-0.674]

8 Overlapping or close binding sites create explicit competition for the sites, the resolution of which is guided by the available concentrations around the binding sites. [sent-19, score-1.29]

9 Similarly, explicit coordination such as formation of larger protein complexes may be required for binding or, alternatively, binding may be facilitated by the presence of another protein. [sent-20, score-1.153]

10 The key advantage of games as models of binding is that they can provide causally meaningful predictions (binding arrangements) in response to various experimental perturbations or disruptions. [sent-21, score-0.451]

11 2 Protein-DNA binding We decompose the binding problem into transport and local binding. [sent-27, score-1.011]

12 By transport, we refer to the mechanism that transports proteins to the neighborhood of sites to which they have affinity. [sent-28, score-0.548]

13 We abstract the process initially by assuming separate affinities for proteins to explore neighborhoods of specific sites, modulated by whether the sites are available. [sent-30, score-0.562]

14 We aim to capture the differentiated manner in which proteins may accumulate in the neighborhoods of sites depending on the overall nuclear concentrations and regardless of the time involved. [sent-32, score-0.778]

15 Local binding, on the other hand, captures which proteins bind to each site as a consequence of local accumulations or concentrations around the site or a larger region. [sent-33, score-0.912]

16 In a steady state, the local environment of the site is assumed to be closed and well-mixed. [sent-34, score-0.286]

17 Broadly speaking, the combination of transport and local binding results in an arrangement of proteins along the possible DNA binding sites. [sent-37, score-1.237]

18 The predictions should be viewed as functions of the overall (nuclear) concentrations of proteins, the affinities of proteins to explore neighborhoods of individual sites, as well as the equilibrium constants characterizing the ability of proteins to bind to specific sites when in close proximity. [sent-39, score-1.265]

19 3 Game Theoretic formulation There are two types of players in our game, proteins and sites. [sent-41, score-0.286]

20 A protein-player refers to a type of protein, not an individual protein, and decides how its nuclear concentration is allocated to the proximity of sites (transport process). [sent-42, score-0.535]

21 In other words, their allocations are based on the transport affinities and the availability of sites rather than through some negotiation process involving multiple proteins. [sent-44, score-0.488]

22 The non-cooperative nature of the protein allocations does not, however, preclude the formation of protein complexes or binding facilitated by other proteins. [sent-45, score-1.033]

23 Each possible binding site is associated with a site-player. [sent-47, score-0.697]

24 Site-players choose the fraction of time (or fraction of cells in a population) a specific type of protein is bound to the site. [sent-48, score-0.357]

25 The strategies of the site-players are guided by local chemical equilibria. [sent-50, score-0.284]

26 Indeed, the site-players are introduced merely to reproduce this physical understanding of the binding process in a game theoretic context. [sent-51, score-0.779]

27 The binding game has no global objective function that serves to guide how the players choose their strategies. [sent-53, score-0.738]

28 For example, the protein-player allocates its nuclear concentration to the proximity of the sites based on how occupied the sites are, i. [sent-55, score-0.776]

29 Similarly, the site-players reproduce the chemical equilibrium at the sites on the basis of the available local protein concentrations, i. [sent-58, score-0.982]

30 The predictions we can make based on the game theoretic formulation are equilibria of the game (not to be confused with the local chemical equilibria at the sites). [sent-61, score-0.968]

31 At an equilibrium, no reallocation of proteins to sites is required and, conversely, the sites have reproduced the local chemical equilibria based on the current allocations of proteins. [sent-62, score-1.15]

32 While games need not have equilibria in pure strategies (actions available to the players), our game will always have one. [sent-63, score-0.457]

33 4 The binding game To specify the game more formally we proceed to define players’ strategies, their utilities, and the notion of an equilibrium of the game. [sent-64, score-1.178]

34 To this end, let f i represent the (nuclear) concentration of protein i. [sent-65, score-0.284]

35 This is the amount of protein available to be allocated to the neighborhoods of sites. [sent-66, score-0.378]

36 The fraction of protein i allocated to site j is specified by $p^i_j$, where $\sum_j p^i_j = 1$. [sent-67, score-1.018]

37 The numerical values of $p^i_j$, where j ranges over the possible sites, define a possible strategy for the ith protein player. [sent-68, score-0.497]

38 The choices of which strategies to play are guided by parameters Eij , the affinity of protein i to explore the neighborhood of site j (we will generally index proteins with i and sites with j). [sent-70, score-1.192]

39 The utility for protein i, defined below, provides a numerical ranking of possible strategy choices and is parameterized by Eij . [sent-71, score-0.356]

40 The strategy for site-player j specifies the fraction of time that each type of protein is actually bound to the site. [sent-73, score-0.352]

41 The strategy is denoted by $s^j_i$, where i ranges over proteins with affinity to the site. [sent-74, score-0.506]

42 Note that the values of $s^j_i$ are in principle observable from binding assays (cf. [sent-75, score-0.648]

43 $\sum_i s^j_i \le 1$ since there is only one site and it may remain empty part of the time. [sent-77, score-0.483]

44 We will also use $\alpha_j = \sum_i s^j_i$ to denote how occupied the site is. [sent-81, score-0.519]

45 The utilities of the site players will depend on Kij , the chemical equilibrium constants characterizing the local binding reaction between protein i and site j. [sent-82, score-1.802]

46 Utilities The utility function for protein-player i is formally defined as $u_i(p^i, s) \equiv \sum_j p^i_j E_{ij}\bigl(1 - \sum_{i'} s^j_{i'}\bigr) + \beta H(p^i)$ (1), where $H(p^i) = -\sum_j p^i_j \log p^i_j$ is the Shannon entropy of the strategy $p^i$ and j ranges over possible sites. [sent-83, score-1.124]
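
To make the utility concrete, here is a minimal Python sketch of Eq. (1); the function name and the array layout (rows indexed by proteins, columns by sites) are illustrative assumptions, not the paper's code.

```python
import numpy as np

def protein_utility(p_i, E_i, s, beta):
    """Sketch of the protein-player utility in Eq. (1):
    u_i(p^i, s) = sum_j p^i_j * E_ij * (1 - sum_i' s^j_i') + beta * H(p^i).

    p_i  : length-m vector, fraction of protein i allocated to each site j (sums to 1)
    E_i  : length-m vector of exploration affinities E_ij
    s    : (num_proteins x m) matrix of bound fractions s^j_i
    beta : entropy weight
    """
    occupancy = s.sum(axis=0)                     # sum_i' s^j_i' for every site j
    entropy = -np.sum(p_i * np.log(p_i + 1e-12))  # Shannon entropy H(p^i)
    return float(np.sum(p_i * E_i * (1.0 - occupancy)) + beta * entropy)
```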

47 The utility of the protein-player essentially states that protein i “prefers” to be around sites that are unbound and for which it has high affinity. [sent-84, score-0.615]

48 The parameter β ≥ 0 balances how much protein allocations are guided by the differentiated process, characterized by the exploration affinities Eij , as opposed to allocated uniformly (maximizing the entropy function). [sent-85, score-0.461]

49 Note that since the utility depends on the strategies of site-players through $(1 - \sum_i s^j_i)$, one cannot find the equilibrium strategy for proteins by considering $s^j_i$ to be fixed; the sites will respond to any $p^i_j$ chosen by the protein-player. [sent-87, score-1.597]

50 As discussed earlier, the site-players always reproduce the chemical equilibrium between the site and the protein species allocated to the neighborhood of the site. [sent-88, score-1.088]

51 The equilibrium equation holds for all protein species around the site and for the same strategy $\{s^j_i\}$ of the site-player. [sent-90, score-0.84]

52 The utility function that reproduces this chemical equilibrium when maximized over possible strategies is given by $v_j(s^j, p) \equiv \sum_i \bigl[\, s^j_i - K_{ij}(p^i_j f^i - s^j_i)\bigl(1 - \sum_{i'} s^j_{i'}\bigr) \bigr]$ (3), subject to $s^j_i \le K_{ij}(p^i_j f^i - s^j_i)(1 - \sum_{i'} s^j_{i'})$, $s^j_i \le p^i_j f^i$, and $\sum_i s^j_i \le 1$. [sent-94, score-2.502]

53 These constraints guarantee that the utility is always non-positive and zero exactly when the chemical equilibrium holds. [sent-95, score-0.459]

54 $s^j_i \le p^i_j f^i$ ensures that we cannot have more protein bound than is allocated to the proximity of the site. [sent-96, score-0.805]
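
A small sketch of how a site-player's strategy could be computed numerically: at a fixed allocation p, the bound fractions must satisfy $s^j_i = K_{ij}(p^i_j f^i - s^j_i)(1 - \sum_{i'} s^j_{i'})$. The damped fixed-point iteration below is an illustrative way to solve this system for one site; it is not the construction used in the paper.

```python
import numpy as np

def site_equilibrium(K_j, local_conc, n_iter=200):
    """Solve the local chemical equilibrium at one site j (sketch):
        s_i = K_ij * (c_i - s_i) * (1 - sum_i' s_i'),
    where c_i = p^i_j * f^i is the amount of protein i allocated near the site.

    K_j        : length-n vector of equilibrium constants K_ij for this site
    local_conc : length-n vector of local concentrations c_i
    """
    K_j = np.asarray(K_j, dtype=float)
    c = np.asarray(local_conc, dtype=float)
    s = np.zeros_like(c)
    for _ in range(n_iter):
        free_site = max(0.0, 1.0 - s.sum())          # fraction of time the site is empty
        # holding the total occupancy fixed, the condition gives s_i in closed form
        target = K_j * c * free_site / (1.0 + K_j * free_site)
        s = 0.5 * s + 0.5 * target                   # damping for stability
    return s
```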

55 1 The game and equilibria The protein-DNA binding game is now fully specified by the set of parameters {Eij /β}, {Kij } and {f i }, along with the utility functions {ui } and {vj } and the allocation constraints {P i } and {S j }. [sent-101, score-1.15]

56 In terms of our game theoretic model, this corresponds to what we call an equilibrium of the game. [sent-103, score-0.567]

57 Informally, an equilibrium of a game is a strategy for each player such that no individual has any incentive to unilaterally deviate from their strategy. [sent-104, score-0.536]

58 Formally, if the allocations $(\bar p, \bar s)$ are such that for each protein i and each site j, $\bar p^i \in \arg\max_{p^i \in P^i} u_i(p^i, \bar s)$ and $\bar s^j \in \arg\max_{s^j \in S^j(\bar p^j)} v_j(s^j, \bar p^j)$ (4), then we call $(\bar p, \bar s)$ an equilibrium of the protein-DNA binding game. [sent-105, score-1.953]

59 Put another way, at an equilibrium, the current strategies of the players must be among the strategies that maximize their utilities assuming the strategies of other players are held fixed. [sent-106, score-0.477]

60 Does the protein-DNA binding game always have an equilibrium? [sent-107, score-0.678]

61 While we have already stated this in the affirmative, we emphasize that there is no reason a priori to believe that there exists an equilibrium in the pure strategies, especially since the sets of possible strategies for the site-players are non-convex (cf. [sent-108, score-0.346]

62 At any such equilibrium of the game, all the protein species around each site are at a chemical equilibrium; that is, if $(\bar p, \bar s)$ is an equilibrium of the game, then for all sites j and proteins i, $\bar s^j_i$ and $\bar p^i_j$ satisfy (2). [sent-114, score-2.115]

63 Consequently, the site utilities $v_j(\bar s^j, \bar p^j)$ are all zero for the equilibrium strategies. [sent-115, score-0.645]

64 2 Computing equilibria The equilibria of the binding game represent predicted binding arrangements. [sent-117, score-1.343]

65 Our game has special structure and properties that permit us to find an equilibrium efficiently through a simple iterative algorithm. [sent-118, score-0.5]

66 The algorithm monotonically fills the sites up to the equilibrium levels, starting with all sites empty. [sent-119, score-0.863]

67 We begin by first expressing any joint equilibrium strategy of the game as a function of how filled the sites are, and reduce the problem of finding equilibria to finding fixed points of a monotone function. [sent-120, score-0.986]

68 To this end, let $\alpha_j = \sum_i s^j_i$ denote site j occupancy, the fraction of time it is bound by any protein. [sent-121, score-0.552]

69 Given the occupancies $\alpha = (\alpha_1, \ldots, \alpha_m)$ for all the m sites, we can readily obtain the maximizing strategies for proteins expressed as a function of site occupancies: $p^i_j(\alpha) \propto \exp(E_{ij}(1 - \alpha_j)/\beta)$, where the maximizing strategies are functions of $\alpha$. [sent-128, score-0.942]
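
The maximizing protein strategy is a softmax over sites; a short sketch follows, with illustrative names that are not from the paper.

```python
import numpy as np

def protein_allocation(E, alpha, beta):
    """Best-response allocation p^i_j(alpha) proportional to exp(E_ij * (1 - alpha_j) / beta).

    E     : (num_proteins x m) matrix of exploration affinities E_ij
    alpha : length-m vector of site occupancies alpha_j
    beta  : entropy weight; larger beta gives a more uniform allocation
    """
    logits = E * (1.0 - np.asarray(alpha)) / beta
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)                # each row sums to 1
```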

70 Similarly, at the equilibrium, each site-player achieves a local chemical equilibrium specified in (2). [sent-129, score-0.406]

71 Note that $s^j_i(\alpha)$ depends not only on how filled site j is but also on how occupied the other sites are, through $p^i_j(\alpha)$. [sent-131, score-1.001]

72 The algorithm Let α(t) denote the site occupancies at the tth iteration of the algorithm. [sent-140, score-0.343]
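
Putting the two best responses together, the fixed-point view described in the preceding sentences can be sketched as follows, starting from all sites empty. The helpers `protein_allocation` and `site_equilibrium` are the hypothetical sketches given earlier; this is a simplified reading of the algorithm, not the authors' implementation.

```python
import numpy as np

def find_equilibrium(E, K, f, beta, n_iter=500):
    """Iteratively fill the sites from empty toward an equilibrium (sketch).

    E, K : (num_proteins x m) matrices of affinities E_ij and constants K_ij
    f    : length-num_proteins vector of nuclear concentrations f^i
    """
    n, m = E.shape
    alpha = np.zeros(m)                              # alpha(0): all sites empty
    s = np.zeros((n, m))
    for _ in range(n_iter):
        p = protein_allocation(E, alpha, beta)       # proteins respond to occupancies
        local_conc = p * f[:, None]                  # p^i_j * f^i near each site
        for j in range(m):                           # each site reproduces its equilibrium
            s[:, j] = site_equilibrium(K[:, j], local_conc[:, j])
        alpha = s.sum(axis=0)                        # updated occupancies alpha(t+1)
        # starting from zero, these iterates fill the sites monotonically per the text
    return p, s, alpha
```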

73 3 The λ-phage binding game We use the well-known λ-phage viral infection [11, 1] to illustrate the game theoretic approach. [sent-178, score-1.04]

74 The components of the λ−switch are 1) two adjacent genes cI and Cro that encode cI2 and Cro proteins, respectively; 2) the promoter regions PRM and PR of these genes, and 3) an operator (OR ) with three binding sites OR 1, OR 2, and OR 3. [sent-180, score-0.893]

75 Since the presence of cI2 in either OR 1 or OR 3 blocks the access of RNA-polymerase to the promoter region PR , or PRM respectively, we can safely restrict ourselves to operator sites as the site-players. [sent-183, score-0.437]

76 Slightly higher concentrations of cI2 lead to binding at OR 2, which in turn facilitates RNA-polymerase to initiate transcription at PRM. [sent-186, score-0.569]

79 Figure 1: Predicted protein binding to sites OR 3, OR 2, and OR 1 for increasing amounts of cI2; panels (a) OR 3, (b) OR 2, (c) OR 1 plot the probability of binding against f_cI2/f_RNA-p. [sent-199, score-0.973]

80 Game parameters The game requires three sets of parameters: chemical equilibrium constants, affinities, and protein concentrations. [sent-203, score-0.88]

81 For a typical Escherichia coli (about 2 µm in length) at room temperature, the Gibbs free energies ∆G tabulated by [11] yield the equilibrium constants shown below; in addition, we set transport affinities in accordance with the qualitative description in [7, 8]. [sent-208, score-0.45]
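
The equilibrium constants can be obtained from tabulated binding free energies through the standard relation K = exp(−∆G/RT); a small sketch follows. Any further normalization to the dimensionless, per-cell units used in the game (based on the roughly 2 µm E. coli volume mentioned above) is a separate modeling step not shown here.

```python
import numpy as np

R_KCAL = 1.987e-3   # gas constant in kcal / (mol K)
T_ROOM = 298.0      # room temperature in K

def delta_g_to_k(delta_g_kcal_per_mol):
    """K = exp(-dG / (R T)) for binding free energies given in kcal/mol;
    the result is the association constant relative to the 1 M standard state."""
    dg = np.asarray(delta_g_kcal_per_mol, dtype=float)
    return np.exp(-dg / (R_KCAL * T_ROOM))

# e.g. a site at -12 kcal/mol binds orders of magnitude more tightly than one at -9 kcal/mol
print(delta_g_to_k([-12.0, -9.0]))
```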

82 Simulation Results The predictions from the game theoretic model exactly mirror the known behavior. [sent-225, score-0.334]

83 Figure 1 illustrates how the binding at different sites changes as a function of increasing fcI2 . [sent-227, score-0.726]
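
As a usage sketch, binding curves of this kind could be produced with the helpers above by sweeping the relative cI2 concentration and recording the predicted occupancies. The parameter values below are placeholders for illustration only, not the constants used in the paper.

```python
import numpy as np

# rows = {cI2, RNA-polymerase}; columns = {OR3, OR2, OR1}; placeholder values
E = np.array([[1.0, 1.0, 1.0],
              [1.0, 1.0, 1.0]])
K = np.array([[5.0, 20.0, 50.0],
              [10.0, 1.0, 1.0]])
beta = 1.0

ratios = np.logspace(-1, 3, 50)       # f_cI2 / f_RNA-p, as on the x-axis of Figure 1
curves = []
for r in ratios:
    f = np.array([r, 1.0])            # [f_cI2, f_RNA-p]
    _, s, _ = find_equilibrium(E, K, f, beta)
    curves.append(s[0].copy())        # fraction of time cI2 is bound at each operator site
curves = np.array(curves)             # shape (50, 3): one predicted binding curve per site
```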

84 Our model predicts that cI2 begins binding OR 1 at the same level as [1] predicts a decline in the transcription of Cro. [sent-236, score-0.536]

85 While consistent, we emphasize that the methods differ in their goals; stochastic simulation focuses on the dynamics of transcription while we study the strategic allocation of proteins as a function of their concentration. [sent-237, score-0.331]

86 4 A structured extension The game theoretic formulation of the binding problem described previously involves a transport mechanism that is specific to individual sites. [sent-239, score-0.874]

87 In other words, proteins are allocated to the proximity of sites based on parameters Eij and occupancies αj associated with individual sites. [sent-240, score-0.724]

88 We generalize the game further here by assuming that the transport mechanism has a coarser spatial structure, e. [sent-241, score-0.376]

89 In this extension the amount of protein allocated to any promoter is shared by the sites it contains. [sent-244, score-0.774]

90 Let R represent possible promoter regions each of which may be bound by multiple proteins (at distinct or overlapping sites). [sent-246, score-0.396]

91 Let $p^i = \{p^i_r\}_{r\in R}$ represent an allocation of protein i into these regions in a manner that is not specific to the possible sites within each promoter. [sent-247, score-0.764]

92 The utility for protein i is given by $u_i(p^i) = \sum_{r\in R} p^i_r E_{ir}(a_r) + \beta H(p^i)$, where N(r) is the set of possible binding sites within promoter region r and $a_r = \sum_{j\in N(r)} \alpha_j$ is the overall occupancy of the promoter (how many proteins are bound). [sent-248, score-1.908]

93 The protein utility is based on the assumption that the attraction to the promoter decreases based on the number of proteins already bound at the promoter. [sent-252, score-0.716]

94 The maximizing strategy for protein i, given $a_r = \sum_{j\in N(r)} \alpha_j$ for all r, is $p^i_r(a) \propto \exp(E_{ir}(a_r)/\beta)$, where $a = \{a_r\}_{r\in R}$. [sent-253, score-0.552]

95 Sites $j \in N(r)$ within a promoter region r reproduce the following chemical equilibrium: $s^j_i \,/\, \bigl[(f^i p^i_r(a) - \sum_{k\in N(r)} s^k_i)(1 - \alpha_j)\bigr] = K_{ij}$ for all proteins $i \in P$. [sent-254, score-1.192]
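
In the structured game the sites of a promoter region share the pool $f^i p^i_r$ allocated to the region. Below is a damped fixed-point sketch of this within-region equilibrium; it is an illustration under the stated condition, not the paper's solver.

```python
import numpy as np

def region_equilibrium(K_r, pool, n_iter=500):
    """Within one promoter region r, drive s toward the condition (sketch):
        s^j_i = K_ij * (pool_i - sum_{k in N(r)} s^k_i) * (1 - alpha_j).

    K_r  : (num_proteins x |N(r)|) equilibrium constants for the region's sites
    pool : length-num_proteins vector of allocated amounts f^i * p^i_r
    """
    K_r = np.asarray(K_r, dtype=float)
    pool = np.asarray(pool, dtype=float)
    s = np.zeros_like(K_r)
    for _ in range(n_iter):
        free_sites = np.maximum(0.0, 1.0 - s.sum(axis=0))   # 1 - alpha_j for each site
        remaining = np.maximum(0.0, pool - s.sum(axis=1))   # unbound protein left in the region
        target = K_r * remaining[:, None] * free_sites[None, :]
        s = 0.9 * s + 0.1 * target                          # small damped steps toward the condition
    return s
```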

96 The result guarantees that if we can solve the fixed point equations within each promoter then the overall occupancies $\{a_r\}_{r\in R}$ have the same monotonicity property as in the simpler version of the game where $a_r$ consisted of a single site. [sent-261, score-0.57]

97 5 Discussion We have presented a game theoretic approach to predicting protein arrangements along the DNA. [sent-272, score-0.595]

98 Even in the context of this well-known sub-system, however, few quantitative experimental results are available about binding (see the comparison). [sent-276, score-0.431]

99 Proper validation and use of our model therefore relies on estimating the game parameters from available protein-DNA binding data. [sent-277, score-0.678]

100 Diffusion-driven mechanisms of protein translocation on nucleic acids. [sent-307, score-0.266]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('binding', 0.431), ('sites', 0.295), ('site', 0.266), ('equilibrium', 0.253), ('game', 0.247), ('protein', 0.247), ('proteins', 0.226), ('sj', 0.217), ('gj', 0.205), ('pi', 0.187), ('kij', 0.164), ('promoter', 0.142), ('chemical', 0.133), ('transport', 0.129), ('equilibria', 0.117), ('concentrations', 0.095), ('strategies', 0.093), ('af', 0.09), ('allocated', 0.09), ('ar', 0.082), ('nities', 0.078), ('utilities', 0.078), ('nuclear', 0.077), ('occupancies', 0.077), ('eij', 0.077), ('utility', 0.073), ('theoretic', 0.067), ('allocations', 0.064), ('luis', 0.064), ('promoters', 0.064), ('players', 0.06), ('regulatory', 0.057), ('lysogeny', 0.051), ('dna', 0.044), ('transcription', 0.043), ('fraction', 0.041), ('neighborhoods', 0.041), ('bind', 0.039), ('coli', 0.039), ('cro', 0.039), ('eir', 0.039), ('resource', 0.038), ('monotone', 0.038), ('species', 0.038), ('occupancy', 0.038), ('guided', 0.038), ('concentration', 0.037), ('strategy', 0.036), ('occupied', 0.036), ('proximity', 0.036), ('allocation', 0.035), ('nity', 0.035), ('reproduce', 0.034), ('binds', 0.034), ('arrangements', 0.034), ('rna', 0.031), ('prm', 0.031), ('constants', 0.029), ('tommi', 0.029), ('bound', 0.028), ('ranges', 0.027), ('simulation', 0.027), ('neighborhood', 0.027), ('vj', 0.027), ('escherichia', 0.026), ('frn', 0.026), ('infection', 0.026), ('kik', 0.026), ('ortiz', 0.026), ('phage', 0.026), ('yeang', 0.026), ('switch', 0.025), ('genes', 0.025), ('gene', 0.023), ('ui', 0.023), ('complexes', 0.022), ('decline', 0.022), ('differentiated', 0.022), ('facilitated', 0.022), ('immaterial', 0.022), ('regulators', 0.022), ('viral', 0.022), ('lemma', 0.022), ('overall', 0.022), ('units', 0.022), ('free', 0.022), ('find', 0.021), ('pj', 0.021), ('successively', 0.02), ('transcriptional', 0.02), ('predicts', 0.02), ('predictions', 0.02), ('monotonically', 0.02), ('mark', 0.02), ('local', 0.02), ('characterizing', 0.019), ('mechanisms', 0.019), ('molecular', 0.019), ('interactions', 0.019), ('occupy', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 81 nips-2006-Game Theoretic Algorithms for Protein-DNA binding

Author: Luis Pérez-breva, Luis E. Ortiz, Chen-hsiang Yeang, Tommi S. Jaakkola

Abstract: We develop and analyze game-theoretic algorithms for predicting coordinate binding of multiple DNA binding regulators. The allocation of proteins to local neighborhoods and to sites is carried out with resource constraints while explicating competing and coordinate binding relations among proteins with affinity to the site or region. The focus of this paper is on mathematical foundations of the approach. We also briefly demonstrate the approach in the context of the λ-phage switch.

2 0.21781175 26 nips-2006-An Approach to Bounded Rationality

Author: Eli Ben-sasson, Ehud Kalai, Adam Kalai

Abstract: A central question in game theory and artificial intelligence is how a rational agent should behave in a complex environment, given that it cannot perform unbounded computations. We study strategic aspects of this question by formulating a simple model of a game with additional costs (computational or otherwise) for each strategy. First we connect this to zero-sum games, proving a counter-intuitive generalization of the classic min-max theorem to zero-sum games with the addition of strategy costs. We then show that potential games with strategy costs remain potential games. Both zero-sum and potential games with strategy costs maintain a very appealing property: simple learning dynamics converge to equilibrium.

3 0.16303067 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games

Author: Chris Murray, Geoffrey J. Gordon

Abstract: In real-world planning problems, we must reason not only about our own goals, but about the goals of other agents with which we may interact. Often these agents’ goals are neither completely aligned with our own nor directly opposed to them. Instead there are opportunities for cooperation: by joining forces, the agents can all achieve higher utility than they could separately. But, in order to cooperate, the agents must negotiate a mutually acceptable plan from among the many possible ones, and each agent must trust that the others will follow their parts of the deal. Research in multi-agent planning has often avoided the problem of making sure that all agents have an incentive to follow a proposed joint plan. On the other hand, while game theoretic algorithms handle incentives correctly, they often don’t scale to large planning problems. In this paper we attempt to bridge the gap between these two lines of research: we present an efficient game-theoretic approximate planning algorithm, along with a negotiation protocol which encourages agents to compute and agree on joint plans that are fair and optimal in a sense defined below. We demonstrate our algorithm and protocol on two simple robotic planning problems.1 1 INTRODUCTION We model the multi-agent planning problem as a general-sum stochastic game with cheap talk: the agents observe the state of the world, discuss their plans with each other, and then simultaneously select their actions. The state and actions determine a one-step reward for each player and a distribution over the world’s next state, and the process repeats. While talking allows the agents to coordinate their actions, it cannot by itself solve the problem of trust: the agents might lie or make false promises. So, we are interested in planning algorithms that find subgame-perfect Nash equilibria. In a subgame-perfect equilibrium, every deviation from the plan is deterred by the threat of a suitable punishment, and every threatened punishment is believable. To find these equilibria, planners must reason about their own and other agents’ incentives to deviate: if other agents have incentives to deviate then I can’t trust them, while if I have an incentive to deviate, they can’t trust me. In a given game there may be many subgame-perfect equilibria with widely differing payoffs: some will be better for some agents, and others will be better for other agents. It is generally not feasible to compute all equilibria [1], and even if it were, there would be no obvious way 1 We gratefully acknowledge help and comments from Ron Parr on this research. This work was supported in part by DARPA contracts HR0011-06-0023 (the CS2P program) and 55-00069 (the RADAR program). All opinions, conclusions, and errors are our own. to select one to implement. It does not make sense for the agents to select an equilibrium without consulting one another: there is no reason that agent A’s part of one joint plan would be compatible with agent B’s part of another joint plan. Instead the agents must negotiate, computing and proposing equilibria until they find one which is acceptable to all parties. 
This paper describes a planning algorithm and a negotiation protocol which work together to ensure that the agents compute and select a subgame-perfect Nash equilibrium which is both approximately Pareto-optimal (that is, its value to any single agent cannot be improved very much without lowering the value to another another agent) and approximately fair (that is, near the so-called Nash bargaining point). Neither the algorithm nor the protocol is guaranteed to work in all games; however, they are guaranteed correct when they are applicable, and applicability is easy to check. In addition, our experiments show that they work well in some realistic situations. Together, these properties of fairness, enforceability, and Pareto optimality form a strong solution concept for a stochastic game. The use of this definition is one characteristic that distinguishes our work from previous research: ours is the first efficient algorithm that we know of to use such a strong solution concept for stochastic games. Our planning algorithm performs dynamic programming on a set-based value function: for P players, at a state s, V ∈ V(s) ⊂ RP is an estimate of the value the players can achieve. We represent V(s) by sampling points on its convex hull. This representation is conservative, i.e., guarantees that we find a subset of the true V∗ (s). Based on the sampled points we can efficiently compute one-step backups by checking which joint actions are enforceable in an equilibrium. Our negotiation protocol is based on a multi-player version of Rubinstein’s bargaining game. Players together enumerate a set of equilibria, and then take turns proposing an equilibrium from the set. Until the players agree, the protocol ends with a small probability after each step and defaults to a low-payoff equilibrium; the fear of this outcome forces players to make reasonable offers. 2 2.1 BACKGROUND STOCHASTIC GAMES A stochastic game represents a multi-agent planning problem in the same way that a Markov Decision Process [2] represents a single-agent planning problem. As in an MDP, transitions in a stochastic game depend on the current state and action. Unlike MDPs, the current (joint) action is a vector of individual actions, one for each player. More formally, a generalsum stochastic game G is a tuple (S, sstart , P, A, T, R, γ). S is a set of states, and sstart ∈ S is the start state. P is the number of players. A = A1 ×A2 ×. . .×AP is the finite set of joint actions. We deal with fully observable stochastic games with perfect monitoring, where all players can observe previous joint actions. T : S × A → P (S) is the transition function, where P (S) is the set of probability distributions over S. R : S × A → RP is the reward function. We will write Rp (s, a) for the pth component of R(s, a). γ ∈ [0, 1) is the discount factor. Player p wants to maximize her discounted total value for the observed sequence ∞ of states and joint actions s1 , a1 , s2 , a2 , . . ., Vp = t=1 γ t−1 Rp (st , at ). A stationary policy for player p is a function πp : S → P (Ap ). A stationary joint policy is a vector of policies π = (π1 , . . . , πP ), one for each player. A nonstationary policy for player p is a function πp : (∪∞ (S × A)t × S) → P (Ap ) which takes a history of states and joint actions and t=0 produces a distribution over player p’s actions; we can define a nonstationary joint policy analogously. For any nonstationary joint policy, there is a stationary policy that achieves the same value at every state [3]. 
The value function V_p^π : S → R gives expected values for player p under joint policy π. The value vector at state s, V^π(s), is the vector with components V_p^π(s). (For a nonstationary policy π we will define V_p^π(s) to be the value if s were the start state, and V_p^π(h) to be the value after observing history h.) A vector V is feasible at state s if there is a π for which V^π(s) = V, and we will say that π achieves V. We will assume public randomization: the agents can sample from a desired joint action distribution in such a way that everyone can verify the outcome. If public randomization is not directly available, there are cryptographic protocols which can simulate it [4]. This assumption means that the set of feasible value vectors is convex, since we can roll a die at the first time step to choose from a set of feasible policies.

2.2 EQUILIBRIA

While optimal policies for MDPs can be determined exactly via various algorithms such as linear programming [2], it isn’t clear what it means to find an optimal policy for a general-sum stochastic game. So, rather than trying to determine a unique optimal policy, we will define a set of reasonable policies: the Pareto-dominant subgame-perfect Nash equilibria. A (possibly nonstationary) joint policy π is a Nash equilibrium if, for each individual player, no unilateral deviation from the policy would increase that player’s expected value for playing the game. Nash equilibria can contain incredible threats, that is, threats which the agents have no intention of following through on. To remove this possibility, we can define the subgame-perfect Nash equilibria. A policy π is a subgame-perfect Nash equilibrium if it is a Nash equilibrium in every possible subgame: that is, if there is no incentive for any player to deviate after observing any history of joint actions. Finally, consider two policies π and φ. If V_p^π(s_start) ≥ V_p^φ(s_start) for all players p, and if V_p^π(s_start) > V_p^φ(s_start) for at least one p, then we will say that π Pareto dominates φ. A policy which is not Pareto dominated by any other policy is Pareto optimal.

2.3 RELATED WORK

Littman and Stone [5] give an algorithm for finding Nash equilibria in two-player repeated games. Hansen et al. [6] show how to eliminate very-weakly-dominated strategies in partially observable stochastic games. Doraszelski and Judd [7] show how to compute Markov perfect equilibria in continuous-time stochastic games. The above papers use solution concepts much weaker than Pareto-dominant subgame-perfect equilibrium, and do not address negotiation and coordination. Perhaps the closest work to the current paper is by Brafman and Tennenholtz [8]: they present learning algorithms which, in repeated self-play, find Pareto-dominant (but not subgame-perfect) Nash equilibria in matrix and stochastic games. By contrast, we consider a single play of our game, but allow “cheap talk” beforehand. And, our protocol encourages arbitrary algorithms to agree on Pareto-dominant equilibria, while their result depends strongly on the self-play assumption.

2.3.1 FOLK THEOREMS

In any game, each player can guarantee herself an expected discounted value regardless of what actions the other players take. We call this value the safety value.
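As a small aside, the safety value of a player in a one-shot two-player game can be computed with a standard linear program; the sketch below (which is not from the paper, and uses an invented payoff matrix) maximizes the value she can guarantee herself against an adversarial opponent. In the discounted repeated setting the same quantity is discounted and summed over stages.

```python
# Compute player 1's safety value v = max_x min_j sum_i x_i * R[i, j] as a linear program.
import numpy as np
from scipy.optimize import linprog

R = np.array([[1.0, -1.0],      # player 1's payoff matrix; rows are her actions,
              [0.0,  0.5]])     # columns are the opponent's actions (made-up numbers)
m, n = R.shape

# Decision variables: the mixed strategy x_1..x_m and the guaranteed value v; maximize v.
c = np.zeros(m + 1); c[-1] = -1.0                    # linprog minimizes, so minimize -v
A_ub = np.hstack([-R.T, np.ones((n, 1))])            # for every column j: v - x . R[:, j] <= 0
b_ub = np.zeros(n)
A_eq = np.zeros((1, m + 1)); A_eq[0, :m] = 1.0       # x sums to one
b_eq = np.array([1.0])
bounds = [(0, None)] * m + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
x, v = res.x[:m], res.x[-1]
print("safety strategy:", x, "safety value:", v)
```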
Suppose that there is a stationary subgame-perfect equilibrium which achieves the safety value for both players; call this the safety equilibrium policy. Suppose that, in a repeated game, some stationary policy π is better for both players than the safety equilibrium policy. Then we can build a subgame-perfect equilibrium with the same payoff as π: start playing π, and if someone deviates, switch to the safety equilibrium policy. So long as γ is sufficiently large, no rational player will want to deviate. This is the folk theorem for repeated games: any feasible value vector which is strictly better than the safety values corresponds to a subgame-perfect Nash equilibrium [9]. (The proof is slightly more complicated if there is no safety equilibrium policy, but the theorem holds for any repeated game.)

There is also a folk theorem for general stochastic games [3]. This theorem, while useful, is not strong enough for our purposes: it only covers discount factors γ which are so close to 1 that the players don’t care which state they wind up in after a possible deviation. In most practical stochastic games, discount factors this high are unreasonably patient. When γ is significantly less than 1, the set of equilibrium vectors can change in strange ways as we change γ [10].

[Figure 1 (plot not reproduced; axes show value to player 1, horizontal, versus value to player 2, vertical): Equilibria of a Rubinstein game with γ = 0.8. Shaded area shows feasible value vectors (U_1(x), U_2(x)) for outcomes x. Right-hand circle corresponds to the equilibrium when player 1 moves first, left-hand circle when player 2 moves first. The Nash point is at 3.]

2.3.2 RUBINSTEIN’S GAME

Rubinstein [11] considered a game where two players divide a slice of pie. The first player offers a division x, 1 − x to the second; the second player either accepts the division, or refuses and offers her own division 1 − y, y. The game repeats until some player accepts an offer or until either player gives up. In the latter case neither player gets any pie. Rubinstein showed that if player p’s utility for receiving a fraction x at time t is U_p(x, t) = γ^t U_p(x) for a discount factor 0 ≤ γ < 1 and an appropriate time-independent utility function U_p(x) ≥ 0, then rational players will agree on a division near the so-called Nash bargaining point. This is the point which maximizes the product of the utilities that the players gain by cooperating, U_1(x) U_2(1 − x). As γ ↑ 1, the equilibrium will approach the Nash point. See Fig. 1 for an illustration. For three or more players, a similar result holds where agents take turns proposing multi-way divisions of the pie [12]. See the technical report [13] for more detail on the multi-player version of Rubinstein’s game and the Nash bargaining point.
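As a small illustration, the Nash bargaining point of the pie game can be located by brute force: pick the division x that maximizes the product of the players' cooperation gains. The utility functions below are invented for the example and are not taken from the paper.

```python
# Locate the Nash bargaining point argmax_x U1(x) * U2(1 - x) by grid search.
import numpy as np

U1 = lambda x: 3.0 * x          # player 1's (assumed) utility for receiving share x
U2 = lambda x: np.sqrt(x)       # player 2's (assumed) utility; concave, i.e. risk-averse

xs = np.linspace(0.0, 1.0, 10001)
products = U1(xs) * U2(1.0 - xs)
x_star = xs[np.argmax(products)]
print("Nash bargaining division:", x_star, "utilities:", U1(x_star), U2(1.0 - x_star))
```

With these particular utilities the product 3x * sqrt(1 - x) peaks at x = 2/3, so player 1 would take two thirds of the pie at the bargaining point.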
3 NEGOTIATION PROTOCOL

The Rubinstein game implicitly assumes that the result of a failure to cooperate is known to all players: nobody gets any pie. The multi-player version of the game assumes in addition that giving one player a share of the pie doesn’t force us to give a share to any other player. Neither of these properties holds for general stochastic games. They are, however, easy to check, and often hold or can be made to hold for planning domains of interest. So, we will assume that the players have agreed beforehand on a subgame-perfect equilibrium π^dis, called the disagreement policy, that they will follow in the event of a negotiation failure. In addition, for games with three or more players, we will assume that each player can unilaterally reduce her own utility by any desired amount without affecting other players’ utilities. Given these assumptions, our protocol proceeds in two phases (pseudocode is given in the technical report [13]). In the first phase agents compute subgame-perfect equilibria and take turns revealing them. On an agent’s turn she either reveals an equilibrium or passes; if all agents pass consecutively, the protocol proceeds to the second phase. When an agent states a policy π, the other agents verify that π is a subgame-perfect equilibrium and calculate its payoff vector V^π(s_start); players who state non-equilibrium policies miss their turn. At the end of the first phase, suppose the players have revealed a set Π of policies. Define

  X_p(π) = V_p^π(s_start) − V_p^dis(s_start)
  U = { u ≥ 0 | u ≤ v for some v ∈ convhull{ X(π) | π ∈ Π } }

where V^dis is the value function of π^dis, X_p(π) is the excess of policy π for player p, and U is the set of feasible excess vectors.

In the second phase, players take turns proposing points u ∈ U along with policies or mixtures of policies in Π that achieve them. After each proposal, all agents except the proposer decide whether to accept or reject. If everyone accepts, the proposal is implemented: everyone starts executing the agreed equilibrium. Otherwise, the players who accepted are removed from future negotiation and have their utilities fixed at the proposed levels. Fixing player p’s utility at u_p means that all future proposals must give p exactly u_p. To achieve this, a proposal may require p to voluntarily lower her own utility; this requirement is enforced by the threat that all players will revert to π^dis if p fails to act as required. Invalid proposals cause the proposer to lose her turn. If at some point the negotiation ends (which happens with probability ε after each step), all remaining players are assigned their disagreement values. The players execute the last proposed policy π (or π^dis if there has been no valid proposal), and any player p for whom V_p^π(s_start) is greater than her assigned utility u_p voluntarily lowers her utility to the correct level. (Again, failure to do so results in all players reverting to π^dis.)

Under the above protocol, players’ preferences are the same as in a Rubinstein game with utility set U: because we have assumed that negotiation ends with probability ε after each message, agreeing on u after t additional steps is exactly as good as agreeing on u(1 − ε)^t now. So with ε sufficiently small, the Rubinstein or Krishna-Serrano results show that rational players will agree on a vector u ∈ U which is close to the Nash point argmax_{u∈U} Π_p u_p.

4 COMPUTING EQUILIBRIA

In order to use the protocol of Sec. 3 for bargaining in a stochastic game, the players must be able to compute some subgame-perfect equilibria. Computing equilibria is a hard problem, so we cannot expect real agents to find the entire set of equilibria. Fortunately, each player will want to find the equilibria which are most advantageous to herself to influence the negotiation process in her favor. But equilibria which offer other players reasonably high reward have a higher chance of being accepted in negotiation. So, self-interest will naturally distribute the computational burden among all the players. In this section we describe an efficient dynamic-programming algorithm for computing equilibria. The algorithm takes some low-payoff equilibria as input and (usually) outputs higher-payoff equilibria. It is based on the intuition that we can use low-payoff equilibria as enforcement tools: by threatening to switch to an equilibrium that has low value to player p, we can deter p from deviating from a cooperative policy.
In more detail, we will assume that we are given P different equilibria π_1^pun, …, π_P^pun; we will use π_p^pun to punish player p if she deviates. We can set π_p^pun = π^dis for all p if π^dis is the only equilibrium we know; or, we can use any other equilibrium policies that we happen to have discovered. The algorithm will be most effective when the value of π_p^pun to player p is as low as possible in all states. We will then search for cooperative policies that we can enforce with the given threats π_p^pun. We will first present an algorithm which pretends that we can efficiently take direct sums and convex hulls of arbitrary sets. This algorithm is impractical, but finds all enforceable value vectors. We will then turn it into an approximate algorithm which uses finite data structures to represent the set-valued variables. As we allow more and more storage for each set, the approximate algorithm will approach the exact one; and in any case the result will be a set of equilibria which the agents can execute.

4.1 THE EXACT ALGORITHM

Our algorithm maintains a set of value vectors V(s) for each state s. It initializes V(s) to a set which we know contains the value vectors for all equilibrium policies. It then refines V by dynamic programming: it repeatedly attempts to improve the set of values at each state by backing up all of the joint actions, excluding joint actions from which some agent has an incentive to deviate. In more detail, we will compute V_p^dis(s) ≡ V_p^{π^dis}(s) for all s and p and use the vector V^dis(s) in our initialization. (Recall that we have defined V_p^π(s) for a nonstationary policy π as the value of π if s were the start state.) We also need the values of the punishment policies for their corresponding players, V_p^pun(s) ≡ V_p^{π_p^pun}(s) for all p and s. Given these values, define

  Q_p^dev(s, a) = R_p(s, a) + γ Σ_{s'∈S} T(s, a)(s') V_p^pun(s')    (1)

to be the value to player p of playing joint action a from state s and then following π_p^pun forever after. From the above Q^dev values we can compute player p’s value for deviating from an equilibrium which recommends action a in state s: it is Q_p^dev(s, a') for the best possible deviation a', since p will get the one-step payoff from the deviation but be punished by the rest of the players starting on the following time step. That is,

  V_p^dev(s, a) = max_{a'_p ∈ A_p} Q_p^dev(s, a_1 × … × a'_p × … × a_P)    (2)

V_p^dev(s, a) is the value we must achieve for player p in state s if we are planning to recommend action a and punish deviations with π_p^pun: if we do not achieve this value, player p would rather deviate and be punished.

Our algorithm is shown in Fig. 2:

  Initialization
    for s ∈ S
      V(s) ← { V | V_p^dis(s) ≤ V_p ≤ R_max/(1 − γ) }
    end
  Repeat until converged
    for iteration ← 1, 2, …
      for s ∈ S
        (compute the value vector set for each joint action, then throw away unenforceable vectors)
        for a ∈ A
          Q(s, a) ← { R(s, a) } + γ Σ_{s'∈S} T(s, a)(s') V(s')
          Q(s, a) ← { Q ∈ Q(s, a) | Q ≥ V^dev(s, a) }
        end
        (we can now randomize among joint actions)
        V(s) ← convhull ∪_a Q(s, a)
      end
    end

  Figure 2: Dynamic programming using exact operations on sets of value vectors.

After k iterations, each vector in V(s) corresponds to a k-step policy in which no agent ever has an incentive to deviate. In the (k+1)st iteration, the first assignment to Q(s, a) computes the value of performing action a followed by any k-step policy. The second assignment throws out the pairs (a, π) for which some agent would want to deviate from a given that the agents plan to follow π in the future.
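The exact algorithm manipulates convex sets of vectors, which cannot be stored directly; a common way to make it finite, and the one adopted below, is to keep only support points of each V(s) along a fixed set of witness directions. The following Python sketch shows the mechanics of that backup for two players. It is a rough simplification, not the paper's implementation: the game data, punishment values, number of witness directions, and stopping rule are all invented, and unenforceable support points are simply dropped rather than handled as in the full algorithm.

```python
# Simplified set-valued backup with witness directions for a tiny random 2-player game.
import numpy as np
import itertools

n_states, n_act, P, gamma = 3, 2, 2, 0.9
rng = np.random.default_rng(1)
joint = list(itertools.product(range(n_act), repeat=P))                  # joint actions (a1, a2)

T = {a: rng.dirichlet(np.ones(n_states), size=n_states) for a in joint}  # T[a][s] -> dist over s'
R = {a: rng.uniform(0.0, 1.0, size=(n_states, P)) for a in joint}        # R[a][s] -> payoff vector

V_pun = np.zeros((n_states, P))      # assumed punishment values: everyone gets 0 forever

def V_dev(s, a):
    """Deviation value (eq. 2): best one-shot gain for each player, then punishment."""
    out = np.zeros(P)
    for p in range(P):
        alts = [a[:p] + (b,) + a[p + 1:] for b in range(n_act)]
        out[p] = max(R[alt][s, p] + gamma * T[alt][s] @ V_pun[:, p] for alt in alts)
    return out

angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)                    # witness directions
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)

R_max = 1.0
V = np.full((n_states, len(W), P), R_max / (1 - gamma))                  # optimistic initialization
for _ in range(200):                                                     # crude fixed iteration count
    V_new = V.copy()
    for s in range(n_states):
        for k, w in enumerate(W):
            best, best_val = None, -np.inf
            for a in joint:
                # support point of {R(s,a)} + gamma * sum_s' T(s'|s,a) V(s') along direction w
                succ = np.array([V[t][np.argmax(V[t] @ w)] for t in range(n_states)])
                q = R[a][s] + gamma * T[a][s] @ succ
                if np.all(q >= V_dev(s, a)) and q @ w > best_val:        # enforceable and best along w
                    best, best_val = q, q @ w
            if best is not None:
                V_new[s, k] = best
    V = V_new

print(np.round(V[0], 2))   # candidate equilibrium value vectors at state 0, one per witness direction
```

With the zero punishment values assumed here the enforceability test rarely binds; in a real instance the punishment policies would be actual low-value equilibria, and the filter is what separates enforceable value vectors from wishful ones.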
The convex hull step in Fig. 2 accounts for the fact that, on reaching state s, we can select an action a and future policy π at random from the feasible pairs. (It is important for this randomization to occur after reaching state s, to avoid introducing incentives to deviate, and it is also important for the randomization to be public.) Proofs of convergence and correctness of the exact algorithm are in the technical report [13].

Of course, we cannot actually implement the algorithm of Fig. 2, since it requires variables whose values are convex sets of vectors. But, we can approximate V(s) by choosing a finite set of witness vectors W ⊂ R^P and storing V(s, w) = argmax_{v∈V(s)} (v·w) for each w ∈ W. V(s) is then approximated by the convex hull of {V(s, w) | w ∈ W}. If W samples the P-dimensional unit hypersphere densely enough, the maximum possible approximation error will be small. (In practice, each agent will probably want to pick W differently, to focus her computation on policies in the portion of the Pareto frontier where her own utility is relatively high.) As |W| increases, the error introduced at each step will go to zero. The approximate algorithm is given in more detail in the technical report [13].

[Figure 3 (execution-trace plots not reproduced): Execution traces for our motion planning example. Left and center: with 2 witness vectors, the agents randomize between two selfish paths. Right: with 4–32 witnesses, the agents find a cooperative path. Steps where either player gets a goal are marked with ×.]

[Figure 4 (plots not reproduced; right panel axes show value to player 1 versus value to player 2): Supply chain management problem. In the left figure, Player 1 is about to deliver part D to the shop, while player 2 is at the warehouse which sells B. The right figure shows the tradeoff between accuracy and computation time. The solid curve is the Pareto frontier for s_start, as computed using 8 witnesses per state. The dashed and dotted lines were computed using 2 and 4 witnesses, respectively. Dots indicate computed value vectors; × marks indicate the Nash points.]

5 EXPERIMENTS

We tested our value iteration algorithm and negotiation procedure on two robotic planning domains: a joint motion planning problem and a supply-chain management problem. In our motion planning problem (Fig. 3), two players together control a two-wheeled robot, with each player picking the rotational velocity for one wheel. Each player has a list of goal landmarks which she wants to cycle through, but the two players can have different lists of goals. We discretized states based on X, Y, θ and the current goals, and discretized actions into stop, slow (0.45 m/s), and fast (0.9 m/s), for 9 joint actions and about 25,000 states. We discretized time at ∆t = 1 s, and set γ = 0.99. For both the disagreement policy and all punishment policies, we used “always stop,” since by keeping her wheel stopped either player can prevent the robot from moving. Planning took a few hours of wall clock time on a desktop workstation for 32 witnesses per state. Based on the planner’s output, we ran our negotiation protocol to select an equilibrium. Fig. 3 shows the results: with limited computation the players pick two selfish paths and randomize equally between them, while with more computation they find the cooperative path.
Our experiments also showed that limiting the computation available to one player allows the unrestricted player to reveal only some of the equilibria she knows about, tilting the outcome of the negotiation in her favor (see the technical report [13] for details).

For our second experiment we examined a more realistic supply-chain problem. Here each player is a parts supplier competing for the business of an engine manufacturer. The manufacturer doesn’t store items and will only pay for parts which can be used immediately. Each player controls a truck which moves parts from warehouses to the assembly shop; she pays for parts when she picks them up, and receives payment on delivery. Each player gets parts from different locations at different prices and no one player can provide all of the parts the manufacturer needs. Each player’s truck can be at six locations along a line: four warehouse locations (each of which provides a different type of part), one empty location, and the assembly shop. Building an engine requires five parts, delivered in the order A, {B, C}, D, E (parts B and C can arrive in either order). After E, the manufacturer needs A again. Players can move left or right along the line at a small cost, or wait for free. They can also buy parts at a warehouse (dropping any previous cargo), or sell their cargo if they are at the shop and the manufacturer wants it. Each player can only carry one part at a time and only one player can make a delivery at a time. Finally, any player can retire and sell her truck; in this case the game ends and all players get the value of their truck plus any cargo. The disagreement policy is for all players to retire at all states.

Fig. 4 shows the computed sets V(s_start) for various numbers of witnesses. The more witnesses we use, the more accurately we represent the frontier, and the closer our final policy is to the true Nash point. All of the policies computed are “intelligent” and “cooperative”: a human observer would not see obvious ways to improve them, and in fact would say that they look similar despite their differing payoffs. Players coordinate their motions, so that one player will drive out to buy part E while the other delivers part D. They sit idle only in order to delay the purchase of a part which would otherwise be delivered too soon.

6 CONCLUSION

Real-world planning problems involve negotiation among multiple agents with varying goals. To take all agents’ incentives into account, the agents should find and agree on Pareto-dominant subgame-perfect Nash equilibria. For this purpose, we presented efficient planning and negotiation algorithms for general-sum stochastic games, and tested them on two robotic planning problems.

References

[1] V. Conitzer and T. Sandholm. Complexity results about Nash equilibria. Technical Report CMU-CS-02-135, School of Computer Science, Carnegie Mellon University, 2002.
[2] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Massachusetts, 1995.
[3] Prajit K. Dutta. A folk theorem for stochastic games. Journal of Economic Theory, 66:1–32, 1995.
[4] Yevgeniy Dodis, Shai Halevi, and Tal Rabin. A cryptographic solution to a game theoretic problem. In Lecture Notes in Computer Science, volume 1880, page 112. Springer, Berlin, 2000.
[5] Michael L. Littman and Peter Stone. A polynomial-time Nash equilibrium algorithm for repeated games. In ACM Conference on Electronic Commerce, pages 48–54. ACM, 2003.
[6] E. Hansen, D. Bernstein, and S. Zilberstein.
Dynamic programming for partially observable stochastic games. In Proceedings of the Nineteenth National Conference on Artificial Intelligence, pages 709–715, 2004.
[7] Ulrich Doraszelski and Kenneth L. Judd. Avoiding the curse of dimensionality in dynamic stochastic games. NBER Technical Working Paper No. 304, January 2005.
[8] R. Brafman and M. Tennenholtz. Efficient learning equilibrium. Artificial Intelligence, 2004.
[9] D. Fudenberg and E. Maskin. The folk theorem in repeated games with discounting or with incomplete information. Econometrica, 1986.
[10] David Levine. The castle on the hill. Review of Economic Dynamics, 3(2):330–337, 2000.
[11] Ariel Rubinstein. Perfect equilibrium in a bargaining model. Econometrica, 50(1):97–109, 1982.
[12] V. Krishna and R. Serrano. Multilateral bargaining. Review of Economic Studies, 1996.
[13] Chris Murray and Geoffrey J. Gordon. Multi-robot negotiation: approximating the set of subgame perfect equilibria in general-sum stochastic games. Technical Report CMU-ML-06-114, Carnegie Mellon University, 2006.

4 0.1368854 14 nips-2006-A Small World Threshold for Economic Network Formation

Author: Eyal Even-dar, Michael Kearns

Abstract: We introduce a game-theoretic model for network formation inspired by earlier stochastic models that mix localized and long-distance connectivity. In this model, players may purchase edges at distance d at a cost of dα , and wish to minimize the sum of their edge purchases and their average distance to other players. In this model, we show there is a striking “small world” threshold phenomenon: in two dimensions, if α < 2 then every Nash equilibrium results in a network of constant diameter (independent of network size), and if α > 2 then every Nash equilibrium results in a network whose diameter grows as a root of the network size, and thus is unbounded. We contrast our results with those of Kleinberg [8] in a stochastic model, and empirically investigate the “navigability” of equilibrium networks. Our theoretical results all generalize to higher dimensions. 1

5 0.12288638 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes

Author: Neil D. Lawrence, Guido Sanguinetti, Magnus Rattray

Abstract: Modelling the dynamics of transcriptional processes in the cell requires the knowledge of a number of key biological quantities. While some of them are relatively easy to measure, such as mRNA decay rates and mRNA abundance levels, it is still very hard to measure the active concentration levels of the transcription factor proteins that drive the process and the sensitivity of target genes to these concentrations. In this paper we show how these quantities for a given transcription factor can be inferred from gene expression levels of a set of known target genes. We treat the protein concentration as a latent function with a Gaussian process prior, and include the sensitivities, mRNA decay rates and baseline expression levels as hyperparameters. We apply this procedure to a human leukemia dataset, focusing on the tumour repressor p53 and obtaining results in good accordance with recent biological studies.

6 0.1184791 13 nips-2006-A Scalable Machine Learning Approach to Go

7 0.093705714 155 nips-2006-Optimal Single-Class Classification Strategies

8 0.085141875 29 nips-2006-An Information Theoretic Framework for Eukaryotic Gradient Sensing

9 0.073119715 108 nips-2006-Large Scale Hidden Semi-Markov SVMs

10 0.068135366 167 nips-2006-Recursive ICA

11 0.061640173 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering

12 0.052253969 61 nips-2006-Convex Repeated Games and Fenchel Duality

13 0.051459353 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

14 0.051175557 75 nips-2006-Efficient sparse coding algorithms

15 0.048711117 146 nips-2006-No-regret Algorithms for Online Convex Programs

16 0.04720097 50 nips-2006-Chained Boosting

17 0.045213465 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes

18 0.043614887 57 nips-2006-Conditional mean field

19 0.04307726 117 nips-2006-Learning on Graph with Laplacian Regularization

20 0.041805588 130 nips-2006-Max-margin classification of incomplete data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.123), (1, 0.006), (2, -0.11), (3, -0.111), (4, 0.115), (5, -0.142), (6, -0.103), (7, 0.222), (8, -0.186), (9, 0.098), (10, -0.104), (11, 0.111), (12, -0.013), (13, -0.057), (14, -0.093), (15, 0.028), (16, 0.018), (17, -0.08), (18, -0.001), (19, -0.067), (20, 0.003), (21, -0.012), (22, -0.062), (23, -0.155), (24, 0.068), (25, 0.002), (26, -0.014), (27, -0.057), (28, -0.006), (29, 0.044), (30, -0.042), (31, -0.134), (32, 0.035), (33, 0.059), (34, 0.118), (35, 0.02), (36, 0.094), (37, 0.024), (38, 0.123), (39, 0.135), (40, -0.013), (41, 0.013), (42, 0.055), (43, -0.023), (44, -0.013), (45, -0.03), (46, -0.003), (47, 0.044), (48, 0.045), (49, 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98409116 81 nips-2006-Game Theoretic Algorithms for Protein-DNA binding


2 0.69350463 26 nips-2006-An Approach to Bounded Rationality

Author: Eli Ben-sasson, Ehud Kalai, Adam Kalai

Abstract: A central question in game theory and artificial intelligence is how a rational agent should behave in a complex environment, given that it cannot perform unbounded computations. We study strategic aspects of this question by formulating a simple model of a game with additional costs (computational or otherwise) for each strategy. First we connect this to zero-sum games, proving a counter-intuitive generalization of the classic min-max theorem to zero-sum games with the addition of strategy costs. We then show that potential games with strategy costs remain potential games. Both zero-sum and potential games with strategy costs maintain a very appealing property: simple learning dynamics converge to equilibrium.

1 The Approach and Basic Model

How should an intelligent agent play a complicated game like chess, given that it does not have unlimited time to think? This question reflects one fundamental aspect of “bounded rationality,” a term coined by Herbert Simon [1]. However, bounded rationality has proven to be a slippery concept to formalize (prior work has focused largely on finite automata playing simple repeated games such as prisoner’s dilemma, e.g. [2, 3, 4, 5]). This paper focuses on the strategic aspects of decision-making in complex multi-agent environments, i.e., on how a player should choose among strategies of varying complexity, given that its opponents are making similar decisions. Our model applies to general strategic games and allows for a variety of complexities that arise in real-world applications. For this reason, it is applicable to one-shot games, to extensive games, and to repeated games, and it generalizes existing models such as repeated games played by finite automata.

To easily see that bounded rationality can drastically affect the outcome of a game, consider the following factoring game. Player 1 chooses an n-bit number and sends it to Player 2, who attempts to find its prime factorization. If Player 2 is correct, he is paid 1 by Player 1, otherwise he pays 1 to Player 1. Ignoring complexity costs, the game is a trivial win for Player 2. However, for large n, the game is essentially a win for Player 1, who can easily output a large random number that Player 2 cannot factor (under appropriate complexity assumptions). In general, the outcome of a game (even a zero-sum game like chess) with bounded rationality is not so clear.

To concretely model such games, we consider a set of available strategies along with strategy costs. Consider an example of two players preparing to play a computerized chess game for a $100K prize. Suppose the players simultaneously choose among two available options: to use a $10K program A or an advanced program B, which costs $50K. We refer to the row chooser as white and to the column chooser as black, with the corresponding advantages reflected by the win probabilities of white described in Table 1a. For example, when both players use program A, white wins 55% of the time and black wins 45% of the time (we ignore draws). The players naturally want to choose strategies to maximize their expected net payoffs, i.e., their expected payoff minus their cost. Each cell in Table 1b contains a pair of payoffs in units of thousands of dollars; the first is white’s net expected payoff and the second is black’s.

  a) White's winning probability:
                 black: A    black: B
     white: A      55%         13%
     white: B      93%         51%

  b) Net expected payoffs (white, black), in $K:
                 black: A (−10)    black: B (−50)
     white: A (−10)   45, 35            3, 37
     white: B (−50)   43, −3            1, −1

Figure 1: a) Table of first-player winning probabilities based on program choices.
b) Table of expected net earnings in thousands of dollars. The unique equilibrium is (A,B) which strongly favors the second player. A surprising property is evident in the above game. Everything about the game seems to favor white. Yet due to the (symmetric) costs, at the unique Nash equilibrium (A,B) of Table 1b, black wins 87% of the time and nets $34K more than white. In fact, it is a dominant strategy for white to play A and for black to play B. To see this, note that playing B increases white’s probability of winning by 38%, independent of what black chooses. Since the pot is $100K, this is worth $38K in expectation, but B costs $40K more than A. On the other hand, black enjoys a 42% increase in probability of winning due to B, independent of what white does, and hence is willing to pay the extra $40K. Before formulating the general model, we comment on some important aspects of the chess example. First, traditional game theory states that chess can be solved in “only” two rounds of elimination of dominated strategies [10], and the outcome with optimal play should always be the same: either a win for white or a win for black. This theoretical prediction fails in practice: in top play, the outcome is very nondeterministic with white winning roughly twice as often as black. The game is too large and complex to be solved by brute force. Second, we have been able to analyze the above chess program selection example exactly because we formulated as a game with a small number of available strategies per player. Another formulation that would fit into our model would be to include all strategies of chess, with some reasonable computational costs. However, it is beyond our means to analyze such a large game. Third, in the example above we used monetary software cost to illustrate a type of strategy cost. But the same analysis could accommodate many other types of costs that can be measured numerically and subtracted from the payoffs, such as time or effort involved in the development or execution of a strategy, and other resource costs. Additional examples in this paper include the number of states in a finite automaton, the number of gates in a circuit, and the number of turns on a commuter’s route. Our analysis is limited, however, to cost functions that depend only on the strategy of the player and not the strategy chosen by its opponent. For example, if our players above were renting computers A or B and paying for the time of actual usage, then the cost of using A would depend on the choice of computer made by the opponent. Generalizing the example above, we consider a normal form game with the addition of strategy costs, a player-dependent cost for playing each available strategy. Our main results regard two important classes of games: constant-sum and potential games. Potential games with strategy costs remain potential games. While two-person constant-sum games are no longer constant, we give a basic structural description of optimal play in these games. Lastly, we show that known learning dynamics converge in both classes of games. 2 Definition of strategy costs We first define an N -person normal-form game G = (N, S, p) consisting of finite sets of (available) pure strategies S = (S1 , . . . , SN ) for the N players, and a payoff function p : S1 × . . . × SN → RN . Players simultaneously choose strategies si ∈ Si after which player i is rewarded with pi (s1 , . . . , sN ). 
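As a quick numeric illustration of these definitions, the program-selection example from Section 1 can be written out and checked directly. The sketch below uses the numbers quoted in the text (win probabilities, a $100K pot, program prices of $10K and $50K); it is only a restatement of that example, not code from the paper.

```python
# Rebuild Table 1b from Table 1a and check the dominance argument in the text.
import numpy as np

pot = 100.0
win_prob = np.array([[0.55, 0.13],       # rows: white's program (A, B); columns: black's (A, B)
                     [0.93, 0.51]])
cost = {"A": 10.0, "B": 50.0}
names = ["A", "B"]

for i, w in enumerate(names):            # white's choice
    for j, b in enumerate(names):        # black's choice
        white_net = pot * win_prob[i, j] - cost[w]
        black_net = pot * (1 - win_prob[i, j]) - cost[b]
        print(f"white {w}, black {b}: net payoffs ({white_net:.0f}, {black_net:.0f})")

# Switching A -> B raises white's win probability by 38 points against either opponent
# choice, worth $38K < the extra $40K, so A dominates B for white; black gains 42 points,
# worth $42K > $40K, so B dominates A for black. The unique equilibrium is (A, B).
```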
A randomized or mixed strategy σ_i for player i is a probability distribution over its pure strategies S_i,

  σ_i ∈ ∆_i = { x ∈ R^{|S_i|} : Σ_j x_j = 1, x_j ≥ 0 }.

We extend p to ∆_1 × … × ∆_N in the natural way, i.e., p_i(σ_1, …, σ_N) = E[p_i(s_1, …, s_N)] where each s_i is drawn from σ_i, independently. Denote by s_{−i} = (s_1, s_2, …, s_{i−1}, s_{i+1}, …, s_N) and similarly for σ_{−i}. A best response by player i to σ_{−i} is σ_i ∈ ∆_i such that p_i(σ_i, σ_{−i}) = max_{σ'_i ∈ ∆_i} p_i(σ'_i, σ_{−i}). A (mixed strategy) Nash equilibrium of G is a vector of strategies (σ_1, …, σ_N) ∈ ∆_1 × … × ∆_N such that each σ_i is a best response to σ_{−i}.

We now define G^{−c}, the game G with strategy costs c = (c_1, …, c_N), where c_i : S_i → R. It is simply an N-person normal-form game G^{−c} = (N, S, p^{−c}) with the same sets of pure strategies as G, but with a new payoff function p^{−c} : S_1 × … × S_N → R^N where

  p_i^{−c}(s_1, …, s_N) = p_i(s_1, …, s_N) − c_i(s_i), for i = 1, …, N.

We similarly extend c_i to ∆_i in the natural way.

3 Two-person constant-sum games with strategy costs

Recall that a game is constant-sum (k-sum for short) if at every combination of individual strategies, the players’ payoffs sum to some constant k. Two-person k-sum games have some important properties, not shared by general-sum games, which result in more effective game-theoretic analysis. In particular, every k-sum game has a unique value v ∈ R. A mixed strategy for player 1 is called optimal if it guarantees payoff ≥ v against any strategy of player 2. A mixed strategy for player 2 is optimal if it guarantees ≥ k − v against any strategy of player 1. The term optimal is used because optimal strategies guarantee as much as possible (v + k − v = k) and playing anything that is not optimal can result in a lesser payoff, if the opponent responds appropriately. (This fact is easily illustrated in the game rock-paper-scissors – randomizing uniformly among the strategies guarantees each player 50% of the pot, while playing anything other than uniformly random enables the opponent to win strictly more often.) The existence of optimal strategies for both players follows from the min-max theorem. An easy corollary is that the Nash equilibria of a k-sum game are exchangeable: they are simply the cross-product of the sets of optimal mixed strategies for both players. Lastly, it is well-known that equilibria in two-person k-sum games can be learned in repeated play by simple dynamics that are guaranteed to converge [17].

With the addition of strategy costs, a k-sum game is no longer k-sum and hence it is not clear, at first, what optimal strategies there are, if any. (Many examples of general-sum games do not have optimal strategies.) We show the following generalization of the above properties for zero-sum games with strategy costs.

Theorem 1. Let G be a finite two-person k-sum game and G^{−c} be the game with strategy costs c = (c_1, c_2).
1. There is a value v ∈ R for G^{−c} and nonempty sets OPT1 and OPT2 of optimal mixed strategies for the two players. OPT1 is the set of strategies that guarantee player 1 payoff ≥ v − c_2(σ_2), against any strategy σ_2 chosen by player 2. Similarly, OPT2 is the set of strategies that guarantee player 2 payoff ≥ k − v − c_1(σ_1) against any σ_1.
2. The Nash equilibria of G^{−c} are exchangeable: the set of Nash equilibria is OPT1 × OPT2.
3. The set of net payoffs possible at equilibrium is an axis-parallel rectangle in R².
For zero-sum games, the term optimal strategy was natural: the players could guarantee v and k − v, respectively, and this is all that there was to share. Moreover, it is easy to see that only pairs of optimal strategies can have the Nash equilibria property, being best responses to each other. In the case of zero-sum games with strategy costs, the optimal structure is somewhat counterintuitive. First, it is strange that the amount guaranteed by either player depends on the cost of the other player’s action, when in reality each player pays the cost of its own action. Second, it is not even clear why we call these optimal strategies. To get a feel for this latter issue, notice that the sum of the net payoffs to the two players is always k − c1 (σ1 ) − c2 (σ2 ), which is exactly the total of what optimal strategies guarantee, v − c2 (σ2 ) + k − v − c1 (σ1 ). Hence, if both players play what we call optimal strategies, then neither player can improve and they are at Nash equilibrium. On the other hand, suppose player 1 selects a strategy σ1 that does not guarantee him payoff at least v − c2 (σ2 ). This means that there is some response σ2 by player 2 for which player 1’s payoff is < v − c2 (σ2 ) and hence player 2’s payoff is > k − v − c1 (σ1 ). Thus player 2’s best response to σ1 must give player 2 payoff > k − v − c1 (σ1 ) and leave player 1 with < v − c2 (σ2 ). The proof of the theorem (the above reasoning only implies part 2 from part 1) is based on the following simple observation. Consider the k-sum game H = (N, S, q) with the following payoffs: q1 (s1 , s2 ) = p1 (s1 , s2 ) − c1 (s1 ) + c2 (s2 ) = p−c (s1 , s2 ) + c2 (s2 ) 1 q2 (s1 , s2 ) = p2 (s1 , s2 ) − c2 (s1 ) + c1 (s1 ) = p−c (s1 , s2 ) + c1 (s1 ) 2 That is to say, Player 1 pays its strategy cost to Player 2 and vice versa. It is easy to verify that, ∀σ1 , σ1 ∈ ∆1 , σ2 ∈ ∆2 q1 (σ1 , σ2 ) − q1 (σ1 , σ2 ) = p−c (σ1 , σ2 ) − p−c (σ1 , σ2 ) 1 1 (1) This means that the relative advantage in switching strategies in games G−c and H are the same. In particular, σ1 is a best response to σ2 in G−c if and only if it is in H. A similar equality holds for player 2’s payoffs. Note that these conditions imply that the games G−c and H are strategically equivalent in the sense defined by Moulin and Vial [16]. Proof of Theorem 1. Let v be the value of the game H. For any strategy σ1 that guarantees player 1 payoff ≥ v in H, σ1 guarantees player 1 ≥ v − c2 (σ2 ) in G−c . This follows from the definition of H. Similarly, any strategy σ2 that guarantees player 2 payoff ≥ k − v in H will guarantee ≥ k − v − c1 (σ1 ) in G−c . Thus the sets OPT1 and OPT2 are non-empty. Since v − c2 (σ2 ) + k − v − c1 (σ1 ) = k − c1 (σ1 ) − c2 (σ2 ) is the sum of the payoffs in G−c , nothing greater can be guaranteed by either player. Since the best responses of G−c and H are the same, the Nash equilibria of the two games are the same. Since H is a k-sum game, its Nash equilibria are exchangeable, and thus we have part 2. (This holds for any game that is strategically equivalent to k-sum.) Finally, the optimal mixed strategies OPT1 , OPT2 of any k-sum game are convex sets. If we look at the achievable costs of the mixed strategies in OPTi , by the definition of the cost of a mixed strategy, this will be a convex subset of R, i.e., an interval. By parts 1 and 2, the set of achievable net payoffs at equilibria of G−c are therefore the cross-product of intervals. To illustrate Theorem 1 graphically, Figure 2 gives a 4 × 4 example with costs of 1, 2, 3, and 4, respectively. 
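Before walking through that example, the key observation behind the proof, that the side-payment game H (each player pays her own strategy cost to the other) has exactly the same switching gains as G^{-c} and hence the same best responses, is easy to check numerically. The sketch below uses invented payoffs and costs and simply verifies the relation on a random instance; it is an illustration, not part of the paper.

```python
# Check that player 1's gain from switching strategies is identical in G^{-c} and in H.
import numpy as np

rng = np.random.default_rng(0)
k = 10.0
p1 = rng.uniform(0, k, size=(3, 4)); p2 = k - p1                        # a k-sum game G
c1 = rng.uniform(0, 2, size=3);      c2 = rng.uniform(0, 2, size=4)     # strategy costs

g1 = p1 - c1[:, None]                      # player 1's net payoff in G^{-c}
h1 = p1 - c1[:, None] + c2[None, :]        # player 1's payoff in H (opponent's cost paid to her)

gain_g = g1[:, None, :] - g1[None, :, :]   # gain from switching row i' -> i, for every column
gain_h = h1[:, None, :] - h1[None, :, :]
print(np.allclose(gain_g, gain_h))         # True: the two games are strategically equivalent
```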
Figure 2 illustrates a situation with multiple optimal strategies. Notice that player 1 is completely indifferent between its optimal choices A and B, and player 2 is completely indifferent between C and D. Thus the only question is how kind they would like to be to their opponent. The (A,C) equilibrium is perhaps most natural as it yields the highest payoffs for both parties. Note that the proof of the above theorem actually shows that zero-sum games with costs share additional appealing properties of zero-sum games. For example, computing optimal strategies is a polynomial-time computation in an n × n game, as it amounts to computing the equilibria of H. We next show that they also have appealing learning properties, though they do not share all properties of zero-sum games.

[Footnote 1: One property that is violated by the chess example is the “advantage of an advantage” property. Say Player 1 has the advantage over Player 2 in a square game if p_1(s_1, s_2) ≥ p_2(s_2, s_1) for all strategies s_1, s_2. At equilibrium of a k-sum game, a player with the advantage must have a payoff at least as large as its opponent. This is no longer the case after incorporating strategy costs, as seen in the chess example, where Player 1 has the advantage (even including strategy costs), yet his equilibrium payoff is smaller than 2’s.]

  a) Payoffs in the 10-sum game G (player 1, player 2); rows are player 1's strategies, columns player 2's:
                 A          B          C          D
     A         6, 4       5, 5       3, 7       2, 8
     B         7, 3       6, 4       4, 6       3, 7
     C        7.5, 2.5   6.5, 3.5   4.5, 5.5   3.5, 6.5
     D        8.5, 1.5    7, 3      5.5, 4.5   4.5, 5.5

  b) Expected net earnings in G^{−c} with costs A(−1), B(−2), C(−3), D(−4):
                 A (−1)     B (−2)     C (−3)     D (−4)
     A (−1)     5, 3       4, 3       2, 4       1, 4
     B (−2)     5, 2       4, 2       2, 3       1, 3
     C (−3)    4.5, 1.5   3.5, 1.5   1.5, 2.5   0.5, 2.5
     D (−4)    4.5, 0.5    3, 1      1.5, 1.5   0.5, 1.5

  c) (Scatter plot not reproduced; axes show player 1's net payoff versus player 2's net payoff, with the pure-strategy payoff pairs labelled and the Nash-equilibrium value region shaded.)

Figure 2: a) Payoffs in 10-sum game G. b) Expected net earnings in G^{−c}. OPT1 is any mixture of A and B, and OPT2 is any mixture of C and D. Each player’s choice of equilibrium strategy affects only the opponent’s net payoff. c) A graphical display of the payoff pairs. The shaded region shows the rectangular set of payoffs achievable at mixed strategy Nash equilibria.

3.1 Learning in repeated two-person k-sum games with strategy costs

Another desirable property of k-sum games is that, in repeated play, natural learning dynamics converge to the set of Nash equilibria. Before we state the analogous conditions for k-sum games with costs, we briefly give a few definitions. A repeated game is one in which players choose a sequence of strategy vectors s^1, s^2, …, where each s^t = (s_1^t, …, s_N^t) is a strategy vector of some fixed stage game G = (N, S, p). Under perfect monitoring, when selecting an action in any period the players know all the previously selected actions. As we shall discuss, it is possible to learn to play without perfect monitoring as well.

Perhaps the most intuitive dynamics are best-response: at each stage, each player selects a best response to the opponent’s previous stage play. Unfortunately, these naive dynamics fail to converge to equilibrium in very simple examples. The fictitious play dynamics prescribe, at stage t, selecting any strategy that is a best response to the empirical distribution of opponent’s play during the first t − 1 stages. It has been shown that fictitious play converges to equilibrium (of the stage game G) in k-sum games [17]. However, fictitious play requires perfect monitoring. One can learn to play a two-person k-sum game with no knowledge of the payoff table or anything about the other player’s actions.
Using experimentation, the only observations required by each player are its own payoffs in each period (in addition to the number of available actions). So-called bandit algorithms [7] must manage the exploration-exploitation tradeoff. The proof of their convergence follows from the fact that they are no-regret algorithms. (No-regret algorithms date back to Hannan in the 1950’s [12], but his algorithms required perfect monitoring.) The regret of player i at stage T is defined to be

  regret of i at T = (1/T) max_{s_i ∈ S_i} Σ_{t=1}^T [ p_i(s_i, s_{−i}^t) − p_i(s_i^t, s_{−i}^t) ],

that is, how much better in hindsight player i could have done on the first T stages had it used one fixed strategy the whole time (and had the opponents not changed their strategies). Note that regret can be positive or negative. A no-regret algorithm is one in which each player’s asymptotic regret converges to (−∞, 0], i.e., is guaranteed to approach 0 or less. It is well-known that the no-regret condition in two-person k-sum games implies convergence to equilibrium (see, e.g., [13]). In particular, the pair of mixed strategies which are the empirical distributions of play over time approaches the set of Nash equilibria of the stage game. Inverse-polynomial rates of convergence (that are polynomial also in the size of the game) can be given for such algorithms. Hence no-regret algorithms provide arguably reasonable ways to play a k-sum game of moderate size. Note that in general-sum games, no such dynamics are known. Fortunately, the same algorithms that work for learning in k-sum games seem to work for learning in such games with strategy costs.

Theorem 2. Fictitious play converges to the set of Nash equilibria of the stage game in a two-person k-sum game with strategy costs, as do no-regret learning dynamics.

Proof. The proof again follows from equation (1) regarding the game H. Fictitious play dynamics are defined only in terms of best response play. Since G^{−c} and H share the same best responses, fictitious play dynamics are identical for the two games. Since they share the same equilibria and fictitious play converges to equilibria in H, it must converge in G^{−c} as well. For no-regret algorithms, equation (1) again implies that for any play sequence, the regret of each player i with respect to game G^{−c} is the same as its regret with respect to the game H. Hence, no regret in G^{−c} implies no regret in H. Since no-regret algorithms converge to the set of equilibria in k-sum games, they converge to the set of equilibria in H and therefore in G^{−c} as well.
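For concreteness, here is a small sketch of this regret bookkeeping, together with Hedge (exponential weights), one standard full-information no-regret algorithm, on an invented repeated matrix game. The payoff sequences and parameters are made up; the bandit setting discussed above, where only the realized payoff is observed, would need an algorithm such as Exp3 instead.

```python
# External regret of a play sequence, and the regret of a simple Hedge learner.
import numpy as np

rng = np.random.default_rng(0)
T, n_actions = 1000, 3
counterfactual = rng.uniform(0, 1, size=(T, n_actions))   # payoff each fixed action would earn at stage t
played = rng.integers(0, n_actions, size=T)                # actions an (arbitrary) player actually chose
realized = counterfactual[np.arange(T), played]

best_fixed = counterfactual.mean(axis=0).max()
print("average external regret of the random player:", best_fixed - realized.mean())

# Hedge: maintain weights over actions, play the normalized weights, update multiplicatively.
eta = np.sqrt(np.log(n_actions) / T)
weights = np.ones(n_actions)
hedge_total = 0.0
for t in range(T):
    probs = weights / weights.sum()
    hedge_total += probs @ counterfactual[t]               # expected payoff of the mixed play
    weights *= np.exp(eta * counterfactual[t])              # reward-based multiplicative update
print("average external regret of Hedge:", best_fixed - hedge_total / T)
```

On this random stream the Hedge regret is close to zero, matching the no-regret guarantee of order sqrt(log(n)/T) per stage.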
× SN that differ only in the ith coordinate, pi (s) − pi (s ) = Φ(s) − Φ(s ). (2) Potential games have appealing learning properties: simple better-reply dynamics converge to purestrategy Nash equilibria, as do the more sophisticated fictitious-play dynamics described earlier [15]. In our example, this means that if players change their individual paths so as to selfishly reduce the sum of congestions on their path, this will eventually lead to an equilibrium where no one can improve. (This is easy to see because Φ keeps increasing.) The absence of similar learning properties for general games presents a frustrating hole in learning and game theory. It is clear that the theoretically clean commuting example above misses some realistic considerations. One issue regarding complexity is that most commuters would not be willing to take a very complicated route just to save a short amount of time. To model this, we consider potential games with strategy costs. In our example, this would be a cost associated with every path. For example, suppose the graph represented streets in a given city. We consider a natural strategy complexity cost associated with a route π, say λ(#turns(π))2 , where there is a parameter λ ∈ R and #turns(π) is defined as the number of times that a commuter has to turn on a route. (To be more precise, say each edge in the graph is annotated with a street name, and a turn is defined to be a pair of consecutive edges in the graph with different street names.) Hence, a best response for player i would minimize: π min (total congestion of π) + λ(#turns(π))2 . from si to ti While adding strategy costs to potential games allows for much more flexibility in model design, one might worry that appealing properties of potential games, such as having pure strategy equilibria and easy learning dynamics, no longer hold. This is not the case. We show that strategic costs fit easily into the potential game framework: Theorem 3. For any potential game G and any cost functions c, G−c is also a potential game. Proof. Let Φ be a potential function for G. It is straightforward to verify that the G−c admits the following potential function Φ : Φ (s1 , . . . , sN ) = Φ(s1 , . . . , sN ) − c1 (s1 ) − . . . − cN (sN ). 5 Additional remarks Part of the reason that the notion of bounded rationality is so difficult to formalize is that understanding enormous games like chess is a daunting proposition. That is why we have narrowed it down to choosing among a small number of available programs. A game theorist might begin by examining the complete payoff table of Figure 1a, which is prohibitively large. Instead of considering only the choices of programs A and B, each player considers all possible chess strategies. In that sense, our payoff table in 1a would be viewed as a reduction of the “real” normal form game. A computer scientist, on the other hand, may consider it reasonable to begin with the existing strategies that one has access to. Regardless of how you view the process, it is clear that for practical purposes players in real life do simplify and analyze “smaller” sets of strategies. Even if the players consider the option of engineering new chess-playing software, this can be viewed as a third strategy in the game, with its own cost and expected payoffs. Again, when considering small number of available strategies, like the two programs above, it may still be difficult to assess the expected payoffs that result when (possibly randomized) strategies play against each other. 
An additional assumption made throughout the paper is that the players share the same assessments about these expected payoffs. Like other common-knowledge assumptions made in game theory, it would be desirable to weaken this assumption. In the special families of games studied in this paper, and perhaps in additional cases, learning algorithms may be employed to reach equilibrium without knowledge of payoffs. 5.1 Finite automata playing repeated games There has been a large body of interesting work on repeated games played by finite automata (see [14] for a survey). Much of this work is on achieving cooperation in the classic prisoner’s dilemma game (e.g., [2, 3, 4, 5]). Many of these models can be incorporated into the general model outlined in this paper. For example, to view the Abreu and Rubinstein model [6] as such, consider the normal form of an infinitely repeated game with discounting, but restricted to strategies that can be described by finite automata (the payoffs in every cell of the payoff table are the discounted sums of the infinite streams of payoffs obtained in the repeated game). Let the cost of a strategy be an increasing function of the number of states it employs. For Neyman’s model [3], consider the normal form of a finitely repeated game with a known number of repetitions. You may consider strategies in this normal form to be only ones with a bounded number of states, as required by Neyman, and assign zero cost to all strategies. Alternatively, you may allow all strategies but assign zero cost to ones that employ number of states below Neyman’s bounds, and an infinite cost to strategies that employ a number of states that exceeds Neyman’s bounds. The structure of equilibria proven in Theorem 1 applies to all the above models when dealing with repeated k-sum games, as in [2]. 6 Future work There are very interesting questions to answer about bounded rationality in truly large games that we did not touch upon. For example, consider the factoring game from the introduction. A pure strategy for Player 1 would be outputting a single n-bit number. A pure strategy for Player 2 would be any factoring program, described by a circuit that takes as input an n-bit number and attempts to output a representation of its prime factorization. The complexity of such a strategy would be an increasing function of the number of gates in the circuit. It would be interesting to make connections between asymptotic algorithm complexity and games. Another direction regards an elegant line of work on learning to play correlated equilibria by repeated play [11]. It would be natural to consider how strategy costs affect correlated equilibria. Finally, it would also be interesting to see how strategy costs affect the so-called “price of anarchy” [19] in congestion games. Acknowledgments This work was funded in part by a U.S. NSF grant SES-0527656, a Landau Fellowship supported by the Taub and Shalom Foundations, a European Community International Reintegration Grant, an Alon Fellowship, ISF grant 679/06, and BSF grant 2004092. Part of this work was done while the first and second authors were at the Toyota Technological Institute at Chicago. References [1] H. Simon. The sciences of the artificial. MIT Press, Cambridge, MA, 1969. [2] E. Ben-Porath. Repeated games with finite automata, Journal of Economic Theory 59: 17–32, 1993. [3] A. Neyman. Bounded Complexity Justifies Cooperation in the Finitely Repeated Prisoner’s Dilemma. Economic Letters, 19: 227–229, 1985. [4] A. Rubenstein. 
Finite automata play the repeated prisoner’s dilemma. Journal of Economic Theory, 39:83– 96, 1986. [5] C. Papadimitriou, M. Yannakakis: On complexity as bounded rationality. In Proceedings of the TwentySixth Annual ACM Symposium on Theory of Computing, pp. 726–733, 1994. [6] D. Abreu and A. Rubenstein. The Structure of Nash Equilibrium in Repeated Games with Finite Automata. Econometrica 56:1259-1281, 1988. [7] P. Auer, N. Cesa-Bianchi, Y. Freund, R. Schapire. The Nonstochastic Multiarmed Bandit Problem. SIAM J. Comput. 32(1):48-77, 2002. [8] X. Chen, X. Deng, and S. Teng. Computing Nash Equilibria: Approximation and smoothed complexity. Electronic Colloquium on Computational Complexity Report TR06-023, 2006. [9] K. Daskalakis, P. Goldberg, C. Papadimitriou. The complexity of computing a Nash equilibrium. Electronic Colloquium on Computational Complexity Report TR05-115, 2005. [10] C. Ewerhart. Chess-like Games Are Dominance Solvable in at Most Two Steps. Games and Economic Behavior, 33:41-47, 2000. [11] D. Foster and R. Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 21:40-55, 1997. [12] J. Hannan. Approximation to Bayes risk in repeated play. In M. Dresher, A. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume 3, pp. 97–139. Princeton University Press, 1957. [13] S. Hart and A. Mas-Colell. A General Class of Adaptive Strategies. Journal of Economic Theory 98(1):26– 54, 2001. [14] E. Kalai. Bounded rationality and strategic complexity in repeated games. In T. Ichiishi, A. Neyman, and Y. Tauman, editors, Game Theory and Applications, pp. 131–157. Academic Press, San Diego, 1990. [15] D. Monderer, L. Shapley. Potential games. Games and Economic Behavior, 14:124–143, 1996. [16] H. Moulin and P. Vial. Strategically Zero Sum Games: the Class of Games Whose Completely Mixed Equilibria Cannot Be Improved Upon. International Journal of Game Theory, 7:201–221, 1978. [17] J. Robinson, An iterative method of solving a game, Ann. Math. 54:296–301, 1951. [18] R. Rosenthal. A Class of Games Possessing Pure-Strategy Nash Equilibria. International Journal of Game Theory, 2:65–67, 1973. [19] E. Koutsoupias and C. Papadimitriou. Worstcase equilibria. In Proceedings of the 16th Annual Symposium on Theoretical Aspects of Computer Science, pp. 404–413, 1999.

3 0.60404581 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games

Author: Chris Murray, Geoffrey J. Gordon

Abstract: In real-world planning problems, we must reason not only about our own goals, but about the goals of other agents with which we may interact. Often these agents’ goals are neither completely aligned with our own nor directly opposed to them. Instead there are opportunities for cooperation: by joining forces, the agents can all achieve higher utility than they could separately. But, in order to cooperate, the agents must negotiate a mutually acceptable plan from among the many possible ones, and each agent must trust that the others will follow their parts of the deal. Research in multi-agent planning has often avoided the problem of making sure that all agents have an incentive to follow a proposed joint plan. On the other hand, while game theoretic algorithms handle incentives correctly, they often don’t scale to large planning problems. In this paper we attempt to bridge the gap between these two lines of research: we present an efficient game-theoretic approximate planning algorithm, along with a negotiation protocol which encourages agents to compute and agree on joint plans that are fair and optimal in a sense defined below. We demonstrate our algorithm and protocol on two simple robotic planning problems.¹ (¹ We gratefully acknowledge help and comments from Ron Parr on this research. This work was supported in part by DARPA contracts HR0011-06-0023 (the CS2P program) and 55-00069 (the RADAR program). All opinions, conclusions, and errors are our own.) 1 INTRODUCTION We model the multi-agent planning problem as a general-sum stochastic game with cheap talk: the agents observe the state of the world, discuss their plans with each other, and then simultaneously select their actions. The state and actions determine a one-step reward for each player and a distribution over the world’s next state, and the process repeats. While talking allows the agents to coordinate their actions, it cannot by itself solve the problem of trust: the agents might lie or make false promises. So, we are interested in planning algorithms that find subgame-perfect Nash equilibria. In a subgame-perfect equilibrium, every deviation from the plan is deterred by the threat of a suitable punishment, and every threatened punishment is believable. To find these equilibria, planners must reason about their own and other agents’ incentives to deviate: if other agents have incentives to deviate then I can’t trust them, while if I have an incentive to deviate, they can’t trust me. In a given game there may be many subgame-perfect equilibria with widely differing payoffs: some will be better for some agents, and others will be better for other agents. It is generally not feasible to compute all equilibria [1], and even if it were, there would be no obvious way to select one to implement. It does not make sense for the agents to select an equilibrium without consulting one another: there is no reason that agent A’s part of one joint plan would be compatible with agent B’s part of another joint plan. Instead the agents must negotiate, computing and proposing equilibria until they find one which is acceptable to all parties.
This paper describes a planning algorithm and a negotiation protocol which work together to ensure that the agents compute and select a subgame-perfect Nash equilibrium which is both approximately Pareto-optimal (that is, its value to any single agent cannot be improved very much without lowering the value to another agent) and approximately fair (that is, near the so-called Nash bargaining point). Neither the algorithm nor the protocol is guaranteed to work in all games; however, they are guaranteed correct when they are applicable, and applicability is easy to check. In addition, our experiments show that they work well in some realistic situations. Together, these properties of fairness, enforceability, and Pareto optimality form a strong solution concept for a stochastic game. The use of this definition is one characteristic that distinguishes our work from previous research: ours is the first efficient algorithm that we know of to use such a strong solution concept for stochastic games. Our planning algorithm performs dynamic programming on a set-based value function: for P players, at a state s, V ∈ V(s) ⊂ R^P is an estimate of the value the players can achieve. We represent V(s) by sampling points on its convex hull. This representation is conservative, i.e., guarantees that we find a subset of the true V∗(s). Based on the sampled points we can efficiently compute one-step backups by checking which joint actions are enforceable in an equilibrium. Our negotiation protocol is based on a multi-player version of Rubinstein’s bargaining game. Players together enumerate a set of equilibria, and then take turns proposing an equilibrium from the set. Until the players agree, the protocol ends with a small probability after each step and defaults to a low-payoff equilibrium; the fear of this outcome forces players to make reasonable offers. 2 BACKGROUND 2.1 STOCHASTIC GAMES A stochastic game represents a multi-agent planning problem in the same way that a Markov Decision Process [2] represents a single-agent planning problem. As in an MDP, transitions in a stochastic game depend on the current state and action. Unlike MDPs, the current (joint) action is a vector of individual actions, one for each player. More formally, a general-sum stochastic game G is a tuple (S, sstart, P, A, T, R, γ). S is a set of states, and sstart ∈ S is the start state. P is the number of players. A = A1 × A2 × . . . × AP is the finite set of joint actions. We deal with fully observable stochastic games with perfect monitoring, where all players can observe previous joint actions. T : S × A → P(S) is the transition function, where P(S) is the set of probability distributions over S. R : S × A → R^P is the reward function. We will write Rp(s, a) for the pth component of R(s, a). γ ∈ [0, 1) is the discount factor. Player p wants to maximize her discounted total value for the observed sequence of states and joint actions s1, a1, s2, a2, . . ., namely Vp = Σ_{t=1}^{∞} γ^{t−1} Rp(st, at). A stationary policy for player p is a function πp : S → P(Ap). A stationary joint policy is a vector of policies π = (π1, . . . , πP), one for each player. A nonstationary policy for player p is a function πp : (∪_{t=0}^{∞} (S × A)^t × S) → P(Ap) which takes a history of states and joint actions and produces a distribution over player p’s actions; we can define a nonstationary joint policy analogously. For any nonstationary joint policy, there is a stationary policy that achieves the same value at every state [3].
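As a rough, hedged illustration of the objects just defined (a sketch under assumptions, not code from the paper), the tuple (S, sstart, P, A, T, R, γ) and a Monte Carlo estimate of the discounted values Vp can be written as follows; the class and function names are invented for the example.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = str
JointAction = Tuple[str, ...]              # one individual action per player

@dataclass
class StochasticGame:
    """A general-sum stochastic game (S, s_start, P, A, T, R, gamma) in code form."""
    states: List[State]
    start: State
    n_players: int
    joint_actions: List[JointAction]
    transition: Callable[[State, JointAction], Dict[State, float]]   # T(s, a) -> {s': prob}
    reward: Callable[[State, JointAction], List[float]]              # R(s, a) -> reward per player
    gamma: float

def estimate_values(game: StochasticGame, joint_policy, episodes=2000, horizon=200):
    """Monte Carlo estimate of V_p(s_start) = E[sum_t gamma^(t-1) R_p(s_t, a_t)]
    under a deterministic stationary joint policy joint_policy(s) -> joint action."""
    values = [0.0] * game.n_players
    for _ in range(episodes):
        s, discount = game.start, 1.0
        for _ in range(horizon):                      # horizon truncates the infinite sum
            a = joint_policy(s)
            r = game.reward(s, a)
            for p in range(game.n_players):
                values[p] += discount * r[p] / episodes
            next_dist = game.transition(s, a)
            s = random.choices(list(next_dist), weights=next_dist.values())[0]
            discount *= game.gamma
    return values
```

A deterministic joint policy is used only to keep the sketch short; the definitions above allow stochastic (and nonstationary) policies.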
The value function Vp^π : S → R gives expected values for player p under joint policy π. The value vector at state s, V^π(s), is the vector with components Vp^π(s). (For a nonstationary policy π we will define Vp^π(s) to be the value if s were the start state, and Vp^π(h) to be the value after observing history h.) A vector V is feasible at state s if there is a π for which V^π(s) = V, and we will say that π achieves V. We will assume public randomization: the agents can sample from a desired joint action distribution in such a way that everyone can verify the outcome. If public randomization is not directly available, there are cryptographic protocols which can simulate it [4]. This assumption means that the set of feasible value vectors is convex, since we can roll a die at the first time step to choose from a set of feasible policies. 2.2 EQUILIBRIA While optimal policies for MDPs can be determined exactly via various algorithms such as linear programming [2], it isn’t clear what it means to find an optimal policy for a general-sum stochastic game. So, rather than trying to determine a unique optimal policy, we will define a set of reasonable policies: the Pareto-dominant subgame-perfect Nash equilibria. A (possibly nonstationary) joint policy π is a Nash equilibrium if, for each individual player, no unilateral deviation from the policy would increase that player’s expected value for playing the game. Nash equilibria can contain incredible threats, that is, threats which the agents have no intention of following through on. To remove this possibility, we can define the subgame-perfect Nash equilibria. A policy π is a subgame-perfect Nash equilibrium if it is a Nash equilibrium in every possible subgame: that is, if there is no incentive for any player to deviate after observing any history of joint actions. Finally, consider two policies π and φ. If Vp^π(sstart) ≥ Vp^φ(sstart) for all players p, and if Vp^π(sstart) > Vp^φ(sstart) for at least one p, then we will say that π Pareto dominates φ. A policy which is not Pareto dominated by any other policy is Pareto optimal. 2.3 RELATED WORK Littman and Stone [5] give an algorithm for finding Nash equilibria in two-player repeated games. Hansen et al. [6] show how to eliminate very-weakly-dominated strategies in partially observable stochastic games. Doraszelski and Judd [7] show how to compute Markov perfect equilibria in continuous-time stochastic games. The above papers use solution concepts much weaker than Pareto-dominant subgame-perfect equilibrium, and do not address negotiation and coordination. Perhaps the closest work to the current paper is by Brafman and Tennenholtz [8]: they present learning algorithms which, in repeated self-play, find Pareto-dominant (but not subgame-perfect) Nash equilibria in matrix and stochastic games. By contrast, we consider a single play of our game, but allow “cheap talk” beforehand. And, our protocol encourages arbitrary algorithms to agree on Pareto-dominant equilibria, while their result depends strongly on the self-play assumption. 2.3.1 FOLK THEOREMS In any game, each player can guarantee herself an expected discounted value regardless of what actions the other players take. We call this value the safety value. Suppose that there is a stationary subgame-perfect equilibrium which achieves the safety value for both players; call this the safety equilibrium policy. Suppose that, in a repeated game, some stationary policy π is better for both players than the safety equilibrium policy.
Then we can build a subgame-perfect equilibrium with the same payoff as π: start playing π, and if someone deviates, switch to the safety equilibrium policy. So long as γ is sufficiently large, no rational player will want to deviate. This is the folk theorem for repeated games: any feasible value vector which is strictly better than the safety values corresponds to a subgame-perfect Nash equilibrium [9]. (The proof is slightly more complicated if there is no safety equilibrium policy, but the theorem holds for any repeated game.) There is also a folk theorem for general stochastic games [3]. This theorem, while useful, is not strong enough for our purposes: it only covers discount factors γ which are so close to 1 that the players don’t care which state they wind up in after a possible deviation. In most practical stochastic games, discount factors this high are unreasonably patient. When γ is significantly less than 1, the set of equilibrium vectors can change in strange ways as we change γ [10]. [Figure 1 omitted; axes: value to player 1 (horizontal) vs. value to player 2 (vertical). Caption: Equilibria of a Rubinstein game with γ = 0.8. Shaded area shows feasible value vectors (U1(x), U2(x)) for outcomes x. Right-hand circle corresponds to equilibrium when player 1 moves first, left-hand circle when player 2 moves first. The Nash point is at 3.] 2.3.2 RUBINSTEIN’S GAME Rubinstein [11] considered a game where two players divide a slice of pie. The first player offers a division x, 1 − x to the second; the second player either accepts the division, or refuses and offers her own division 1 − y, y. The game repeats until some player accepts an offer or until either player gives up. In the latter case neither player gets any pie. Rubinstein showed that if player p’s utility for receiving a fraction x at time t is Up(x, t) = γ^t Up(x) for a discount factor 0 ≤ γ < 1 and an appropriate time-independent utility function Up(x) ≥ 0, then rational players will agree on a division near the so-called Nash bargaining point. This is the point which maximizes the product of the utilities that the players gain by cooperating, U1(x)U2(1 − x). As γ ↑ 1, the equilibrium will approach the Nash point. See Fig. 1 for an illustration. For three or more players, a similar result holds where agents take turns proposing multi-way divisions of the pie [12]. See the technical report [13] for more detail on the multi-player version of Rubinstein’s game and the Nash bargaining point. 3 NEGOTIATION PROTOCOL The Rubinstein game implicitly assumes that the result of a failure to cooperate is known to all players: nobody gets any pie. The multi-player version of the game assumes in addition that giving one player a share of the pie doesn’t force us to give a share to any other player. Neither of these properties holds for general stochastic games. They are, however, easy to check, and often hold or can be made to hold for planning domains of interest. So, we will assume that the players have agreed beforehand on a subgame-perfect equilibrium π^dis, called the disagreement policy, that they will follow in the event of a negotiation failure. In addition, for games with three or more players, we will assume that each player can unilaterally reduce her own utility by any desired amount without affecting other players’ utilities. Given these assumptions, our protocol proceeds in two phases (pseudocode is given in the technical report [13]).
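Before turning to the two phases, here is a hedged numerical sketch of the Nash bargaining point from Section 2.3.2 above (the concrete utility functions are illustrative assumptions, not the paper's): the chosen division x maximizes the product U1(x)·U2(1 − x), which a coarse grid search recovers.

```python
# Nash bargaining point for the pie-splitting game: the division x maximizing the
# product of the players' gains from cooperating, U1(x) * U2(1 - x).
# The concave utility functions below are illustrative assumptions.

def u1(x):
    return x ** 0.5            # player 1's utility for a fraction x of the pie

def u2(x):
    return x                   # player 2's utility (risk-neutral, for contrast)

def nash_bargaining_split(steps=10_000):
    best_x, best_product = 0.0, -1.0
    for i in range(steps + 1):
        x = i / steps
        product = u1(x) * u2(1.0 - x)
        if product > best_product:
            best_x, best_product = x, product
    return best_x

x_star = nash_bargaining_split()
print(f"Nash point: player 1 gets {x_star:.3f} of the pie, player 2 gets {1 - x_star:.3f}")
```

With these illustrative utilities the product is maximized at x = 1/3, which the grid search recovers to three decimal places.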
In the first phase agents compute subgame-perfect equilibria and take turns revealing them. On an agent’s turn she either reveals an equilibrium or passes; if all agents pass consecutively, the protocol proceeds to the second phase. When an agent states a policy π, the other agents verify that π is a subgame-perfect equilibrium and calculate its payoff vector V^π(sstart); players who state non-equilibrium policies miss their turn. At the end of the first phase, suppose the players have revealed a set Π of policies. Define Xp(π) = Vp^π(sstart) − Vp^dis(sstart), U0 = convhull{X(π) | π ∈ Π}, and U = {u ≥ 0 | ∃v ∈ U0 with u ≤ v}, where V^dis is the value function of π^dis, Xp(π) is the excess of policy π for player p, and U is the set of feasible excess vectors. In the second phase, players take turns proposing points u ∈ U along with policies or mixtures of policies in Π that achieve them. After each proposal, all agents except the proposer decide whether to accept or reject. If everyone accepts, the proposal is implemented: everyone starts executing the agreed equilibrium. Otherwise, the players who accepted are removed from future negotiation and have their utilities fixed at the proposed levels. Fixing player p’s utility at up means that all future proposals must give p exactly up. Invalid proposals cause the proposer to lose her turn. To achieve this, the proposal may require p to voluntarily lower her own utility; this requirement is enforced by the threat that all players will revert to π^dis if p fails to act as required. If at some point we hit the ε chance of having the current round of communication end, all remaining players are assigned their disagreement values. The players execute the last proposed policy π (or π^dis if there has been no valid proposal), and any player p for whom Vp^π(sstart) is greater than her assigned utility up voluntarily lowers her utility to the correct level. (Again, failure to do so results in all players reverting to π^dis.) Under the above protocol, players’ preferences are the same as in a Rubinstein game with utility set U: because we have assumed that negotiation ends with probability ε after each message, agreeing on u after t additional steps is exactly as good as agreeing on u(1 − ε)^t now. So with ε sufficiently small, the Rubinstein or Krishna-Serrano results show that rational players will agree on a vector u ∈ U which is close to the Nash point argmax_{u∈U} Π_p u_p. 4 COMPUTING EQUILIBRIA In order to use the protocol of Sec. 3 for bargaining in a stochastic game, the players must be able to compute some subgame-perfect equilibria. Computing equilibria is a hard problem, so we cannot expect real agents to find the entire set of equilibria. Fortunately, each player will want to find the equilibria which are most advantageous to herself to influence the negotiation process in her favor. But equilibria which offer other players reasonably high reward have a higher chance of being accepted in negotiation. So, self interest will naturally distribute the computational burden among all the players. In this section we describe an efficient dynamic-programming algorithm for computing equilibria. The algorithm takes some low-payoff equilibria as input and (usually) outputs higher-payoff equilibria. It is based on the intuition that we can use low-payoff equilibria as enforcement tools: by threatening to switch to an equilibrium that has low value to player p, we can deter p from deviating from a cooperative policy. In more detail, we will assume that we are given P different equilibria π1^pun, . . ., πP^pun;
we will use πp^pun to punish player p if she deviates. We can set πp^pun = π^dis for all p if π^dis is the only equilibrium we know; or, we can use any other equilibrium policies that we happen to have discovered. The algorithm will be most effective when the value of πp^pun to player p is as low as possible in all states. We will then search for cooperative policies that we can enforce with the given threats πp^pun. We will first present an algorithm which pretends that we can efficiently take direct sums and convex hulls of arbitrary sets. This algorithm is impractical, but finds all enforceable value vectors. We will then turn it into an approximate algorithm which uses finite data structures to represent the set-valued variables. As we allow more and more storage for each set, the approximate algorithm will approach the exact one; and in any case the result will be a set of equilibria which the agents can execute. 4.1 THE EXACT ALGORITHM Our algorithm maintains a set of value vectors V(s) for each state s. It initializes V(s) to a set which we know contains the value vectors for all equilibrium policies. It then refines V by dynamic programming: it repeatedly attempts to improve the set of values at each state by backing up all of the joint actions, excluding joint actions from which some agent has an incentive to deviate. In more detail, we will compute Vp^dis(s) ≡ Vp^{π^dis}(s) for all s and p and use the vector V^dis(s) in our initialization. (Recall that we have defined Vp^π(s) for a nonstationary policy π as the value of π if s were the start state.) We also need the values of the punishment policies for their corresponding players, Vp^pun(s) ≡ Vp^{πp^pun}(s) for all p and s. Given these values, define Qp^dev(s, a) = Rp(s, a) + γ Σ_{s′∈S} T(s, a)(s′) Vp^pun(s′) (1) to be the value to player p of playing joint action a from state s and then following πp^pun forever after. From the above Q^dev values we can compute player p’s value for deviating from an equilibrium which recommends action a in state s: it is Qp^dev(s, a′) for the best possible deviation a′, since p will get the one-step payoff for a′ but be punished by the rest of the players starting on the following time step. That is, Vp^dev(s, a) = max_{a′p ∈ Ap} Qp^dev(s, a1 × . . . × a′p × . . . × aP) (2) Vp^dev(s, a) is the value we must achieve for player p in state s if we are planning to recommend action a and punish deviations with πp^pun: if we do not achieve this value, player p would rather deviate and be punished. Our algorithm is shown in Fig. 2, reproduced here in outline:

Figure 2: Dynamic programming using exact operations on sets of value vectors.
    Initialization: for s ∈ S: V(s) ← {V | Vp^dis(s) ≤ Vp ≤ Rmax/(1 − γ)}
    Repeat until converged (for iteration ← 1, 2, . . .):
        for s ∈ S:
            (compute the value vector set for each joint action, then throw away unenforceable vectors)
            for a ∈ A:
                Q(s, a) ← {R(s, a)} + γ Σ_{s′∈S} T(s, a)(s′) V(s′)
                Q(s, a) ← {Q ∈ Q(s, a) | Q ≥ V^dev(s, a)}
            (we can now randomize among joint actions)
            V(s) ← convhull ∪a Q(s, a)

After k iterations, each vector in V(s) corresponds to a k-step policy in which no agent ever has an incentive to deviate. In the (k + 1)st iteration, the first assignment to Q(s, a) computes the value of performing action a followed by any k-step policy. The second assignment throws out the pairs (a, π) for which some agent would want to deviate from a given that the agents plan to follow π in the future. And the convex hull accounts for the fact that, on reaching state s, we can select an action a and future policy π at random from the feasible pairs.² (² It is important for this randomization to occur after reaching state s to avoid introducing incentives to deviate, and it is also important for the randomization to be public.)
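To make equations (1)–(2) and the filtering step in Fig. 2 concrete, the following sketch reuses the StochasticGame structure from the earlier snippet and assumes each V(s′) is already represented by a finite list of value vectors; it computes the deviation thresholds and keeps only enforceable backed-up vectors for a single state–action pair. This is a hedged illustration, not the authors' implementation.

```python
import itertools

def deviation_value(p, s, a, game, V_pun):
    """Q^dev_p(s, a) from Eq. (1): play joint action a once, then be punished forever."""
    r_p = game.reward(s, a)[p]
    dist = game.transition(s, a)
    return r_p + game.gamma * sum(prob * V_pun[p][s2] for s2, prob in dist.items())

def min_enforceable_value(p, s, a, game, V_pun, action_sets):
    """V^dev_p(s, a) from Eq. (2): player p's best unilateral deviation from a."""
    best = float("-inf")
    for ap in action_sets[p]:
        dev = tuple(ap if q == p else a[q] for q in range(game.n_players))
        best = max(best, deviation_value(p, s, dev, game, V_pun))
    return best

def enforceable_backups(s, a, game, V_sets, V_pun, action_sets):
    """One Q(s, a) backup from Fig. 2. V_sets[s'] is a finite list of value vectors
    standing in for V(s'); the product below enumerates the direct sum over
    successor states, and vectors below the deviation thresholds are discarded."""
    dist = game.transition(s, a)
    succs = list(dist)
    thresholds = [min_enforceable_value(p, s, a, game, V_pun, action_sets)
                  for p in range(game.n_players)]
    reward = game.reward(s, a)
    backed_up = []
    for choice in itertools.product(*(V_sets[s2] for s2 in succs)):
        q = [reward[p] + game.gamma * sum(dist[s2] * v[p] for s2, v in zip(succs, choice))
             for p in range(game.n_players)]
        if all(q[p] >= thresholds[p] for p in range(game.n_players)):
            backed_up.append(q)        # enforceable: no player prefers to deviate
    return backed_up
```

Enumerating the product of successor sets is exponential in the number of successor states; it only mirrors the exact direct sum for tiny examples, which is exactly why the paper moves to the witness-based approximation below.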
Proofs of convergence and correctness of the exact algorithm are in the technical report [13]. Of course, we cannot actually implement the algorithm of Fig. 2, since it requires variables whose values are convex sets of vectors. But we can approximate V(s) by choosing a finite set of witness vectors W ⊂ R^P and storing V(s, w) = argmax_{v∈V(s)} (v·w) for each w ∈ W. V(s) is then approximated by the convex hull of {V(s, w) | w ∈ W}. If W samples the P-dimensional unit hypersphere densely enough, the maximum possible approximation error will be small. (In practice, each agent will probably want to pick W differently, to focus her computation on policies in the portion of the Pareto frontier where her own utility is relatively high.) As |W| increases, the error introduced at each step will go to zero. The approximate algorithm is given in more detail in the technical report [13]. [Figure 3 omitted; it shows execution traces for players P1 and P2 in the motion planning example. Caption: Execution traces for our motion planning example. Left and Center: with 2 witness vectors, the agents randomize between two selfish paths. Right: with 4–32 witnesses, the agents find a cooperative path. Steps where either player gets a goal are marked with ×.] [Figure 4 omitted; the left panel shows the supply-chain map with the warehouses and the shop, the right panel plots value to Player 1 against value to Player 2. Caption: Supply chain management problem. In the left figure, Player 1 is about to deliver part D to the shop, while player 2 is at the warehouse which sells B. The right figure shows the tradeoff between accuracy and computation time. The solid curve is the Pareto frontier for sstart, as computed using 8 witnesses per state. The dashed and dotted lines were computed using 2 and 4 witnesses, respectively. Dots indicate computed value vectors; × marks indicate the Nash points.] 5 EXPERIMENTS We tested our value iteration algorithm and negotiation procedure on two robotic planning domains: a joint motion planning problem and a supply-chain management problem. In our motion planning problem (Fig. 3), two players together control a two-wheeled robot, with each player picking the rotational velocity for one wheel. Each player has a list of goal landmarks which she wants to cycle through, but the two players can have different lists of goals. We discretized states based on X, Y, θ and the current goals, and discretized actions into stop, slow (0.45 m/s), and fast (0.9 m/s), for 9 joint actions and about 25,000 states. We discretized time at ∆t = 1 s, and set γ = 0.99. For both the disagreement policy and all punishment policies, we used “always stop,” since by keeping her wheel stopped either player can prevent the robot from moving. Planning took a few hours of wall clock time on a desktop workstation for 32 witnesses per state. Based on the planner’s output, we ran our negotiation protocol to select an equilibrium. Fig. 3 shows the results: with limited computation the players pick two selfish paths and randomize equally between them, while with more computation they find the cooperative path.
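The witness-vector approximation described a few paragraphs above can be sketched as follows (a standalone toy illustration; the random candidate vectors and the witness sampling scheme are assumptions, not the paper's data or code):

```python
import math
import random

def unit_witnesses(n_witnesses, n_players, seed=0):
    """Sample witness directions roughly uniformly from the unit hypersphere."""
    rng = random.Random(seed)
    witnesses = []
    for _ in range(n_witnesses):
        v = [rng.gauss(0.0, 1.0) for _ in range(n_players)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        witnesses.append([x / norm for x in v])
    return witnesses

def witness_approximation(value_vectors, witnesses):
    """Keep, for each witness w, the vector maximizing v . w  (V(s, w) in the text)."""
    kept = []
    for w in witnesses:
        best = max(value_vectors, key=lambda v: sum(vi * wi for vi, wi in zip(v, w)))
        if best not in kept:
            kept.append(best)
    return kept

# Toy usage: approximate a cloud of 2-player value vectors with 8 witnesses.
rng = random.Random(42)
candidates = [[rng.uniform(0, 3), rng.uniform(0, 3)] for _ in range(200)]
approx = witness_approximation(candidates, unit_witnesses(8, 2))
print(len(approx), "extreme points retained out of", len(candidates))
```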
Our experiments also showed that limiting the computation available to one player allows the unrestricted player to reveal only some of the equilibria she knows about, tilting the outcome of the negotiation in her favor (see the technical report [13] for details). For our second experiment we examined a more realistic supply-chain problem. Here each player is a parts supplier competing for the business of an engine manufacturer. The manufacturer doesn’t store items and will only pay for parts which can be used immediately. Each player controls a truck which moves parts from warehouses to the assembly shop; she pays for parts when she picks them up, and receives payment on delivery. Each player gets parts from different locations at different prices and no one player can provide all of the parts the manufacturer needs. Each player’s truck can be at six locations along a line: four warehouse locations (each of which provides a different type of part), one empty location, and the assembly shop. Building an engine requires five parts, delivered in the order A, {B, C}, D, E (parts B and C can arrive in either order). After E, the manufacturer needs A again. Players can move left or right along the line at a small cost, or wait for free. They can also buy parts at a warehouse (dropping any previous cargo), or sell their cargo if they are at the shop and the manufacturer wants it. Each player can only carry one part at a time and only one player can make a delivery at a time. Finally, any player can retire and sell her truck; in this case the game ends and all players get the value of their truck plus any cargo. The disagreement policy is for all players to retire at all states. Fig. 4 shows the computed sets V(sstart) for various numbers of witnesses. The more witnesses we use, the more accurately we represent the frontier, and the closer our final policy is to the true Nash point. All of the policies computed are “intelligent” and “cooperative”: a human observer would not see obvious ways to improve them, and in fact would say that they look similar despite their differing payoffs. Players coordinate their motions, so that one player will drive out to buy part E while the other delivers part D. They sit idle only in order to delay the purchase of a part which would otherwise be delivered too soon. 6 CONCLUSION Real-world planning problems involve negotiation among multiple agents with varying goals. To take all agents’ incentives into account, the agents should find and agree on Pareto-dominant subgame-perfect Nash equilibria. For this purpose, we presented efficient planning and negotiation algorithms for general-sum stochastic games, and tested them on two robotic planning problems. References [1] V. Conitzer and T. Sandholm. Complexity results about Nash equilibria. Technical Report CMU-CS-02-135, School of Computer Science, Carnegie-Mellon University, 2002. [2] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Massachusetts, 1995. [3] Prajit K. Dutta. A folk theorem for stochastic games. Journal of Economic Theory, 66:1–32, 1995. [4] Yevgeniy Dodis, Shai Halevi, and Tal Rabin. A cryptographic solution to a game theoretic problem. In Lecture Notes in Computer Science, volume 1880, page 112. Springer, Berlin, 2000. [5] Michael L. Littman and Peter Stone. A polynomial-time Nash equilibrium algorithm for repeated games. In ACM Conference on Electronic Commerce, pages 48–54. ACM, 2003. [6] E. Hansen, D. Bernstein, and S. Zilberstein.
Dynamic programming for partially observable stochastic games. In Proceedings of the Nineteenth National Conference on Artificial Intelligence, pages 709–715, 2004. [7] Ulrich Doraszelski and Kenneth L. Judd. Avoiding the curse of dimensionality in dynamic stochastic games. NBER Technical Working Paper No. 304, January 2005. [8] R. Brafman and M. Tennenholtz. Efficient learning equilibrium. Artificial Intelligence, 2004. [9] D. Fudenberg and E. Maskin. The folk theorem in repeated games with discounting or with incomplete information. Econometrica, 1986. [10] David Levine. The castle on the hill. Review of Economic Dynamics, 3(2):330–337, 2000. [11] Ariel Rubinstein. Perfect equilibrium in a bargaining model. Econometrica, 50(1):97–109, 1982. [12] V. Krishna and R. Serrano. Multilateral bargaining. Review of Economic Studies, 1996. [13] Chris Murray and Geoffrey J. Gordon. Multi-robot negotiation: approximating the set of subgame perfect equilibria in general-sum stochastic games. Technical Report CMU-ML-06114, Carnegie Mellon University, 2006.

4 0.59791583 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes

Author: Neil D. Lawrence, Guido Sanguinetti, Magnus Rattray

Abstract: Modelling the dynamics of transcriptional processes in the cell requires the knowledge of a number of key biological quantities. While some of them are relatively easy to measure, such as mRNA decay rates and mRNA abundance levels, it is still very hard to measure the active concentration levels of the transcription factor proteins that drive the process and the sensitivity of target genes to these concentrations. In this paper we show how these quantities for a given transcription factor can be inferred from gene expression levels of a set of known target genes. We treat the protein concentration as a latent function with a Gaussian process prior, and include the sensitivities, mRNA decay rates and baseline expression levels as hyperparameters. We apply this procedure to a human leukemia dataset, focusing on the tumour repressor p53 and obtaining results in good accordance with recent biological studies.
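A hedged sketch of the kind of model the abstract describes (the RBF kernel, parameter values, and the linear production/decay ODE form are illustrative assumptions, not taken from the paper): a latent transcription-factor activity drawn from a Gaussian process prior drives a target gene's mRNA level.

```python
import numpy as np

# Assumed model form (illustrative): latent TF activity f(t) ~ GP(0, K), and for a
# target gene j an assumed linear production/decay ODE
#     dx_j/dt = B_j + S_j * f(t) - D_j * x_j(t),
# with baseline B_j, sensitivity S_j, and mRNA decay rate D_j as hyperparameters.

def rbf_kernel(t, lengthscale=2.0, variance=1.0):
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 12.0, 121)                    # time grid (e.g., hours)
K = rbf_kernel(t) + 1e-8 * np.eye(t.size)          # jitter for numerical stability
f = rng.multivariate_normal(np.zeros(t.size), K)   # one draw of latent TF activity
f = np.maximum(f, 0.0)                             # activities kept non-negative

B, S, D = 0.1, 1.5, 0.8                            # illustrative baseline, sensitivity, decay
x = np.zeros_like(t)
dt = t[1] - t[0]
for i in range(1, t.size):                         # forward-Euler integration of the ODE
    x[i] = x[i - 1] + dt * (B + S * f[i - 1] - D * x[i - 1])

print("peak latent activity:", round(float(f.max()), 3),
      "peak mRNA level:", round(float(x.max()), 3))
```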

5 0.55836672 13 nips-2006-A Scalable Machine Learning Approach to Go

Author: Lin Wu, Pierre F. Baldi

Abstract: Go is an ancient board game that poses unique opportunities and challenges for AI and machine learning. Here we develop a machine learning approach to Go, and related board games, focusing primarily on the problem of learning a good evaluation function in a scalable way. Scalability is essential at multiple levels, from the library of local tactical patterns, to the integration of patterns across the board, to the size of the board itself. The system we propose is capable of automatically learning the propensity of local patterns from a library of games. Propensity and other local tactical information are fed into a recursive neural network, derived from a Bayesian network architecture. The network integrates local information across the board and produces local outputs that represent local territory ownership probabilities. The aggregation of these probabilities provides an effective strategic evaluation function that is an estimate of the expected area at the end (or at other stages) of the game. Local area targets for training can be derived from datasets of human games. A system trained using only 9 × 9 amateur game data performs surprisingly well on a test set derived from 19 × 19 professional game data. Possible directions for further improvements are briefly discussed. 1

6 0.47588801 14 nips-2006-A Small World Threshold for Economic Network Formation

7 0.45461962 29 nips-2006-An Information Theoretic Framework for Eukaryotic Gradient Sensing

8 0.38051271 155 nips-2006-Optimal Single-Class Classification Strategies

9 0.29233277 90 nips-2006-Hidden Markov Dirichlet Process: Modeling Genetic Recombination in Open Ancestral Space

10 0.28127992 108 nips-2006-Large Scale Hidden Semi-Markov SVMs

11 0.24726073 124 nips-2006-Linearly-solvable Markov decision problems

12 0.24451032 166 nips-2006-Recursive Attribute Factoring

13 0.24311616 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes

14 0.24048333 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields

15 0.23898254 98 nips-2006-Inferring Network Structure from Co-Occurrences

16 0.22983637 192 nips-2006-Theory and Dynamics of Perceptual Bistability

17 0.22732152 196 nips-2006-TrueSkill™: A Bayesian Skill Rating System

18 0.22642741 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions

19 0.22090662 5 nips-2006-A Kernel Method for the Two-Sample-Problem

20 0.20062515 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.09), (3, 0.019), (7, 0.054), (9, 0.032), (11, 0.393), (20, 0.02), (22, 0.058), (44, 0.043), (57, 0.042), (64, 0.012), (65, 0.031), (69, 0.016), (71, 0.065), (90, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82752997 81 nips-2006-Game Theoretic Algorithms for Protein-DNA binding

Author: Luis Pérez-breva, Luis E. Ortiz, Chen-hsiang Yeang, Tommi S. Jaakkola

Abstract: We develop and analyze game-theoretic algorithms for predicting coordinate binding of multiple DNA binding regulators. The allocation of proteins to local neighborhoods and to sites is carried out with resource constraints while explicating competing and coordinate binding relations among proteins with affinity to the site or region. The focus of this paper is on mathematical foundations of the approach. We also briefly demonstrate the approach in the context of the λ-phage switch. 1

2 0.66988772 181 nips-2006-Stability of $K$-Means Clustering

Author: Alexander Rakhlin, Andrea Caponnetto

Abstract: We phrase K-means clustering as an empirical risk minimization procedure over a class HK and explicitly calculate the covering number for this class. Next, we show that stability of K-means clustering is characterized by the geometry of HK with respect to the underlying distribution. We prove that in the case of a unique global minimizer, the clustering solution is stable with respect to complete changes of the data, while for the case of multiple minimizers, the change of Ω(n^{1/2}) samples defines the transition between stability and instability. While for a finite number of minimizers this result follows from multinomial distribution estimates, the case of infinite minimizers requires more refined tools. We conclude by proving that stability of the functions in HK implies stability of the actual centers of the clusters. Since stability is often used for selecting the number of clusters in practice, we hope that our analysis serves as a starting point for finding theoretically grounded recipes for the choice of K. 1
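As a hedged illustration of the stability notion used above (a resampling heuristic, not the paper's analysis): fit K-means on two disjoint halves of the data and measure how far the fitted centers move; for well-separated data the correct K tends to look markedly more stable.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on 1-D points (illustrative, not optimized)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            clusters[j].append(x)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)

def instability(points, k, trials=20):
    """Average distance between centers fitted on two random halves of the data."""
    rng = random.Random(1)
    total = 0.0
    for _ in range(trials):
        sample = points[:]
        rng.shuffle(sample)
        half = len(sample) // 2
        c1 = kmeans(sample[:half], k, seed=rng.randrange(10**6))
        c2 = kmeans(sample[half:], k, seed=rng.randrange(10**6))
        total += sum(abs(a - b) for a, b in zip(c1, c2)) / k
    return total / trials

# Two well-separated 1-D blobs: K = 2 should look markedly more stable than K = 3 or 4.
data = ([random.gauss(0.0, 0.3) for _ in range(200)] +
        [random.gauss(5.0, 0.3) for _ in range(200)])
for k in (2, 3, 4):
    print(k, round(instability(data, k), 3))
```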

3 0.35976404 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes

Author: Huan Xu, Shie Mannor

Abstract: Computation of a satisfactory control policy for a Markov decision process when the parameters of the model are not exactly known is a problem encountered in many practical applications. The traditional robust approach is based on a worstcase analysis and may lead to an overly conservative policy. In this paper we consider the tradeoff between nominal performance and the worst case performance over all possible models. Based on parametric linear programming, we propose a method that computes the whole set of Pareto efficient policies in the performancerobustness plane when only the reward parameters are subject to uncertainty. In the more general case when the transition probabilities are also subject to error, we show that the strategy with the “optimal” tradeoff might be non-Markovian and hence is in general not tractable. 1

4 0.35342106 187 nips-2006-Temporal Coding using the Response Properties of Spiking Neurons

Author: Thomas Voegtlin

Abstract: In biological neurons, the timing of a spike depends on the timing of synaptic currents, in a way that is classically described by the Phase Response Curve. This has implications for temporal coding: an action potential that arrives on a synapse has an implicit meaning, that depends on the position of the postsynaptic neuron on the firing cycle. Here we show that this implicit code can be used to perform computations. Using theta neurons, we derive a spike-timing dependent learning rule from an error criterion. We demonstrate how to train an auto-encoder neural network using this rule. 1

5 0.35194358 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes

Author: Neil D. Lawrence, Guido Sanguinetti, Magnus Rattray

Abstract: Modelling the dynamics of transcriptional processes in the cell requires the knowledge of a number of key biological quantities. While some of them are relatively easy to measure, such as mRNA decay rates and mRNA abundance levels, it is still very hard to measure the active concentration levels of the transcription factor proteins that drive the process and the sensitivity of target genes to these concentrations. In this paper we show how these quantities for a given transcription factor can be inferred from gene expression levels of a set of known target genes. We treat the protein concentration as a latent function with a Gaussian process prior, and include the sensitivities, mRNA decay rates and baseline expression levels as hyperparameters. We apply this procedure to a human leukemia dataset, focusing on the tumour repressor p53 and obtaining results in good accordance with recent biological studies.

6 0.35040629 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

7 0.35014847 99 nips-2006-Information Bottleneck Optimization and Independent Component Extraction with Spiking Neurons

8 0.34638575 29 nips-2006-An Information Theoretic Framework for Eukaryotic Gradient Sensing

9 0.34398124 162 nips-2006-Predicting spike times from subthreshold dynamics of a neuron

10 0.33764872 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

11 0.33747041 65 nips-2006-Denoising and Dimension Reduction in Feature Space

12 0.33600974 154 nips-2006-Optimal Change-Detection and Spiking Neurons

13 0.3345167 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

14 0.33254233 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

15 0.33243886 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

16 0.33157364 20 nips-2006-Active learning for misspecified generalized linear models

17 0.33062348 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

18 0.33047405 175 nips-2006-Simplifying Mixture Models through Function Approximation

19 0.3298583 138 nips-2006-Multi-Task Feature Learning

20 0.32905328 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming