nips nips2007 nips2007-52 knowledge-graph by maker-knowledge-mining

52 nips-2007-Competition Adds Complexity


Source: pdf

Author: Judy Goldsmith, Martin Mundhenk

Abstract: It is known that determining whether a DEC-POMDP, namely, a cooperative partially observable stochastic game (POSG), has a cooperative strategy with positive expected reward is complete for NEXP. It was not known until now how cooperation affected that complexity. We show that, for competitive POSGs, the complexity of determining whether one team has a positive-expected-reward strategy is complete for NEXP^NP.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract It is known that determining whether a DEC-POMDP, namely, a cooperative partially observable stochastic game (POSG), has a cooperative strategy with positive expected reward is complete for NEXP. [sent-5, score-0.412]

2 1 Partially observable stochastic games A partially observable stochastic game (POSG) describes a multi-player stochastic game with imperfect information by its states and the consequences of the players' actions on the system. [sent-20, score-0.349]

3 , k} of agents (or players), S is a finite set of states, with distinguished initial state s0 ∈ S, A is a finite set of actions, and O is a finite set of observations • t : S × A^k × S → [0, 1] is the transition probability function, where t(s, a1 , . [sent-24, score-0.902]

4 , ak , s′) is the probability that state s′ is reached from state s when each agent i chooses action ai • o : S × I → O is the observation function, where o(s, i) is the observation made in state s by agent i, and • r : S × A^k × I → Z is the reward function, where r(s, a1 , . [sent-27, score-0.837]

5 , ak , i) is the reward gained by agent i in state s, when the agents take actions a1 , . [sent-30, score-1.15]

6 ) A POSG where all agents have the same reward function is called a decentralized partially observable Markov decision process (see [1]). [sent-35, score-0.844]
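
To make the tuple above concrete, here is a minimal Python sketch of a POSG container, assuming dictionary-backed t, o, and r functions; the class and field names are illustrative choices of mine, not notation from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

JointAction = Tuple[str, ...]  # one action per agent, (a_1, ..., a_k)

@dataclass
class POSG:
    """Hypothetical encoding of a POSG (I, S, s0, A, O, t, o, r)."""
    k: int                                        # number of agents in I = {1, ..., k}
    states: List[str]                             # S
    s0: str                                       # distinguished initial state
    actions: List[str]                            # A
    observations: List[str]                       # O
    t: Dict[Tuple[str, JointAction, str], float]  # t(s, (a_1,...,a_k), s') in [0, 1]
    o: Dict[Tuple[str, int], str]                 # o(s, i): observation of agent i in state s
    r: Dict[Tuple[str, JointAction, int], int]    # r(s, (a_1,...,a_k), i): reward of agent i

    def is_dec_pomdp(self) -> bool:
        """A POSG where all agents share one reward function is a DEC-POMDP (see [1])."""
        return all(self.r[(s, a, i)] == self.r[(s, a, 1)]
                   for (s, a, i) in self.r
                   if (s, a, 1) in self.r)
```

The later sketches in this summary reuse this container.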

7 A (history-dependent) policy π chooses an action dependent on all observations made by the agent during the run of the process. [sent-41, score-0.268]

8 We use Tl(s) to denote all length-l trajectories which start in the initial state s0 and end in state s. [sent-68, score-0.244]

9 The expected reward Ri(s, l, π1^k) obtained by agent i in state s after exactly l steps under policies π1^k is the reward obtained in s by the actions according to π1^k, weighted by the probability that s is reached after l steps: Ri(s, l, π1^k) = Σ_{θ^l ∈ Tl(s)} r(s, π1^k(θ1^k), i) · prob(θ^l, π1^k). [sent-69, score-0.687]

10 The short-term performance for policies π1^k for agent i with POSG M is the expected sum of rewards received by agent i during the next |M| steps by following the policy π1^k, i. [sent-78, score-0.571]
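
As a specification-level check on the definitions of Tl(s), Ri, and the short-term performance, one can compute the expected reward by exhaustively expanding trajectories up to the horizon |M|. The sketch below reuses the hypothetical POSG container from above and represents each history-dependent policy as a function from an agent's observation sequence to an action; both encodings are my own assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Policy = Callable[[Tuple[str, ...]], str]  # observation history -> action

def performance(M: POSG, policies: Sequence[Policy], agent: int, horizon: int) -> float:
    """Expected sum of rewards for `agent` over `horizon` steps under `policies`.

    Expands every trajectory explicitly, so it runs in time exponential in the
    horizon; it mirrors the definition rather than being an efficient algorithm.
    """
    total = 0.0

    def expand(state: str, prob: float, histories: List[Tuple[str, ...]], step: int) -> None:
        nonlocal total
        if prob == 0.0 or step == horizon:
            return
        # each agent i observes o(state, i) and extends its private observation history
        histories = [h + (M.o[(state, i + 1)],) for i, h in enumerate(histories)]
        joint = tuple(policies[i](histories[i]) for i in range(M.k))
        total += prob * M.r.get((state, joint, agent), 0)
        for s_next in M.states:
            p = M.t.get((state, joint, s_next), 0.0)
            if p > 0.0:
                expand(s_next, prob * p, histories, step + 1)

    expand(M.s0, 1.0, [() for _ in range(M.k)], 0)
    return total
```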

11 The cooperative agents problem for k agents: instance: a POSG M for k agents; query: are there policies π1 , . [sent-85, score-1.71]

12 The competing agents problem for 2k agents: instance: a POSG M for 2k agents; query: are there policies π1 , . [sent-95, score-1.671]

13 , k have positive performance independent of which policies agents k + 1, k + 2, . [sent-101, score-0.898]

14 [1] that the cooperative agents problem for two or more agents is complete for NEXP. [sent-115, score-1.557]

15 3 Domino tilings Domino tiling problems are useful for reductions between different kinds of computations. [sent-125, score-0.235]

16 Every pair of neighboured tiles in the same row is in H. [sent-135, score-0.271]

17 Every pair of neighboured tiles in the same column is in V. [sent-140, score-0.241]

18 The exponential square tiling problem is the set of all pairs (T, 1^k), where T is a tile type and 1^k is a string consisting of k 1s (k ∈ N), such that there exists a T-tiling of the 2^k-square. [sent-144, score-0.408]
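
Note that the input (T, 1^k) is tiny compared with the 2^k × 2^k grid it asks about, which is the usual intuition for why the problem lands at NEXP rather than NP. Verifying a candidate tiling, on the other hand, is a simple scan; below is a sketch with my own grid encoding, where a tile type T = (V, H) is given as two sets of allowed vertical and horizontal pairs.

```python
from typing import List, Set, Tuple

Tile = str
TileType = Tuple[Set[Tuple[Tile, Tile]], Set[Tuple[Tile, Tile]]]  # T = (V, H)

def is_valid_tiling(grid: List[List[Tile]], tile_type: TileType) -> bool:
    """Check a candidate tiling of an n x n square against the tile type T = (V, H):
    every horizontally neighboured pair must be in H, every vertically neighboured
    pair in V.  (For the exponential problem, n = 2^k.)"""
    V, H = tile_type
    n = len(grid)
    if any(len(row) != n for row in grid):
        return False
    for i in range(n):
        for j in range(n):
            if j + 1 < n and (grid[i][j], grid[i][j + 1]) not in H:
                return False  # tiles side by side in a row must form a pair in H
            if i + 1 < n and (grid[i][j], grid[i + 1][j]) not in V:
                return False  # tiles stacked in a column must form a pair in V
    return True
```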

19 It was shown by Savelsbergh and van Emde Boas [4] that the exponential square tiling problem is complete for NEXP. [sent-145, score-0.335]

20 We will consider the following variant, which we call the exponential Σ2 square tiling problem: given a pair (T, 1^k), does there exist a row w of tiles and a T-tiling of the 2^k-square with final row w, such that there exists no T-tiling of the 2^k-square with initial row w? [sent-146, score-0.618]

21 29 in [4], which translates Turing machine computations into tilings, is very robust in the sense that simple variants of the square tiling problem can analogously be shown to be complete for different complexity classes. [sent-148, score-0.321]

22 2 The exponential Σ2 square tiling problem is complete for NEXP^NP. [sent-151, score-0.335]

23 Papadimitriou and Tsitsiklis [5] proved that it is PSPACE-complete to decide the cooperative agents problem for POMDPs. [sent-153, score-0.812]

24 1 For any k ≥ 2, the cooperative agents problem for k agents for stationary policies is NP-complete. [sent-161, score-1.769]

25 Hence, the cooperative agents problem for stationary policies is NP-hard. [sent-165, score-1.042]

26 , πk be a sequence of stationary policies for the k agents. [sent-172, score-0.253]

27 Under a fixed sequence of policies, the performance of the POSG for all of the agents can be calculated in polynomial time. [sent-174, score-0.75]

28 Using a guess-and-check approach (guess the stationary policies and evaluate the POSG), one sees that the cooperative agents problem for stationary policies is in NP. [sent-175, score-1.323]
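
The "evaluate the POSG" step of this argument can be done by pushing a distribution over states forward for |M| steps, which takes time polynomial in the size of M once the stationary policies are fixed. Here is a sketch under the same hypothetical POSG encoding as before, where a stationary policy maps the agent's current observation to an action.

```python
from typing import Callable, Dict, Sequence

StationaryPolicy = Callable[[str], str]  # current observation -> action

def performance_stationary(M: POSG, policies: Sequence[StationaryPolicy],
                           agent: int, horizon: int) -> float:
    """Expected reward of `agent` over `horizon` steps under fixed stationary policies.

    One dictionary of state probabilities is pushed forward per step, so the running
    time is roughly horizon * |S|^2 * k, i.e. polynomial, as the NP upper bound needs.
    """
    dist: Dict[str, float] = {M.s0: 1.0}
    total = 0.0
    for _ in range(horizon):
        new_dist: Dict[str, float] = {}
        for s, p in dist.items():
            # each agent reacts only to its current observation o(s, i)
            joint = tuple(policies[i](M.o[(s, i + 1)]) for i in range(M.k))
            total += p * M.r.get((s, joint, agent), 0)
            for s_next in M.states:
                tp = M.t.get((s, joint, s_next), 0.0)
                if tp > 0.0:
                    new_dist[s_next] = new_dist.get(s_next, 0.0) + p * tp
        dist = new_dist
    return total
```

Guessing the k stationary policies and running this evaluation for every agent is exactly the guess-and-check pattern behind the NP membership claim.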

29 instance: a POSG M for k agents; query: do all agents under every stationary policy have positive performance? [sent-180, score-1.583]

30 The cooperative agents problem was shown to be NEXP-complete by Bernstein et al. [sent-187, score-0.812]

31 Not surprisingly, if the agents compete, the problem becomes harder. [sent-189, score-0.727]

32 3 For every k ≥ 1, the competing agents problem for 2k agents is in NEXP^NP. [sent-191, score-1.517]

33 , k, and construct a POSG that “implements” these policies and leaves open the actions chosen by agents k + 1, . [sent-199, score-0.96]

34 In M ′ , we have as agents 0 those of M , whose policies are not fixed, i. [sent-212, score-0.898]

35 The meaning of state (s, u) ∈ S′ is that state s can be reached on a trajectory u (that ends with s) through M with the fixed policies. [sent-221, score-0.304]

36 Essentially, we are interested in the rewards obtained by the agents 1, 2, . [sent-245, score-0.776]

37 The rewards obtained by the other agents have no impact on this, only the actions the other agents choose. [sent-249, score-1.565]

38 Therefore, agent i obtains the rewards in M ′ that are obtained by agent i − k in M . [sent-250, score-0.333]

39 , 2k obtain in M ′ the same rewards that are obtained by agents 1, 2, . [sent-254, score-0.776]

40 For w ∈ S^{≤|M|} we use o(w, i) to describe the sequence of observations made by agent i on trajectory w. [sent-270, score-0.253]

41 The sink state in M ′ is the only state that lies on a loop. [sent-274, score-0.292]

42 This means that on all trajectories through M′, the sink state is the only state that may appear more than once. [sent-275, score-0.321]

43 Therefore, there is a one-to-one correspondence between history-dependent policies for M and stationary policies for M ′ (with regard to horizon |M |). [sent-277, score-0.401]
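
One way to see this correspondence is to build the augmented state space explicitly: a state of M′ is a pair (s, u) of a state of M together with the trajectory u leading to it, so a stationary choice at (s, u) can depend on the whole history. The toy enumeration below (names and simplifications are mine; it ignores the fixed team's policies and simply collects all reachable trajectories) illustrates the idea and the blow-up.

```python
from itertools import product
from typing import List, Tuple

def augmented_states(M: POSG, horizon: int) -> List[Tuple[str, Tuple[str, ...]]]:
    """Enumerate pairs (s, u): a state s of M together with a trajectory u ending in s.

    Because every trajectory becomes its own state, a stationary policy on these
    pairs can encode a history-dependent policy on M; the price is that the number
    of pairs is exponential in the horizon, matching the NEXP-level upper bound.
    """
    joint_actions = list(product(M.actions, repeat=M.k))
    level = [(M.s0, (M.s0,))]
    result = list(level)
    for _ in range(horizon):
        nxt = []
        for s, u in level:
            successors = {s2 for a in joint_actions for s2 in M.states
                          if M.t.get((s, a, s2), 0.0) > 0.0}
            nxt.extend((s2, u + (s2,)) for s2 in sorted(successors))
        level = nxt
        result.extend(level)
    return result
```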

44 Thus, this yields an NEXP^NP algorithm to decide the competitive agents problem. [sent-290, score-0.727]

45 In the first step, the policies for the agents 1, 2, . [sent-292, score-0.898]

46 Finally, the oracle is queried whether M ′ has positive performance for all agents under all stationary policies. [sent-299, score-0.846]

47 Hence, the algorithm shows the competing agents problem to be in NEXP^NP. [sent-302, score-0.773]

48 4 For every k ≥ 2, the competing agents problem for 2k agents is hard for NEXP^NP. [sent-304, score-1.517]

49 Proof We give a reduction from the exponential Σ2 square tiling problem to the competing agents problem. [sent-305, score-1.09]

50 Let T = (T, 1^k) be an instance of the exponential Σ2 square tiling problem, where T = (V, H) is a tile type. [sent-306, score-0.393]

51 The basic idea for checking tilings with POSGs for two agents stems from Bernstein et al. [sent-308, score-0.765]

52 [1], but we give a slight simplification of their proof technique, and in fact have to extend it for four agents later on. [sent-309, score-0.727]

53 The POSG is constructed so that on every trajectory each agent sees a position in the square. [sent-310, score-0.245]

54 The only action of the agent that has impact on the process is putting a tile on the given position. [sent-312, score-0.279]

55 In fact, the same position is observed by the agents in different states of the POSG. [sent-313, score-0.782]

56 The first part checks whether both agents know the same tiling, without checking that it is a correct tiling. [sent-315, score-0.867]

57 In the state where the agents are asked to put their tiles on the given position, a high negative reward is obtained if the agents put different tiles on that position. [sent-316, score-2.007]

58 The second part checks whether the tiling is correct. [sent-318, score-0.314]

59 The idea is to give both the agents neighboured positions in the square and to ask each which tile she puts on that position. [sent-319, score-1.017]

60 Notice that the agents do not know in which part of the process they are. [sent-320, score-0.783]

61 A high negative reward will be obtained if the agents’ tiles do not fit together. [sent-323, score-0.236]

62 For the first part, we need to construct a POSG Pk for two agents that allows both agents to make the same sequence of observations consisting of 2k bits. [sent-324, score-0.792]

63 At this state, it will be checked whether both agents put the same tile at this position (see later on). [sent-327, score-0.918]

64 The task of Pk is to provide both agents with the same position. [sent-328, score-0.727]

65 The observation of agent 1 is written on the left hand side of the states, and the observations of agent 2 on the right hand side. [sent-332, score-0.326]

66 In Pk both agents always make the same observations. [sent-334, score-0.727]

67 The goal is to provide both agents with neighboured positions in the square. [sent-336, score-0.837]

68 Eventually, it is checked whether the tiles they put on the neighboured positions are consistent with the tile type T. [sent-337, score-0.433]

69 The POSG Cl,k is intended to provide the agents with two neighboured positions in the same row, where l is the index of the leftmost bit of the column encoding in which the two positions differ. [sent-353, score-0.863]

70 The “final state” of Cl,k is the state hori, from which it is checked whether the agents put horizontally fitting tiles together. [sent-356, score-1.086]

71 In the same way, a POSG Rl,k can be constructed (R stands for row), whose task is to check whether two tiles in neighboured rows correspond to a correct tiling. [sent-357, score-0.3]

72 This POSG has the final state vert, from which it is checked whether two tiles fit vertically. [sent-358, score-0.308]

73 From state same the state good is reached if both agents take the same action; otherwise bad is reached. [sent-364, score-1.0]

74 From state hori the state good is reached if action a1 by agent 1 and a2 by agent 2 make a pair (a1, a2) in H, i. [sent-365, score-0.569]

75 Similarly, from state vert the state good is reached if action a1 by agent 1 and a2 by agent 2 make a pair (a1, a2) in V. [sent-368, score-0.554]

76 From state good the state sink is reached on every action with reward 1, and from state bad the state sink is reached on every action with reward −(2^(2k+2)). [sent-370, score-1.053]

77 When the state sink is reached, the process stays there on any action, and all agents obtain reward 0. [sent-371, score-1.01]
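
The end of this gadget is a small lookup: from the check states same, hori, and vert, the two submitted tiles determine whether good or bad is entered, and the step into sink then pays 1 or −(2^(2k+2)). Below is a sketch of that bookkeeping (function name mine; H and V are the two sets of the tile type T = (V, H)).

```python
from typing import Set, Tuple

Tile = str

def gadget_reward(check_state: str, tile1: Tile, tile2: Tile,
                  H: Set[Tuple[Tile, Tile]], V: Set[Tuple[Tile, Tile]], k: int) -> int:
    """Reward paid on the step into sink after the two agents commit their tiles.

    check_state is 'same', 'hori', or 'vert'; tile1 and tile2 are the tiles chosen
    by agent 1 and agent 2 for the (possibly neighboured) positions they observed.
    """
    if check_state == 'same':
        ok = tile1 == tile2                    # both agents must place the same tile
    elif check_state == 'hori':
        ok = (tile1, tile2) in H               # horizontally neighboured tiles must fit
    elif check_state == 'vert':
        ok = (tile1, tile2) in V               # vertically neighboured tiles must fit
    else:
        raise ValueError(f"unknown check state: {check_state}")
    return 1 if ok else -(2 ** (2 * k + 2))    # via good: reward 1; via bad: -(2^(2k+2))
```

The size of the penalty is what lets even a single trajectory ending in bad ruin the overall performance, which is how positive performance gets tied to the existence of a correct tiling in Claim 2.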

78 From these POSGs we construct a POSG T2,k that checks whether two agents know the same correct tiling for a 2^k × 2^k square, as described above. [sent-374, score-1.046]

79 The initial state of each part can be reached with one step from the initial state s0 of T2,k . [sent-376, score-0.354]

80 • P2k with initial state s (checks whether two agents have the same tiling) • For each l = 1, 2, . [sent-378, score-0.885]

81 [Figure 4: T2,k — node labels: s0, sk, c1, ck, r1, rk, Pk, C1,k, Ck,k, R1,k, Rk,k, hori, vert, same, good, bad, sink] • For each l = 1, 2, . [sent-383, score-0.278]

82 From the initial state s0 of T2,k, each of the initial states of the parts is reachable independent of the action chosen by the agents. [sent-390, score-0.267]

83 When a state same, hori, or vert is reached, each agent has made 2k + 3 observations, where the first and last are ε and the remaining 2k are each in {0, 1}. [sent-398, score-0.278]

84 Such a state is the only one where the actions of the agents have impact on the process. [sent-399, score-0.877]

85 The agents can win if they both know the same correct tiling and interpret the sequence of observations as the position in the grid they are asked to put a tile on. [sent-401, score-1.143]

86 On the other hand, if both agents know different tilings or the tiling they share is not correct, then at least one trajectory will end in a bad state and yield reward −(2^(2k+2)). [sent-402, score-1.264]

87 Claim 2 Let (T, 1^k) be an instance of the exponential square tiling problem. [sent-404, score-0.317]

88 (2) There exists a T-tiling of the 2^k square if and only if there exist policies for the agents under which T2,k has performance > 0. [sent-406, score-0.994]

89 If there exists a T-tiling of the 2^k square, both agents use the same policy according to this tiling. [sent-409, score-0.795]

90 For the other direction: if there exist policies for the agents under which T2,k has performance > 0, then state bad is not reached. [sent-412, score-1.037]

91 The POSG for the competing agents problem with 4 agents consists of three parts. [sent-415, score-1.5]

92 It is used to check whether the first square can be tiled correctly (by agents 1 and 2). [sent-417, score-0.888]

93 In this part, the negative rewards are increased in a way that guarantees the performance of the POSG to be negative whenever agents 1 and 2 do not correctly tile their square. [sent-418, score-0.87]

94 It is used to check whether the second square can be tiled correctly (by agents 3 and 4). [sent-420, score-0.888]

95 Whenever state bad is left in this copy, reward 0 is obtained, and whenever state good is left, reward −1 is obtained. [sent-421, score-0.403]

96 The third part checks whether agent 1 puts the same tiles into the last row of its square as agent 3 puts into the first row of its square. [sent-422, score-0.745]

97 If agents 1 and 2 have a tiling for the first square, the performance of the first part equals 1. [sent-426, score-0.98]

98 • If agents 3 and 4 are able to continue this tiling through their square, the performance of the second part equals −1 and the performance of the third part equals 0. [sent-427, score-1.036]

99 • If agents 3 and 4 are not able to continue this tiling through their square, then the performance of part 2 and part 3 is strictly greater than −1. [sent-429, score-0.99]

100 5 For every k ≥ 2, the competing agents problem for 2k agents is complete for NEXP^NP. [sent-435, score-1.535]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('agents', 0.727), ('posg', 0.398), ('tiling', 0.197), ('policies', 0.171), ('nexpnp', 0.169), ('tiles', 0.157), ('agent', 0.142), ('sink', 0.116), ('posgs', 0.097), ('state', 0.088), ('cooperative', 0.085), ('neighboured', 0.084), ('square', 0.081), ('reward', 0.079), ('tile', 0.076), ('perf', 0.072), ('ak', 0.067), ('reached', 0.067), ('hori', 0.063), ('trajectory', 0.061), ('stationary', 0.059), ('players', 0.055), ('policy', 0.053), ('checks', 0.053), ('bad', 0.051), ('bernstein', 0.051), ('rewards', 0.049), ('vert', 0.048), ('actions', 0.047), ('competing', 0.046), ('action', 0.046), ('prob', 0.042), ('turing', 0.042), ('exponential', 0.039), ('initial', 0.039), ('tilings', 0.038), ('pk', 0.037), ('mundhenk', 0.036), ('nondeterministic', 0.036), ('part', 0.033), ('checked', 0.032), ('whether', 0.031), ('states', 0.03), ('row', 0.03), ('game', 0.029), ('trajectories', 0.029), ('oracle', 0.029), ('observable', 0.029), ('check', 0.028), ('put', 0.027), ('observations', 0.027), ('positions', 0.026), ('utilities', 0.025), ('parts', 0.025), ('position', 0.025), ('complexity', 0.025), ('domino', 0.024), ('emde', 0.024), ('horizontically', 0.024), ('jena', 0.024), ('judy', 0.024), ('nexp', 0.024), ('savelsbergh', 0.024), ('guess', 0.023), ('puts', 0.023), ('equals', 0.023), ('sequence', 0.023), ('know', 0.023), ('pomdps', 0.022), ('query', 0.021), ('cooperate', 0.021), ('decentralized', 0.021), ('tiled', 0.021), ('stochastic', 0.021), ('transition', 0.021), ('player', 0.02), ('goldsmith', 0.019), ('partially', 0.019), ('games', 0.019), ('asked', 0.018), ('copy', 0.018), ('complete', 0.018), ('whenever', 0.018), ('papadimitriou', 0.018), ('every', 0.017), ('tr', 0.017), ('decision', 0.017), ('tl', 0.016), ('ri', 0.016), ('strategy', 0.016), ('winning', 0.015), ('pomdp', 0.015), ('exists', 0.015), ('impact', 0.015), ('transitions', 0.015), ('compete', 0.015), ('observation', 0.015), ('construct', 0.015), ('steps', 0.014), ('decided', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 52 nips-2007-Competition Adds Complexity

Author: Judy Goldsmith, Martin Mundhenk

Abstract: It is known that determining whether a DEC-POMDP, namely, a cooperative partially observable stochastic game (POSG), has a cooperative strategy with positive expected reward is complete for NEXP. It was not known until now how cooperation affected that complexity. We show that, for competitive POSGs, the complexity of determining whether one team has a positive-expected-reward strategy is complete for NEXP^NP.

2 0.13341561 55 nips-2007-Computing Robust Counter-Strategies

Author: Michael Johanson, Martin Zinkevich, Michael Bowling

Abstract: Adaptation to other initially unknown agents often requires computing an effective counter-strategy. In the Bayesian paradigm, one must find a good counterstrategy to the inferred posterior of the other agents’ behavior. In the experts paradigm, one may want to choose experts that are good counter-strategies to the other agents’ expected behavior. In this paper we introduce a technique for computing robust counter-strategies for adaptation in multiagent scenarios under a variety of paradigms. The strategies can take advantage of a suspected tendency in the decisions of the other agents, while bounding the worst-case performance when the tendency is not observed. The technique involves solving a modified game, and therefore can make use of recently developed algorithms for solving very large extensive games. We demonstrate the effectiveness of the technique in two-player Texas Hold’em. We show that the computed poker strategies are substantially more robust than best response counter-strategies, while still exploiting a suspected tendency. We also compose the generated strategies in an experts algorithm showing a dramatic improvement in performance over using simple best responses. 1

3 0.09947028 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini

Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1

4 0.098471224 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria

Author: Elad Hazan, Satyen Kale

Abstract: We study the relation between notions of game-theoretic equilibria which are based on stability under a set of deviations, and empirical equilibria which are reached by rational players. Rational players are modeled by players using no regret algorithms, which guarantee that their payoff in the long run is close to the maximum they could hope to achieve by consistently deviating from the algorithm’s suggested action. We show that for a given set of deviations over the strategy set of a player, it is possible to efficiently approximate fixed points of a given deviation if and only if there exist efficient no regret algorithms resistant to the deviations. Further, we show that if all players use a no regret algorithm, then the empirical distribution of their plays converges to an equilibrium. 1

5 0.09191817 162 nips-2007-Random Sampling of States in Dynamic Programming

Author: Chris Atkeson, Benjamin Stephens

Abstract: We combine three threads of research on approximate dynamic programming: sparse random sampling of states, value function and policy approximation using local models, and using local trajectory optimizers to globally optimize a policy and associated value function. Our focus is on finding steady state policies for deterministic time invariant discrete time control problems with continuous states and actions often found in robotics. In this paper we show that we can now solve problems we couldn’t solve previously. 1

6 0.09103255 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

7 0.084077768 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning

8 0.068807647 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning

9 0.066650175 215 nips-2007-What makes some POMDP problems easy to approximate?

10 0.063987076 30 nips-2007-Bayes-Adaptive POMDPs

11 0.061948221 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

12 0.059253298 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs

13 0.054250546 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

14 0.053296413 102 nips-2007-Incremental Natural Actor-Critic Algorithms

15 0.052447837 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

16 0.049473092 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

17 0.04519115 86 nips-2007-Exponential Family Predictive Representations of State

18 0.040590908 165 nips-2007-Regret Minimization in Games with Incomplete Information

19 0.040514641 163 nips-2007-Receding Horizon Differential Dynamic Programming

20 0.038832314 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.097), (1, -0.181), (2, 0.047), (3, -0.037), (4, 0.008), (5, 0.037), (6, -0.017), (7, 0.018), (8, 0.035), (9, 0.077), (10, -0.068), (11, -0.067), (12, 0.061), (13, -0.058), (14, 0.023), (15, 0.008), (16, 0.005), (17, 0.022), (18, 0.031), (19, -0.03), (20, 0.062), (21, 0.012), (22, -0.035), (23, 0.052), (24, -0.058), (25, 0.024), (26, 0.12), (27, 0.062), (28, -0.009), (29, -0.081), (30, -0.053), (31, 0.031), (32, 0.043), (33, 0.052), (34, -0.039), (35, -0.012), (36, 0.005), (37, -0.089), (38, -0.004), (39, -0.067), (40, -0.034), (41, -0.027), (42, -0.062), (43, 0.034), (44, 0.048), (45, 0.034), (46, -0.068), (47, -0.038), (48, -0.077), (49, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94578052 52 nips-2007-Competition Adds Complexity

Author: Judy Goldsmith, Martin Mundhenk

Abstract: It is known that determining whether a DEC-POMDP, namely, a cooperative partially observable stochastic game (POSG), has a cooperative strategy with positive expected reward is complete for NEXP. It was not known until now how cooperation affected that complexity. We show that, for competitive POSGs, the complexity of determining whether one team has a positive-expected-reward strategy is complete for NEXP^NP.

2 0.70213568 55 nips-2007-Computing Robust Counter-Strategies

Author: Michael Johanson, Martin Zinkevich, Michael Bowling

Abstract: Adaptation to other initially unknown agents often requires computing an effective counter-strategy. In the Bayesian paradigm, one must find a good counterstrategy to the inferred posterior of the other agents’ behavior. In the experts paradigm, one may want to choose experts that are good counter-strategies to the other agents’ expected behavior. In this paper we introduce a technique for computing robust counter-strategies for adaptation in multiagent scenarios under a variety of paradigms. The strategies can take advantage of a suspected tendency in the decisions of the other agents, while bounding the worst-case performance when the tendency is not observed. The technique involves solving a modified game, and therefore can make use of recently developed algorithms for solving very large extensive games. We demonstrate the effectiveness of the technique in two-player Texas Hold’em. We show that the computed poker strategies are substantially more robust than best response counter-strategies, while still exploiting a suspected tendency. We also compose the generated strategies in an experts algorithm showing a dramatic improvement in performance over using simple best responses. 1

3 0.5652768 165 nips-2007-Regret Minimization in Games with Incomplete Information

Author: Martin Zinkevich, Michael Johanson, Michael Bowling, Carmelo Piccione

Abstract: Extensive games are a powerful model of multiagent decision-making scenarios with incomplete information. Finding a Nash equilibrium for very large instances of these games has received a great deal of recent attention. In this paper, we describe a new technique for solving large games based on regret minimization. In particular, we introduce the notion of counterfactual regret, which exploits the degree of incomplete information in an extensive game. We show how minimizing counterfactual regret minimizes overall regret, and therefore in self-play can be used to compute a Nash equilibrium. We demonstrate this technique in the domain of poker, showing we can solve abstractions of limit Texas Hold’em with as many as 1012 states, two orders of magnitude larger than previous methods. 1

4 0.5600481 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning

Author: Umar Syed, Robert E. Schapire

Abstract: We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. We follow on the work of Abbeel and Ng [1] who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert’s, given enough examples. However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert’s. Moreover, our algorithm is computationally faster, is easier to implement, and can be applied even in the absence of an expert. The method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire [2] for playing repeated matrix games. In addition to our formal presentation and analysis of the new algorithm, we sketch how the method can be applied when the transition function itself is unknown, and we provide an experimental demonstration of the algorithm on a toy video-game environment. 1

5 0.52701223 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning

Author: Gerald Tesauro, Rajarshi Das, Hoi Chan, Jeffrey Kephart, David Levine, Freeman Rawson, Charles Lefurgy

Abstract: Electrical power management in large-scale IT systems such as commercial datacenters is an application area of rapidly growing interest from both an economic and ecological perspective, with billions of dollars and millions of metric tons of CO2 emissions at stake annually. Businesses want to save power without sacrificing performance. This paper presents a reinforcement learning approach to simultaneous online management of both performance and power consumption. We apply RL in a realistic laboratory testbed using a Blade cluster and dynamically varying HTTP workload running on a commercial web applications middleware platform. We embed a CPU frequency controller in the Blade servers’ firmware, and we train policies for this controller using a multi-criteria reward signal depending on both application performance and CPU power consumption. Our testbed scenario posed a number of challenges to successful use of RL, including multiple disparate reward functions, limited decision sampling rates, and pathologies arising when using multiple sensor readings as state variables. We describe innovative practical solutions to these challenges, and demonstrate clear performance improvements over both hand-designed policies as well as obvious “cookbook” RL implementations. 1

6 0.50257951 162 nips-2007-Random Sampling of States in Dynamic Programming

7 0.50073111 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

8 0.4719049 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

9 0.46777421 171 nips-2007-Scan Strategies for Meteorological Radars

10 0.45067498 215 nips-2007-What makes some POMDP problems easy to approximate?

11 0.43895787 30 nips-2007-Bayes-Adaptive POMDPs

12 0.41578886 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

13 0.39923951 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

14 0.39886007 57 nips-2007-Congruence between model and human attention reveals unique signatures of critical visual events

15 0.36308986 163 nips-2007-Receding Horizon Differential Dynamic Programming

16 0.34804174 185 nips-2007-Stable Dual Dynamic Programming

17 0.32464847 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria

18 0.30856076 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

19 0.29512489 191 nips-2007-Temporal Difference Updating without a Learning Rate

20 0.29145092 208 nips-2007-TrueSkill Through Time: Revisiting the History of Chess


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.363), (5, 0.016), (13, 0.054), (16, 0.013), (18, 0.015), (21, 0.034), (29, 0.013), (31, 0.046), (32, 0.013), (34, 0.016), (35, 0.025), (42, 0.017), (47, 0.094), (49, 0.016), (83, 0.089), (85, 0.018), (87, 0.011), (90, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74557388 52 nips-2007-Competition Adds Complexity

Author: Judy Goldsmith, Martin Mundhenk

Abstract: It is known that determining whether a DEC-POMDP, namely, a cooperative partially observable stochastic game (POSG), has a cooperative strategy with positive expected reward is complete for NEXP. It was not known until now how cooperation affected that complexity. We show that, for competitive POSGs, the complexity of determining whether one team has a positive-expected-reward strategy is complete for NEXP^NP.

2 0.55306119 134 nips-2007-Multi-Task Learning via Conic Programming

Author: Tsuyoshi Kato, Hisashi Kashima, Masashi Sugiyama, Kiyoshi Asai

Abstract: When we have several related tasks, solving them simultaneously is shown to be more effective than solving them individually. This approach is called multi-task learning (MTL) and has been studied extensively. Existing approaches to MTL often treat all the tasks as uniformly related to each other and the relatedness of the tasks is controlled globally. For this reason, the existing methods can lead to undesired solutions when some tasks are not highly related to each other, and some pairs of related tasks can have significantly different solutions. In this paper, we propose a novel MTL algorithm that can overcome these problems. Our method makes use of a task network, which describes the relation structure among tasks. This allows us to deal with intricate relation structures in a systematic way. Furthermore, we control the relatedness of the tasks locally, so all pairs of related tasks are guaranteed to have similar solutions. We apply the above idea to support vector machines (SVMs) and show that the optimization problem can be cast as a second order cone program, which is convex and can be solved efficiently. The usefulness of our approach is demonstrated through simulations with protein super-family classification and ordinal regression problems.

3 0.38890857 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini

Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1

4 0.37427205 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

Author: Matthew Hoffman, Arnaud Doucet, Nando D. Freitas, Ajay Jasra

Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulations methods in situations where the reward is a rare event.

5 0.37222269 63 nips-2007-Convex Relaxations of Latent Variable Training

Author: Yuhong Guo, Dale Schuurmans

Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1

6 0.37086925 86 nips-2007-Exponential Family Predictive Representations of State

7 0.36937651 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

8 0.36734292 84 nips-2007-Expectation Maximization and Posterior Constraints

9 0.36633766 100 nips-2007-Hippocampal Contributions to Control: The Third Way

10 0.36563301 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

11 0.36433676 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

12 0.36405471 115 nips-2007-Learning the 2-D Topology of Images

13 0.36404803 105 nips-2007-Infinite State Bayes-Nets for Structured Domains

14 0.36275652 102 nips-2007-Incremental Natural Actor-Critic Algorithms

15 0.36261904 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning

16 0.36237568 24 nips-2007-An Analysis of Inference with the Universum

17 0.36221099 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

18 0.36218658 43 nips-2007-Catching Change-points with Lasso

19 0.36133161 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

20 0.36049393 69 nips-2007-Discriminative Batch Mode Active Learning