nips nips2009 nips2009-221 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Liam MacDermed, Charles L. Isbell
Abstract: Solving multi-agent reinforcement learning problems has proven difficult because of the lack of tractable algorithms. We provide the first approximation algorithm which solves stochastic games with cheap-talk to within ε absolute error of the optimal game-theoretic solution, in time polynomial in 1/ε. Our algorithm extends Murray’s and Gordon’s (2007) modified Bellman equation which determines the set of all possible achievable utilities; this provides us a truly general framework for multi-agent learning. Further, we empirically validate our algorithm and find the computational cost to be orders of magnitude less than what the theory predicts.
Reference: text
sentIndex sentText sentNum sentScore
1 1 Introduction In reinforcement learning, Bellman’s dynamic programming equation is typically viewed as a method for determining the value function — the maximum achievable utility at each state. [sent-9, score-0.294]
2 In this paper we seek to find the set of all achievable joint utilities (a vector of utilities, one for each player). [sent-12, score-0.229]
3 The first problem is caused by the intolerance of an equilibrium to error, and the second results from a potential need for an unbounded number of points to define the closed convex hull that is each state’s feasible-set. [sent-18, score-0.535]
4 We solve the first problem by targeting -equilibria instead of exact equilibria, and we solve the second by approximating the hull with a bounded number of points. [sent-19, score-0.278]
5 2 Agenda We model the world as a fully-observable n-player stochastic game with cheap talk (communication between agents that does not affect rewards). [sent-22, score-0.426]
6 Stochastic games (also called Markov games) are the natural multi-agent extension of Markov decision processes with actions being joint actions and rewards being a vector of rewards, one to each player. [sent-23, score-0.364]
7 We assume an implicit inclusion of past joint actions as part of the state (we actually only rely on log2(n) + 1 bits of history recording whether, and which player, has defected). [sent-24, score-0.191]
8 We also assume that each player is rational in the game-theoretic sense. [sent-25, score-0.475]
9 Our goal is to produce a joint policy that is Pareto-optimal (no other viable joint policy gives a player more utility without lowering another player’s utility), fair (players agree on the joint policy), and in equilibrium (no player can gain by deviating from the joint policy). [sent-26, score-1.705]
10 We factor out the various game-theoretic elements of the problem by taking in three functions which compute, in turn: the equilibrium Feq (such as correlated equilibrium), the threat Fth (such as grim trigger), and the bargaining solution Fbs (such as the Nash bargaining solution). [sent-29, score-0.917]
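A minimal interface sketch (not the authors' code) of how these three game-theoretic choices plug into a generic solver; every name below is hypothetical, and the bargaining rule shown is a simple utilitarian stand-in rather than the Nash bargaining solution mentioned above.

```python
# Sketch only: hypothetical interfaces for the three pluggable functions.
from typing import Callable, Dict, List, Tuple

UtilityVector = Tuple[float, ...]      # one entry per player
FeasibleSet = List[UtilityVector]      # extreme points of a closed convex hull

Feq = Callable[[Dict[str, FeasibleSet]], FeasibleSet]   # equilibrium filter, e.g. correlated equilibrium
Fth = Callable[[str, int], float]                       # threat value for (state, player), e.g. grim trigger
Fbs = Callable[[FeasibleSet], UtilityVector]            # bargaining solution over a feasible-set

def utilitarian_bargaining(feasible: FeasibleSet) -> UtilityVector:
    """A toy Fbs: pick the extreme point with the largest sum of utilities."""
    return max(feasible, key=sum)

start_state_set: FeasibleSet = [(1.0, -0.5), (0.0, 0.0), (2.0, -1.0)]  # illustrative numbers only
print(utilitarian_bargaining(start_state_set))   # -> (2.0, -1.0)
```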
11 Consider a simple game (Figure 1-A). [sent-37, score-0.225]
12 Figure 1: A) The breakup game. B) The final feasible-set for player 1’s state. [sent-43, score-0.491]
13 This game has four states with two terminal states. [sent-45, score-0.303]
14 In the two middle states play alternates between the two players until one of the players decides to exit the game. [sent-46, score-0.37]
15 In this game the only equilibria are stochastic (e.g., the randomized policy of each player passing and exiting with probability 1/2). [sent-47, score-0.501]
17 In each state only one of the agents takes an action, so an algorithm that depends only on a value function will myopically choose to deterministically take the best action, and never converge to the stochastic equilibrium. [sent-50, score-0.222]
18 This result exposed the inadequacy of value functions to capture cyclic equilibria (where the equilibrium policy may revisit a state). [sent-51, score-0.751]
19 One complaint about previous approaches is that they target only stage-game equilibria and not full-game equilibria, potentially ignoring much better solutions (Shoham & Grenager, 2006). [sent-53, score-0.462]
20 Our approach solves this problem and allows a full-game equilibrium to be reached. [sent-54, score-0.301]
21 Another complaint goes even further, challenging the desire to target equilibria at all (Shoham et al.). [sent-55, score-0.267]
22 Game theorists have shown us that equilibrium solutions are correct when agents are rational (infinitely intelligent), so the argument against targeting equilibria boils down to either assuming other agents are not infinitely intelligent (which is reasonable) or that finding equilibria is not computationally tractable (which we tackle here). [sent-57, score-0.61]
23 (Footnote 1: The precise meaning of fair, and the type of equilibrium, is intentionally left unspecified for generality.) [sent-61, score-0.301]
25 We believe that although MAL (multi-agent learning) is primarily concerned with the case when agents are not fully rational, first assuming agents are rational and subsequently relaxing this assumption will prove to be an effective approach. [sent-63, score-0.286]
26 In their later technical report (Murray & Gordon, June 2007) they provided an exact solution equivalent to our solution targeting subgame perfect correlated equilibrium with credible threats, while using the Nash bargaining solution for equilibrium selection. [sent-65, score-0.969]
27 4 Exact feasible-set solution The key idea needed to extend reinforcement learning into multi-agent domains is to replace the value-function, V (s), in Bellman’s dynamic program with a feasible-set function – a mapping from state to feasible-set. [sent-68, score-0.208]
28 As a group of n agents follow a joint-policy, each player i receives rewards. [sent-69, score-0.564]
29 As this set contains all possible joint-utilities, it will contain the optimal joint-utility for any definition of optimal (the bargaining solution Fbs will select the utility vector it deems optimal). [sent-75, score-0.217]
30 Recall that agents care only about the utility they achieve and not the specific policy used. [sent-77, score-0.348]
31 The set is a closed convex hull whose three extreme points are shown in Figure 1-B. [sent-80, score-0.194]
32 This feasible-set depicts the fact that, when starting in player 1’s state, any full-game equilibrium will result in a joint-utility that is some weighted average of these three points. [sent-84, score-0.947]
33 One of these points can be achieved, for example, by having player 1 always pass and player 2 exit with a fixed probability. [sent-86, score-0.964]
34 If player 2 tries to cheat by passing when they are supposed to exit, player 1 will immediately exit in retaliation (recall that history is implicitly included in the state). [sent-88, score-0.964]
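To make the feasible-set object concrete, here is a small illustration (our own, with made-up numbers rather than the figure's exact values) of representing a set by its extreme points and reading off one achievable joint utility as a convex combination of them.

```python
# Illustrative only: a feasible-set stored as its extreme points.
import numpy as np

extreme_points = np.array([[1.0, -0.5],   # a player-1-favoring outcome (hypothetical)
                           [2.0, -1.0],   # another extreme outcome (hypothetical)
                           [0.0,  0.0]])  # both players exit immediately

weights = np.array([0.5, 0.0, 0.5])       # any non-negative weights summing to 1
achievable_utility = weights @ extreme_points
print(achievable_utility)                 # e.g. [ 0.5  -0.25]
```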
35 An exact dynamic programming solution falls out naturally after replacing the value-function in Bellman’s dynamic program with a feasible-set function; however, the change in variable dimension complicates the backup. [sent-89, score-0.215]
36 An illustration of the modified backup is shown in Figure 2, where steps A-C solve for the action-feasible-set (Q(s, a)), and steps D-E solve for V (s) given Q(s, a). [sent-90, score-0.187]
37 We assume an equilibrium filter function Feq is provided to the algorithm, which is applied to eliminate non-equilibrium policies. [sent-92, score-0.301]
38 Each iteration of the backup then contracts the feasible-sets, eliminating unachievable utility-vectors. [sent-96, score-0.294]
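A compact sketch of how one such backup could look. The representation (sets as lists of extreme points, two players only) and all helper names are our own simplifications, and the equilibrium filter Feq is passed in rather than implemented.

```python
# Sketch of one feasible-set backup (our reading of the modified Bellman equation).
import itertools
from typing import Callable, Dict, List, Tuple

Utility = Tuple[float, float]            # 2-player case for brevity
FeasibleSet = List[Utility]

def expected_set(successors: List[Tuple[float, FeasibleSet]]) -> FeasibleSet:
    """Steps C-D of Figure 2: for each combination of extreme points, one from
    each successor state, take the probability-weighted average."""
    probs = [p for p, _ in successors]
    out = []
    for combo in itertools.product(*[s for _, s in successors]):
        u = tuple(sum(p * v[i] for p, v in zip(probs, combo)) for i in range(2))
        out.append(u)
    return out

def backup_state(reward: Dict[str, Utility],
                 gamma: float,
                 transitions: Dict[str, List[Tuple[float, FeasibleSet]]],
                 f_eq: Callable[[Dict[str, FeasibleSet]], FeasibleSet]) -> FeasibleSet:
    """One backup of V(s): build Q(s, a) for every joint action a, then let the
    equilibrium filter Feq combine them (step E of Figure 2)."""
    q = {}
    for a, succ in transitions.items():
        r = reward[a]
        q[a] = [(r[0] + gamma * u[0], r[1] + gamma * u[1]) for u in expected_set(succ)]
    return f_eq(q)
```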
39 A more detailed examination of the exact algorithm, including a formal treatment of the backup, various game-theoretic issues, and convergence proofs, is given in Murray and Gordon’s technical report (June 2007). [sent-99, score-0.289]
40 The first problem is that the size of the game itself is exponential in the number of agents, because the number of joint actions is exponential in the number of agents. [sent-102, score-0.489]
41 Figure 2: An example of the backup step (one iteration of our modified Bellman equation); the final panels show the feasible sets of all joint actions and (E) the feasible set of the initial state. [sent-108, score-0.432]
42 The state being calculated (shown in the figure) is an initial rock-paper-scissors game played to decide who goes first in the breakup game from Figure 1. [sent-109, score-0.591]
43 The backup shown depicts the 2nd iteration of the dynamic program when feasible-sets are initialized to (0,0) and binding contracts are allowed (Feq = set union). [sent-111, score-0.369]
44 For each combination of points from each successor state the expected value is found (in this case 1/2 of the bottom and 1/2 of the top). [sent-113, score-0.166]
45 Finally, in step E, the feasible outcomes of all joint actions are fed into Feq to yield the updated feasibility set of our state. [sent-119, score-0.214]
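As a concrete check of steps C-D, using illustrative extreme points rather than the figure's actual values:

```python
# Each combination of extreme points (one per successor state) is averaged
# with the transition probabilities, here 1/2 and 1/2.
import itertools

succ_a = [(0.0, 0.0), (1.0, -2.0)]      # extreme points of successor state A (illustrative)
succ_b = [(2.0, -1.0)]                  # extreme points of successor state B (illustrative)
probs = (0.5, 0.5)

expected = [tuple(probs[0] * x + probs[1] * y for x, y in zip(pa, pb))
            for pa, pb in itertools.product(succ_a, succ_b)]
print(expected)   # [(1.0, -0.5), (1.5, -1.5)]
```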
46 This exponential blow-up is unavoidable unless we approximate the game itself, which is outside the scope of this paper. [sent-121, score-0.225]
47 The second problem is that although the exact algorithm always converges, it is not guaranteed to converge in finite time (during the equilibrium backup, an arbitrarily small update can lead to a drastically large change in the resulting contracted set). [sent-122, score-0.365]
48 The approximation scheme yields a solution that is an ε1/(1 − γ)-equilibrium of the full game while guaranteeing there exists no exact equilibrium that Pareto-dominates the solution’s utility. [sent-127, score-0.701]
49 This means that despite not being able to calculate the true utilities at each stage game, if other players did know the true utilities they would gain no more than ε1/(1 − γ) by defecting. [sent-128, score-0.298]
50 By targeting an ε1/(1 − γ)-equilibrium we do not mean that the backup’s equilibrium filter function Feq is an ε-equilibrium (it could be, although making it such would do nothing to alleviate the convergence problem). [sent-130, score-0.394]
51 (i.e., if agents followed a prescribed policy they would find their actual rewards to be no more than ε1/(1 − γ) below what was promised). [sent-137, score-0.307]
52 In other words, after a backup each feasible-set is in equilibrium (according to the filter function) with respect to the previous iteration’s estimation. [sent-139, score-0.488]
53 If that previous estimation is off by at most ε1/(1 − γ), then the most any one player could gain by deviating is ε1/(1 − γ). [sent-140, score-0.439]
54 Because we are only checking for a stopping condition, and not explicitly targeting the ε1/(1 − γ)-equilibrium in the backup, we can’t guarantee that the algorithm will terminate with the best ε1/(1 − γ)-equilibrium. [sent-141, score-0.346]
55 Instead we can guarantee that when we do terminate, our feasible-sets contain all equilibria satisfying our original equilibrium filter and no equilibrium with incentive greater than ε1/(1 − γ) to deviate. [sent-142, score-0.934]
56 However, we want to ensure that our feasible-set estimation is always an overestimate and not an underestimate; otherwise the equilibrium contraction step may erroneously eliminate valid policies. [sent-147, score-0.467]
57 On the other hand, by targeting an ε1-equilibrium we can terminate if the backups fail to make ε1 progress. [sent-153, score-0.202]
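One simple way to detect that the backups have stopped making ε1 progress is to track how much each set's support values decrease per iteration, along a fixed set of directions (the same device used by the approximation scheme described next). The helper names and numbers below are our own.

```python
# Sketch of the termination test: stop once no feasible-set shrinks by more
# than eps1 in any of the fixed directions, then report the full-game bound.
import numpy as np

def support(points: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Support value of a convex hull (given by its points) along each direction."""
    return (directions @ points.T).max(axis=1)

def max_shrinkage(old_sets, new_sets, directions) -> float:
    """Largest decrease of any support value across all states."""
    return max(float(np.max(support(o, directions) - support(n, directions)))
               for o, n in zip(old_sets, new_sets))

directions = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])  # coarse example
old_sets = [np.array([[0.0, 0.0], [2.0, -1.0], [1.0, 1.0]])]               # illustrative
new_sets = [np.array([[0.0, 0.0], [1.9, -1.0], [1.0, 0.95]])]

eps1, gamma = 0.01, 0.9
progress = max_shrinkage(old_sets, new_sets, directions)
if progress < eps1:
    print("terminate; full-game error is at most", eps1 / (1 - gamma))
else:
    print("keep iterating; last contraction was", progress)
```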
58 B) The hull from (A) is approximated using halfspaces from a given regular approximation of a Euclidean ball. [sent-162, score-0.272]
59 We take a fixed set of hyperplanes which form a regular approximation of a Euclidean ball, such that the hyperplanes’ normals form an angle of at most θ with their neighbors. [sent-164, score-0.173]
60 We then project these halfspaces onto the polytope we wish to approximate. [sent-167, score-0.18]
61 After removing redundant hyperplanes the resulting polytope is returned as the approximation (Figure 3B). [sent-170, score-0.189]
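A 2-D sketch of this projection step (the redundant-hyperplane removal is omitted); the construction of the fixed normals and the support-value computation are our reading of the description above, with illustrative points.

```python
# Fix unit normals spaced around the circle, then "project" each halfspace
# onto the target hull by setting its offset to the hull's support value in
# that direction, giving an outer approximation with a bounded number of faces.
import numpy as np

def approximate_hull(points: np.ndarray, n_directions: int = 12):
    angles = np.linspace(0.0, 2.0 * np.pi, n_directions, endpoint=False)
    normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # regular 2-D "ball"
    offsets = (normals @ points.T).max(axis=1)                     # tight support per direction
    return normals, offsets      # approximation = {x : normals @ x <= offsets}

pts = np.array([[1.0, -0.5], [2.0, -1.0], [0.0, 0.0]])             # illustrative feasible-set
normals, offsets = approximate_hull(pts)
print(offsets.round(3))
```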
62 3 Computing expected feasible-sets Another difficulty occurs during the backup of Q(s, a). [sent-178, score-0.187]
63 An optimized version of the algorithm described in this paper would only need the frontier, not the full set, as calculating the frontier depends only on the frontier (unless the threat function needs the entire set). [sent-184, score-0.307]
64 Just as our modified view of the Bellman equation tries to find the entire set of achievable policy payoffs, so too can we view linear programming as trying to find the entire set of achievable values of the objective function. [sent-186, score-0.344]
65 This problem is known as multiobjective linear programming and has been previously studied by a small community of operations researchers under the umbrella subject of multiobjective optimization (Branke et al.). [sent-189, score-0.178]
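We do not reproduce an MOLP solver here; the following sketch only illustrates the underlying idea on a toy problem: the set of achievable objective vectors of a linear program can be outer-approximated by maximizing along a fixed set of directions, the same device used by the hull approximation above.

```python
# Toy example: outer-approximate {Cx : Ax <= b, x >= 0} by scalarized LPs.
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 1.0]])          # toy feasible region: x1 + x2 <= 1, x >= 0
b = np.array([1.0])
C = np.array([[1.0, 0.0],           # two objectives: u1 = x1, u2 = x2
              [0.0, 1.0]])

angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)

offsets = []
for d in directions:
    res = linprog(c=-(C.T @ d), A_ub=A, b_ub=b)   # maximize d . (Cx)
    offsets.append(float(d @ (C @ res.x)))
print(np.round(offsets, 3))   # support values of the achievable-objective set
```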
66 4 Computing correlated equilibria of sets Our generalized algorithm requires an equilibrium-filter function Feq . [sent-195, score-0.268]
67 Note that requiring Feq to return a closed convex set disqualifies Nash equilibria and their refinements. [sent-203, score-0.304]
68 Due to the availability of cheap talk, reasonable choices for Feq include correlated equilibria (CE), ε-CE, or a coalition-resistant variant of CE. [sent-204, score-0.268]
69 Constructing Feq is more complicated than computing the equilibria for a stage game, so we describe below how to target CE. [sent-206, score-0.456]
70 For a normal-form game the set of correlated equilibria can be determined by taking the intersection of a set of halfspaces (linear inequality constraints) (Greenwald & Hall, 2003). [sent-207, score-0.6]
71 Each variable of these halfspaces represents the probability that a particular joint action is chosen (via a shared random variable), and each halfspace represents a rationality constraint that a player told to take one action would not want to switch to another action. [sent-208, score-0.752]
72 There are Σi |Ai|(|Ai| − 1) such rationality constraints (where |Ai| is the number of actions player i can take). [sent-209, score-0.53]
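For the normal-form case just described, the halfspace construction can be written as a small linear program. This sketch uses a standard example game (chicken) and picks the CE maximizing total utility, one arbitrary choice among many; the paper's extension, described next, replaces the scalar payoffs with points from successor feasible-sets.

```python
# Standard correlated-equilibrium LP for a 2-player normal-form game.
import numpy as np
from scipy.optimize import linprog

# Payoffs for the game of "chicken"; rows = player 1's action, cols = player 2's.
u1 = np.array([[6.0, 2.0],
               [7.0, 0.0]])
u2 = np.array([[6.0, 7.0],
               [2.0, 0.0]])

n1, n2 = u1.shape
num_vars = n1 * n2                       # p(a1, a2), flattened row-major

rows = []
for a in range(n1):                      # player 1: told a, must not prefer a'
    for a_prime in range(n1):
        if a_prime == a:
            continue
        row = np.zeros(num_vars)
        for b2 in range(n2):
            row[a * n2 + b2] = u1[a_prime, b2] - u1[a, b2]   # deviation gain <= 0
        rows.append(row)
for bcol in range(n2):                   # player 2: told b, must not prefer b'
    for b_prime in range(n2):
        if b_prime == bcol:
            continue
        row = np.zeros(num_vars)
        for a1 in range(n1):
            row[a1 * n2 + bcol] = u2[a1, b_prime] - u2[a1, bcol]
        rows.append(row)

A_ub = np.array(rows)                    # here 4 rows, matching sum_i |Ai|(|Ai|-1)
b_ub = np.zeros(len(rows))
A_eq = np.ones((1, num_vars))
b_eq = np.array([1.0])
c = -(u1 + u2).flatten()                 # pick the CE maximizing total utility

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print(res.x.reshape(n1, n2).round(3))    # a correlated-equilibrium distribution
```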
73 Instead we have a feasible-set of possible outcomes for each joint action Q(s, a) and a threat function Fth . [sent-211, score-0.234]
74 Recall that when following a policy to achieve a desired payoff, not only must a joint action be given, but also subsequent payoffs for each successor state. [sent-212, score-0.399]
75 Thus the halfspace variables must not only specify probabilities over joint actions but also the subsequent payoffs (a probability distribution over the extreme points of each successor feasible-set). [sent-213, score-0.3]
76 Luckily, a mixture of probability distributions is still a probability distribution, so our final halfspaces now have Σa |Q(s, a)| variables (we still have the same number of halfspaces with the same meaning as before). [sent-214, score-0.214]
77 At the end of the day we do not want feasible probabilities over successor states; we want the utility-vectors afforded by them. [sent-215, score-0.189]
78 1) Our feasible-set approximation scheme was carefully constructed so that it would not permit backtracking, maintaining monotonicity (all other steps of the backup are exact). [sent-220, score-0.264]
79 After all feasible-sets shrink by less than ε1 we could modify the game by giving a bonus reward less than ε1 to each player in each state (equal to that state’s shrinkage). [sent-222, score-0.765]
80 This modified game would then have converged exactly (and thus would have a perfectly achievable feasible-set as proved by Murray and Gordon). [sent-223, score-0.318]
81 Any joint-policy of the modified game will yield at most ε1/(1 − γ) more than the same joint-policy of our original game; thus all utilities of our original game are off by at most ε1/(1 − γ). [sent-224, score-0.763]
82 2) Our equilibrium filter function Feq is required to be monotonic and thus will never underestimate if operating on overestimates (this is why we require monotonicity of Feq). [sent-227, score-0.334]
83 Thus our algorithm maintains the four crucial properties and terminates with all exact equilibria (as per conservative backups) while containing no equilibrium with error greater than ε1/(1 − γ). [sent-230, score-0.641]
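The ε1/(1 − γ) figure used throughout is just the discounted accumulation of a per-step, per-state perturbation of at most ε1 (our paraphrase of the argument above):

```latex
\sum_{t=0}^{\infty} \gamma^{t}\,\epsilon_1 \;=\; \frac{\epsilon_1}{1-\gamma}, \qquad 0 \le \gamma < 1 .
```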
84 6 Empirical results We implemented a version of our algorithm targeting exact correlated equilibrium using grim trigger threats (defection is punished to the maximum degree possible by all other players, even at one’s own expense). [sent-231, score-0.708]
85 The grim trigger threat reduces to a two-person zero-sum game where the defector receives their normal reward and all other players receive the opposite reward. [sent-232, score-0.627]
86 Because the other players receive the same reward in this game they can be viewed as a single entity. [sent-233, score-0.396]
87 Note that grim trigger threats can be computed separately before the main algorithm is run. [sent-235, score-0.213]
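A sketch of how such a threat value could be computed as the maximin value of a two-player zero-sum matrix game via a linear program; the payoff matrix is illustrative, standing in for the defector's rewards against the coalition of punishers.

```python
# Maximin (security) value of a zero-sum matrix game via linear programming.
import numpy as np
from scipy.optimize import linprog

M = np.array([[1.0, -1.0],        # defector's payoff: rows = defector, cols = punishers
              [-2.0, 2.0]])
m, n = M.shape

# Variables: mixed strategy x over rows, plus the game value v.
# Maximize v subject to x^T M >= v for every punisher column.
c = np.zeros(m + 1); c[-1] = -1.0
A_ub = np.hstack([-M.T, np.ones((n, 1))])       # v - sum_i x_i M[i, j] <= 0
b_ub = np.zeros(n)
A_eq = np.zeros((1, m + 1)); A_eq[0, :m] = 1.0  # probabilities sum to 1
b_eq = np.array([1.0])
bounds = [(0.0, 1.0)] * m + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print("threat value:", round(-res.fun, 3), "defector strategy:", res.x[:m].round(3))
```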
88 We used an existing implementation (…, 1995) to compute the convex hull of our feasible-sets and to determine the normals of the set’s facets. [sent-239, score-0.216]
89 In other words, when computing the Pareto frontier during the backup the algorithm relies on no points except those of the previous step’s Pareto frontier. [sent-243, score-0.287]
90 Thus computing only the Pareto frontier at each iteration is not an approximation, but an exact simplification. [sent-244, score-0.218]
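For completeness, a small helper (our own illustration, not the paper's code) for extracting the Pareto frontier of a finite point set:

```python
# A point is kept unless some other point is at least as good for every player
# and strictly better for at least one.
from typing import List, Tuple

def pareto_frontier(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    def dominates(p, q):
        return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))
    return [p for p in points if not any(dominates(q, p) for q in points)]

pts = [(1.0, -0.5), (2.0, -1.0), (0.0, 0.0), (0.5, -0.6)]
print(pareto_frontier(pts))   # drops (0.5, -0.6), which (1.0, -0.5) dominates
```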
91 We tested our algorithm on a number of problems with known closed form solutions, including the breakup game (Figure 4). [sent-245, score-0.349]
92 We also tested the algorithm on a suite of random games varying across the number of states, number of players, number of actions, number of successor states (stochasticity of the game), coarseness of approximation, and density of rewards. [sent-246, score-0.217]
93 Figure 4: A visualization of feasible-sets for the terminal state and player 1’s state of the breakup game at various iterations (0, 10, 20, 30, 40, 50) of the dynamic program. [sent-249, score-0.937]
94 Figure 5: Statistics from a random game (100 states, 2 players, 2 actions each, with a fixed ε1), plotted over iterations. [sent-260, score-0.176]
95 C) Better approximations converge more each iteration (the coarser approximations have a longer tail); however, due to the additional computational cost, the 12-hyperplane approximation converged quickest (in total wall time). [sent-265, score-0.176]
96 1 Limitations Our approach is overkill when the feasible-sets are one dimensional (line segments) (as when the game is zero-sum, or agents share a reward function), because CE-Q learning will converge to the correct solution without additional overhead. [sent-271, score-0.433]
97 However, despite scaling linearly with the number of states, the multiobjective linear program for computing the equilibrium hull scales very poorly. [sent-274, score-0.544]
98 The MOLP remains tractable only up to about 15 joint actions (which results in a few hundred variables and a few dozen constraints, depending on feasible-set size). [sent-275, score-0.182]
99 Finding correlated equilibria in general sum stochastic games (Technical Report). [sent-355, score-0.376]
100 Multi-robot negotiation: Approximating the set of subgame perfect equilibria in general-sum stochastic games. [sent-361, score-0.276]
wordName wordTfidf (topN-words)
[('player', 0.439), ('equilibrium', 0.301), ('feq', 0.267), ('equilibria', 0.231), ('game', 0.225), ('backup', 0.187), ('bellman', 0.145), ('murray', 0.126), ('agents', 0.125), ('players', 0.122), ('hull', 0.121), ('successor', 0.114), ('utility', 0.112), ('policy', 0.111), ('halfspaces', 0.107), ('threat', 0.107), ('xau', 0.107), ('pareto', 0.1), ('frontier', 0.1), ('gordon', 0.098), ('achievable', 0.093), ('targeting', 0.093), ('actions', 0.091), ('breakup', 0.089), ('molp', 0.089), ('multiobjective', 0.089), ('threats', 0.089), ('utilities', 0.088), ('exit', 0.086), ('action', 0.079), ('backups', 0.078), ('feasible', 0.075), ('polytope', 0.073), ('littman', 0.073), ('hyperplanes', 0.072), ('rewards', 0.071), ('bargaining', 0.071), ('grim', 0.071), ('exact', 0.064), ('games', 0.063), ('normals', 0.057), ('shoham', 0.057), ('iteration', 0.054), ('fth', 0.054), ('trigger', 0.053), ('branke', 0.053), ('contracts', 0.053), ('dagstuhl', 0.053), ('fbs', 0.053), ('greenwald', 0.053), ('grenager', 0.053), ('insure', 0.053), ('steuer', 0.053), ('state', 0.052), ('reward', 0.049), ('joint', 0.048), ('reinforcement', 0.047), ('payoffs', 0.047), ('rmax', 0.047), ('stochastic', 0.045), ('conservative', 0.045), ('approximation', 0.044), ('tractable', 0.043), ('dynamic', 0.042), ('lter', 0.041), ('hyperplane', 0.04), ('foreach', 0.04), ('au', 0.04), ('states', 0.04), ('nash', 0.038), ('contraction', 0.038), ('wall', 0.038), ('cyclic', 0.038), ('terminal', 0.038), ('convex', 0.038), ('correctness', 0.037), ('correlated', 0.037), ('rational', 0.036), ('targeted', 0.036), ('adbase', 0.036), ('complaint', 0.036), ('ibfi', 0.036), ('isbell', 0.036), ('reisner', 0.036), ('schloss', 0.036), ('closed', 0.035), ('stopping', 0.035), ('june', 0.035), ('solution', 0.034), ('guaranteeing', 0.033), ('monotonicity', 0.033), ('program', 0.033), ('ai', 0.033), ('nal', 0.032), ('progress', 0.032), ('terminate', 0.031), ('lopez', 0.031), ('liam', 0.031), ('achievability', 0.031), ('talk', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 221 nips-2009-Solving Stochastic Games
Author: Liam M. Dermed, Charles L. Isbell
Abstract: Solving multi-agent reinforcement learning problems has proven difficult because of the lack of tractable algorithms. We provide the first approximation algorithm which solves stochastic games with cheap-talk to within absolute error of the optimal game-theoretic solution, in time polynomial in 1/ . Our algorithm extends Murray’s and Gordon’s (2007) modified Bellman equation which determines the set of all possible achievable utilities; this provides us a truly general framework for multi-agent learning. Further, we empirically validate our algorithm and find the computational cost to be orders of magnitude less than what the theory predicts. 1
2 0.34254947 232 nips-2009-Strategy Grafting in Extensive Games
Author: Kevin Waugh, Nolan Bard, Michael Bowling
Abstract: Extensive games are often used to model the interactions of multiple agents within an environment. Much recent work has focused on increasing the size of an extensive game that can be feasibly solved. Despite these improvements, many interesting games are still too large for such techniques. A common approach for computing strategies in these large games is to first employ an abstraction technique to reduce the original game to an abstract game that is of a manageable size. This abstract game is then solved and the resulting strategy is played in the original game. Most top programs in recent AAAI Computer Poker Competitions use this approach. The trend in this competition has been that strategies found in larger abstract games tend to beat strategies found in smaller abstract games. These larger abstract games have more expressive strategy spaces and therefore contain better strategies. In this paper we present a new method for computing strategies in large games. This method allows us to compute more expressive strategies without increasing the size of abstract games that we are required to solve. We demonstrate the power of the approach experimentally in both small and large games, while also providing a theoretical justification for the resulting improvement. 1
3 0.27309218 156 nips-2009-Monte Carlo Sampling for Regret Minimization in Extensive Games
Author: Marc Lanctot, Kevin Waugh, Martin Zinkevich, Michael Bowling
Abstract: Sequential decision-making with multiple agents and imperfect information is commonly modeled as an extensive game. One efficient method for computing Nash equilibria in large, zero-sum, imperfect information games is counterfactual regret minimization (CFR). In the domain of poker, CFR has proven effective, particularly when using a domain-specific augmentation involving chance outcome sampling. In this paper, we describe a general family of domain-independent CFR sample-based algorithms called Monte Carlo counterfactual regret minimization (MCCFR) of which the original and poker-specific versions are special cases. We start by showing that MCCFR performs the same regret updates as CFR on expectation. Then, we introduce two sampling schemes: outcome sampling and external sampling, showing that both have bounded overall regret with high probability. Thus, they can compute an approximate equilibrium using self-play. Finally, we prove a new tighter bound on the regret for the original CFR algorithm and relate this new bound to MCCFR’s bounds. We show empirically that, although the sample-based algorithms require more iterations, their lower cost per iteration can lead to dramatically faster convergence in various games. 1
4 0.27262634 161 nips-2009-Nash Equilibria of Static Prediction Games
Author: Michael Brückner, Tobias Scheffer
Abstract: The standard assumption of identically distributed training and test data is violated when an adversary can exercise some control over the generation of the test data. In a prediction game, a learner produces a predictive model while an adversary may alter the distribution of input data. We study single-shot prediction games in which the cost functions of learner and adversary are not necessarily antagonistic. We identify conditions under which the prediction game has a unique Nash equilibrium, and derive algorithms that will find the equilibrial prediction models. In a case study, we explore properties of Nash-equilibrial prediction models for email spam filtering empirically. 1
5 0.14536609 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering
Author: Samuel R. Bulò, Marcello Pelillo
Abstract: Hypergraph clustering refers to the process of extracting maximally coherent groups from a set of objects using high-order (rather than pairwise) similarities. Traditional approaches to this problem are based on the idea of partitioning the input data into a user-defined number of classes, thereby obtaining the clusters as a by-product of the partitioning process. In this paper, we provide a radically different perspective to the problem. In contrast to the classical approach, we attempt to provide a meaningful formalization of the very notion of a cluster and we show that game theory offers an attractive and unexplored perspective that serves well our purpose. Specifically, we show that the hypergraph clustering problem can be naturally cast into a non-cooperative multi-player “clustering game”, whereby the notion of a cluster is equivalent to a classical game-theoretic equilibrium concept. From the computational viewpoint, we show that the problem of finding the equilibria of our clustering game is equivalent to locally optimizing a polynomial function over the standard simplex, and we provide a discrete-time dynamics to perform this optimization. Experiments are presented which show the superiority of our approach over state-of-the-art hypergraph clustering techniques.
6 0.1185194 53 nips-2009-Complexity of Decentralized Control: Special Cases
7 0.10777552 209 nips-2009-Robust Value Function Approximation Using Bilinear Programming
8 0.106911 113 nips-2009-Improving Existing Fault Recovery Policies
9 0.10026295 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
10 0.091430642 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference
11 0.088075429 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
12 0.086444736 48 nips-2009-Bootstrapping from Game Tree Search
13 0.086432591 242 nips-2009-The Infinite Partially Observable Markov Decision Process
14 0.080415703 14 nips-2009-A Parameter-free Hedging Algorithm
15 0.075098813 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability
16 0.06513118 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining
17 0.057424493 16 nips-2009-A Smoothed Approximate Linear Program
18 0.055849712 12 nips-2009-A Generalized Natural Actor-Critic Algorithm
19 0.048703048 54 nips-2009-Compositionality of optimal control laws
20 0.04857048 134 nips-2009-Learning to Explore and Exploit in POMDPs
topicId topicWeight
[(0, -0.144), (1, 0.086), (2, 0.185), (3, -0.217), (4, -0.279), (5, 0.059), (6, -0.016), (7, -0.03), (8, 0.077), (9, -0.004), (10, -0.245), (11, -0.339), (12, -0.173), (13, -0.07), (14, -0.043), (15, -0.246), (16, -0.073), (17, 0.027), (18, 0.022), (19, -0.04), (20, 0.049), (21, 0.006), (22, 0.038), (23, 0.025), (24, 0.013), (25, 0.034), (26, -0.012), (27, 0.028), (28, -0.025), (29, 0.013), (30, 0.027), (31, 0.07), (32, -0.022), (33, -0.027), (34, -0.01), (35, -0.023), (36, 0.018), (37, -0.002), (38, 0.022), (39, -0.003), (40, -0.055), (41, -0.069), (42, -0.03), (43, 0.023), (44, 0.023), (45, 0.002), (46, -0.017), (47, 0.056), (48, -0.042), (49, -0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.94934636 221 nips-2009-Solving Stochastic Games
Author: Liam M. Dermed, Charles L. Isbell
Abstract: Solving multi-agent reinforcement learning problems has proven difficult because of the lack of tractable algorithms. We provide the first approximation algorithm which solves stochastic games with cheap-talk to within absolute error of the optimal game-theoretic solution, in time polynomial in 1/ . Our algorithm extends Murray’s and Gordon’s (2007) modified Bellman equation which determines the set of all possible achievable utilities; this provides us a truly general framework for multi-agent learning. Further, we empirically validate our algorithm and find the computational cost to be orders of magnitude less than what the theory predicts. 1
2 0.9390803 232 nips-2009-Strategy Grafting in Extensive Games
Author: Kevin Waugh, Nolan Bard, Michael Bowling
Abstract: Extensive games are often used to model the interactions of multiple agents within an environment. Much recent work has focused on increasing the size of an extensive game that can be feasibly solved. Despite these improvements, many interesting games are still too large for such techniques. A common approach for computing strategies in these large games is to first employ an abstraction technique to reduce the original game to an abstract game that is of a manageable size. This abstract game is then solved and the resulting strategy is played in the original game. Most top programs in recent AAAI Computer Poker Competitions use this approach. The trend in this competition has been that strategies found in larger abstract games tend to beat strategies found in smaller abstract games. These larger abstract games have more expressive strategy spaces and therefore contain better strategies. In this paper we present a new method for computing strategies in large games. This method allows us to compute more expressive strategies without increasing the size of abstract games that we are required to solve. We demonstrate the power of the approach experimentally in both small and large games, while also providing a theoretical justification for the resulting improvement. 1
3 0.84849578 156 nips-2009-Monte Carlo Sampling for Regret Minimization in Extensive Games
Author: Marc Lanctot, Kevin Waugh, Martin Zinkevich, Michael Bowling
Abstract: Sequential decision-making with multiple agents and imperfect information is commonly modeled as an extensive game. One efficient method for computing Nash equilibria in large, zero-sum, imperfect information games is counterfactual regret minimization (CFR). In the domain of poker, CFR has proven effective, particularly when using a domain-specific augmentation involving chance outcome sampling. In this paper, we describe a general family of domain-independent CFR sample-based algorithms called Monte Carlo counterfactual regret minimization (MCCFR) of which the original and poker-specific versions are special cases. We start by showing that MCCFR performs the same regret updates as CFR on expectation. Then, we introduce two sampling schemes: outcome sampling and external sampling, showing that both have bounded overall regret with high probability. Thus, they can compute an approximate equilibrium using self-play. Finally, we prove a new tighter bound on the regret for the original CFR algorithm and relate this new bound to MCCFR’s bounds. We show empirically that, although the sample-based algorithms require more iterations, their lower cost per iteration can lead to dramatically faster convergence in various games. 1
4 0.8011058 161 nips-2009-Nash Equilibria of Static Prediction Games
Author: Michael Brückner, Tobias Scheffer
Abstract: The standard assumption of identically distributed training and test data is violated when an adversary can exercise some control over the generation of the test data. In a prediction game, a learner produces a predictive model while an adversary may alter the distribution of input data. We study single-shot prediction games in which the cost functions of learner and adversary are not necessarily antagonistic. We identify conditions under which the prediction game has a unique Nash equilibrium, and derive algorithms that will find the equilibrial prediction models. In a case study, we explore properties of Nash-equilibrial prediction models for email spam filtering empirically. 1
5 0.54230613 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering
Author: Samuel R. Bulò, Marcello Pelillo
Abstract: Hypergraph clustering refers to the process of extracting maximally coherent groups from a set of objects using high-order (rather than pairwise) similarities. Traditional approaches to this problem are based on the idea of partitioning the input data into a user-defined number of classes, thereby obtaining the clusters as a by-product of the partitioning process. In this paper, we provide a radically different perspective to the problem. In contrast to the classical approach, we attempt to provide a meaningful formalization of the very notion of a cluster and we show that game theory offers an attractive and unexplored perspective that serves well our purpose. Specifically, we show that the hypergraph clustering problem can be naturally cast into a non-cooperative multi-player “clustering game”, whereby the notion of a cluster is equivalent to a classical game-theoretic equilibrium concept. From the computational viewpoint, we show that the problem of finding the equilibria of our clustering game is equivalent to locally optimizing a polynomial function over the standard simplex, and we provide a discrete-time dynamics to perform this optimization. Experiments are presented which show the superiority of our approach over state-of-the-art hypergraph clustering techniques.
6 0.47941598 48 nips-2009-Bootstrapping from Game Tree Search
7 0.36284414 113 nips-2009-Improving Existing Fault Recovery Policies
8 0.34257585 16 nips-2009-A Smoothed Approximate Linear Program
9 0.34148863 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
10 0.3320753 14 nips-2009-A Parameter-free Hedging Algorithm
11 0.3224718 209 nips-2009-Robust Value Function Approximation Using Bilinear Programming
12 0.31205955 53 nips-2009-Complexity of Decentralized Control: Special Cases
13 0.30288896 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining
14 0.27880141 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
15 0.25500083 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability
16 0.25258586 12 nips-2009-A Generalized Natural Actor-Critic Algorithm
17 0.25105095 242 nips-2009-The Infinite Partially Observable Markov Decision Process
18 0.24902153 134 nips-2009-Learning to Explore and Exploit in POMDPs
19 0.24701022 54 nips-2009-Compositionality of optimal control laws
20 0.23063612 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control
topicId topicWeight
[(21, 0.01), (24, 0.511), (25, 0.045), (35, 0.053), (36, 0.072), (39, 0.031), (58, 0.042), (61, 0.031), (71, 0.058), (81, 0.013), (86, 0.058), (91, 0.011)]
simIndex simValue paperId paperTitle
1 0.96330416 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness
Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright
Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function f ∗ : Rp → R that has an additive decomposition of the form ∗ ∗ f ∗ (X1 , . . . , Xp ) = j∈S hj (Xj ), where each component function hj lies in some class H of “smooth” functions, and S ⊂ {1, . . . , p} is an unknown subset with cardinality s = |S|. Given n i.i.d. observations of f ∗ (X) corrupted with additive white Gaussian noise where the covariate vectors (X1 , X2 , X3 , ..., Xp ) are drawn with i.i.d. components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L2 (P) error. Our main result is a lower bound on the minimax rate that scales as max s log(p/s) , s ǫ2 (H) . The first term reflects the sample size required for n n performing subset selection, and is independent of the function class H. The second term s ǫ2 (H) is an s-dimensional estimation term corresponding to the sample size required for n estimating a sum of s univariate functions, each chosen from the function class H. It depends linearly on the sparsity index s but is independent of the global dimension p. As a special case, if H corresponds to functions that are m-times differentiable (an mth -order Sobolev space), then the s-dimensional estimation term takes the form sǫ2 (H) ≍ s n−2m/(2m+1) . Either of n the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.
2 0.95510846 181 nips-2009-Online Learning of Assignments
Author: Matthew Streeter, Daniel Golovin, Andreas Krause
Abstract: Which ads should we display in sponsored search in order to maximize our revenue? How should we dynamically rank information sources to maximize the value of the ranking? These applications exhibit strong diminishing returns: Redundancy decreases the marginal utility of each ad or information source. We show that these and other problems can be formalized as repeatedly selecting an assignment of items to positions to maximize a sequence of monotone submodular functions that arrive one by one. We present an efficient algorithm for this general problem and analyze it in the no-regret model. Our algorithm possesses strong theoretical guarantees, such as a performance ratio that converges to the optimal constant of 1 − 1/e. We empirically evaluate our algorithm on two real-world online optimization problems on the web: ad allocation with submodular utilities, and dynamically ranking blogs to detect information cascades. 1
3 0.93461955 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable
Author: Liwei Wang
Abstract: We study pool-based active learning in the presence of noise, i.e. the agnostic setting. Previous works have shown that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples that no active learning algorithm can have advantage. In this paper, we propose intuitively reasonable sufficient conditions under which agnostic active learning algorithm is strictly superior to passive supervised learning. We show that under some noise condition, if the Bayesian classification boundary and the underlying distribution are smooth to a finite order, active learning achieves polynomial improvement in the label complexity; if the boundary and the distribution are infinitely smooth, the improvement is exponential.
same-paper 4 0.92776299 221 nips-2009-Solving Stochastic Games
Author: Liam M. Dermed, Charles L. Isbell
Abstract: Solving multi-agent reinforcement learning problems has proven difficult because of the lack of tractable algorithms. We provide the first approximation algorithm which solves stochastic games with cheap-talk to within absolute error of the optimal game-theoretic solution, in time polynomial in 1/ . Our algorithm extends Murray’s and Gordon’s (2007) modified Bellman equation which determines the set of all possible achievable utilities; this provides us a truly general framework for multi-agent learning. Further, we empirically validate our algorithm and find the computational cost to be orders of magnitude less than what the theory predicts. 1
5 0.89529288 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields
Author: Yang Wang, Gholamreza Haffari, Shaojun Wang, Greg Mori
Abstract: We propose a novel information theoretic approach for semi-supervised learning of conditional random fields that defines a training objective to combine the conditional likelihood on labeled data and the mutual information on unlabeled data. In contrast to previous minimum conditional entropy semi-supervised discriminative learning methods, our approach is grounded on a more solid foundation, the rate distortion theory in information theory. We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm to defy the combinatorial explosion of terms in the sum over label configurations. Our experimental results show the rate distortion approach outperforms standard l2 regularization, minimum conditional entropy regularization as well as maximum conditional entropy regularization on both multi-class classification and sequence labeling problems. 1
6 0.77351177 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
7 0.75097054 45 nips-2009-Beyond Convexity: Online Submodular Minimization
8 0.70380563 161 nips-2009-Nash Equilibria of Static Prediction Games
9 0.70163363 156 nips-2009-Monte Carlo Sampling for Regret Minimization in Extensive Games
10 0.67210799 12 nips-2009-A Generalized Natural Actor-Critic Algorithm
11 0.66128647 239 nips-2009-Submodularity Cuts and Applications
12 0.6341334 14 nips-2009-A Parameter-free Hedging Algorithm
13 0.63250661 91 nips-2009-Fast, smooth and adaptive regression in metric spaces
14 0.6268509 178 nips-2009-On Stochastic and Worst-case Models for Investing
15 0.62629879 122 nips-2009-Label Selection on Graphs
16 0.61688757 94 nips-2009-Fast Learning from Non-i.i.d. Observations
17 0.6160109 55 nips-2009-Compressed Least-Squares Regression
18 0.60868204 232 nips-2009-Strategy Grafting in Extensive Games
19 0.59975547 24 nips-2009-Adapting to the Shifting Intent of Search Queries
20 0.59086746 199 nips-2009-Ranking Measures and Loss Functions in Learning to Rank