nips nips2009 nips2009-156 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marc Lanctot, Kevin Waugh, Martin Zinkevich, Michael Bowling
Abstract: Sequential decision-making with multiple agents and imperfect information is commonly modeled as an extensive game. One efficient method for computing Nash equilibria in large, zero-sum, imperfect information games is counterfactual regret minimization (CFR). In the domain of poker, CFR has proven effective, particularly when using a domain-specific augmentation involving chance outcome sampling. In this paper, we describe a general family of domain-independent CFR sample-based algorithms called Monte Carlo counterfactual regret minimization (MCCFR) of which the original and poker-specific versions are special cases. We start by showing that MCCFR performs the same regret updates as CFR on expectation. Then, we introduce two sampling schemes: outcome sampling and external sampling, showing that both have bounded overall regret with high probability. Thus, they can compute an approximate equilibrium using self-play. Finally, we prove a new tighter bound on the regret for the original CFR algorithm and relate this new bound to MCCFR’s bounds. We show empirically that, although the sample-based algorithms require more iterations, their lower cost per iteration can lead to dramatically faster convergence in various games. 1
Reference: text
sentIndex sentText sentNum sentScore
1 One efficient method for computing Nash equilibria in large, zero-sum, imperfect information games is counterfactual regret minimization (CFR). [sent-9, score-0.743]
2 In this paper, we describe a general family of domain-independent CFR sample-based algorithms called Monte Carlo counterfactual regret minimization (MCCFR) of which the original and poker-specific versions are special cases. [sent-11, score-0.568]
3 We start by showing that MCCFR performs the same regret updates as CFR on expectation. [sent-12, score-0.285]
4 Then, we introduce two sampling schemes: outcome sampling and external sampling, showing that both have bounded overall regret with high probability. [sent-13, score-0.433]
5 Finally, we prove a new tighter bound on the regret for the original CFR algorithm and relate this new bound to MCCFR’s bounds. [sent-15, score-0.335]
6 Multiple techniques [1, 2] now exist for solving games with up to 10^12 game states, which is about four orders of magnitude larger than the previous state-of-the-art of using sequence-form linear programs [3]. [sent-21, score-0.269]
7 Counterfactual regret minimization (CFR) [1] is one such recent technique that exploits the fact that the time-averaged strategy profile of regret minimizing algorithms converges to a Nash equilibrium. [sent-22, score-0.679]
8 The key insight is the fact that minimizing per-information set counterfactual regret results in minimizing overall regret. [sent-23, score-0.577]
9 However, the vanilla form presented by Zinkevich and colleagues requires the entire game tree to be traversed on each iteration. [sent-24, score-0.497]
10 An additional disadvantage of CFR is that it requires the opponent’s policy to be known, which makes it unsuitable for online regret minimization in an extensive game. [sent-30, score-0.44]
11 Online regret minimization in extensive games is possible using online convex programming techniques, such as Lagrangian Hedging [5], but these techniques can require costly optimization routines at every time step. [sent-31, score-0.517]
12 In this paper, we present a general framework for sampling in counterfactual regret minimization. [sent-32, score-0.547]
13 We then introduce two additional members of this family: outcome-sampling, where only a single playing of the game is sampled on each iteration; and external-sampling, which samples chance nodes and the opponent’s actions. [sent-35, score-0.363]
14 Furthermore, since outcome-sampling does not need knowledge of the opponent’s strategy beyond samples of play from the strategy, we describe how it can be used for online regret minimization. [sent-38, score-0.365]
15 2 Background An extensive game is a general model of sequential decision-making with imperfect information. [sent-40, score-0.295]
16 As with perfect information games (such as Chess or Checkers), extensive games consist primarily of a game tree: each non-terminal node has an associated player (possibly chance) that makes the decision at that node, and each terminal node has associated utilities for the players. [sent-41, score-0.962]
17 Additionally, game states are partitioned into information sets where a player cannot distinguish between two states in the same information set. [sent-42, score-0.491]
18 A finite extensive game with imperfect information has the following components: • A finite set N of players. [sent-46, score-0.295]
19 P (h) is the player who takes an action after the history h. [sent-53, score-0.456]
20 • For each player i ∈ N ∪ {c} a partition I_i of {h ∈ H : P(h) = i} with the property that A(h) = A(h′) whenever h and h′ are in the same member of the partition. [sent-55, score-0.35]
21 For Ii ∈ Ii we denote by A(Ii ) the set A(h) and by P (Ii ) the player P (h) for any h ∈ Ii . [sent-56, score-0.325]
22 Ii is the information partition of player i; a set Ii ∈ Ii is an information set of player i. [sent-57, score-0.65]
23 • For each player i ∈ N a utility function u_i from the terminal states Z to the reals R. [sent-62, score-0.546]
24 Define ∆_{u,i} = max_z u_i(z) − min_z u_i(z) to be the range of utilities to player i. [sent-64, score-0.512]
25 Furthermore, we will assume perfect recall, a restriction on the information partitions such that a player can always distinguish between game states where they previously took a different action or were previously in a different information set. [sent-66, score-0.548]
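To make these components concrete, here is a minimal sketch of one possible plain-Python encoding of an extensive game's primitives; the names (ExtensiveGame, player_fn, infoset, and so on) are hypothetical illustrations, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Action = str
History = Tuple[Action, ...]   # a sequence of actions, starting from the empty history ()

@dataclass
class ExtensiveGame:
    players: List[int]                                    # the finite set N; chance is denoted "c"
    actions: Callable[[History], List[Action]]            # A(h): actions available after history h
    player_fn: Callable[[History], object]                # P(h): acting player, an element of N or "c"
    is_terminal: Callable[[History], bool]                # membership in the set Z of terminal histories
    utility: Callable[[History, int], float]              # u_i(z) for terminal z
    infoset: Callable[[History], str]                     # key of the information set containing h
    chance_probs: Callable[[History], Dict[Action, float]] = None  # chance's fixed distribution at chance nodes

def utility_range(game: ExtensiveGame, terminals: List[History], i: int) -> float:
    """Delta_{u,i}: the range max_z u_i(z) - min_z u_i(z) of utilities to player i."""
    values = [game.utility(z, i) for z in terminals]
    return max(values) - min(values)
```

The CFR and MCCFR sketches later in this section are written against this same hypothetical interface.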
26 2.1 Strategies and Equilibria A strategy of player i, σ_i, in an extensive game is a function that assigns a distribution over A(I_i) to each I_i ∈ I_i. [sent-68, score-0.624]
27 We denote Σi as the set of all strategies for player i. [sent-69, score-0.353]
28 Let π^σ(h) be the probability of history h occurring if all players choose actions according to σ. [sent-75, score-0.264]
29 Here, π_i^σ(h) is the contribution to this probability from player i when playing according to σ. [sent-77, score-0.367]
30 Let π_{-i}^σ(h) be the product of all players' contributions (including chance) except that of player i. [sent-78, score-0.325]
31 Using this notation, we can define the expected payoff for player i as u_i(σ) = Σ_{h∈Z} u_i(h) π^σ(h). [sent-82, score-0.509]
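The reach probability factors into per-player contributions, which is what the π_i^σ and π_{-i}^σ notation captures; a small sketch, assuming (for illustration only) that a history is represented as the list of (actor, probability-of-the-chosen-action) pairs:

```python
from math import prod

def reach_probabilities(steps, i):
    """steps: list of (actor, prob) pairs along a history h, where actor is a player id or "c" for
    chance, and prob is the probability the acting player (or chance) assigns to the action taken.
    Returns (pi_sigma, pi_i, pi_minus_i), with pi_sigma == pi_i * pi_minus_i."""
    pi_i = prod(p for actor, p in steps if actor == i)
    pi_minus_i = prod(p for actor, p in steps if actor != i)   # includes chance's contribution
    return pi_i * pi_minus_i, pi_i, pi_minus_i

# e.g. a history on which chance, player 0 and player 1 each acted once:
print(reach_probabilities([("c", 0.5), (0, 0.3), (1, 0.8)], i=0))  # (0.12, 0.3, 0.4)
```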
32 The best-response value for player i is the value of that strategy, b_i(σ_{-i}) = max_{σ_i∈Σ_i} u_i(σ_i, σ_{-i}). [sent-84, score-0.406]
33 An ε-Nash equilibrium is an approximation of a Nash equilibrium; it is a strategy profile σ that satisfies ∀i ∈ N: u_i(σ) + ε ≥ max_{σ_i∈Σ_i} u_i(σ_i, σ_{-i}) (1). If ε = 0 then σ is a Nash equilibrium: no player has any incentive to deviate as they are all playing best responses. [sent-85, score-0.66]
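Given the profile's expected payoffs and the best-response values, the ε of Equation 1 is just the largest per-player gap; a minimal sketch, assuming both quantities have already been computed (for example by a best-response tree walk):

```python
def nash_gap(u, br):
    """u[i]  : expected payoff u_i(sigma) under the profile sigma
    br[i] : best-response value b_i(sigma_{-i})
    Returns the smallest epsilon for which sigma is an epsilon-Nash equilibrium (Eq. 1)."""
    return max(max(br[i] - u[i], 0.0) for i in u)

print(nash_gap({1: 0.1, 2: -0.1}, {1: 0.25, 2: 0.05}))  # 0.15
```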
34 Let σ_i^t be the strategy used by player i on round t. [sent-90, score-0.383]
35 The average overall regret of player i at time T is: R_i^T = (1/T) max_{σ_i^*∈Σ_i} Σ_{t=1}^T ( u_i(σ_i^*, σ_{-i}^t) − u_i(σ^t) ) (2). Moreover, define σ̄_i^T to be the average strategy for player i from time 1 to T. [sent-91, score-1.185]
36 An algorithm for selecting σ_i^t for player i is regret minimizing if player i's average overall regret (regardless of the sequence σ_{-i}^t) goes to zero as t goes to infinity. [sent-95, score-1.269]
37 Moreover, an algorithm's bound on the average overall regret bounds the convergence rate of the approximation. [sent-97, score-0.315]
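As an illustration of why regret minimization yields equilibria, the following self-contained sketch (not from the paper) runs regret-matching self-play in a small zero-sum matrix game; the average strategies approach an equilibrium and each player's average overall regret shrinks roughly as O(1/√T).

```python
# Biased rock-paper-scissors payoffs for player 1 (player 2 receives the negative).
A = [[0, -1, 2], [1, 0, -1], [-2, 1, 0]]

def regret_matching(regrets):
    """Play in proportion to positive cumulative regret; uniform if no regret is positive."""
    pos = [max(r, 0.0) for r in regrets]
    s = sum(pos)
    return [p / s for p in pos] if s > 0 else [1.0 / len(regrets)] * len(regrets)

def run(T=20000):
    n = len(A)
    reg1, reg2 = [0.0] * n, [0.0] * n
    avg1, avg2 = [0.0] * n, [0.0] * n
    for _ in range(T):
        s1, s2 = regret_matching(reg1), regret_matching(reg2)
        for a in range(n):
            avg1[a] += s1[a] / T
            avg2[a] += s2[a] / T
        u1 = [sum(A[a][b] * s2[b] for b in range(n)) for a in range(n)]   # value of each pure action
        u2 = [-sum(A[a][b] * s1[a] for a in range(n)) for b in range(n)]
        v1 = sum(s1[a] * u1[a] for a in range(n))
        v2 = sum(s2[b] * u2[b] for b in range(n))
        for a in range(n):
            reg1[a] += u1[a] - v1
            reg2[a] += u2[a] - v2
    return avg1, avg2, max(reg1) / T, max(reg2) / T

print(run())   # average strategies approach an equilibrium; average regrets are close to zero
```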
38 Zinkevich and colleagues [1] used the above approach in their counterfactual regret algorithm (CFR). [sent-98, score-0.556]
39 The basic idea of CFR is that overall regret can be bounded by the sum of positive per-information-set immediate counterfactual regret. [sent-99, score-0.562]
40 Define σ_(I→a) to be a strategy profile identical to σ except that player i always chooses action a from information set I. [sent-101, score-0.415]
41 Let Z_I be the subset of all terminal histories where a prefix of the history is in the set I; for z ∈ Z_I let z[I] be that prefix. [sent-102, score-0.341]
42 Define the counterfactual value v_i(σ, I) as v_i(σ, I) = Σ_{z∈Z_I} π_{-i}^σ(z[I]) π^σ(z[I], z) u_i(z) (4). [sent-104, score-0.261]
43 The immediate counterfactual regret is then R_{i,imm}^T(I) = max_{a∈A(I)} R_{i,imm}^T(I, a), where R_{i,imm}^T(I, a) = (1/T) Σ_{t=1}^T ( v_i(σ_(I→a)^t, I) − v_i(σ^t, I) ) (5). Let x^+ = max(x, 0). [sent-105, score-0.643]
44 Theorem 2 [1, Theorem 3] R_i^T ≤ Σ_{I∈I_i} R_{i,imm}^{T,+}(I). Using regret-matching² the positive per-information-set immediate counterfactual regrets can be driven to zero, thus driving average overall regret to zero. [sent-107, score-0.654]
45 This results in an average overall regret bound [1, Theorem 4]: R_i^T ≤ ∆_{u,i} |I_i| √|A_i| / √T, where |A_i| = max_{h:P(h)=i} |A(h)|. [sent-108, score-0.34]
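Solving the bound for T gives the number of iterations sufficient to guarantee a target average regret; a small hedged calculation with illustrative constants (not the paper's games):

```python
from math import ceil, sqrt

def iterations_for_regret(delta_u, n_infosets, n_actions, eps):
    """Smallest T with Delta_{u,i} * |I_i| * sqrt(|A_i|) / sqrt(T) <= eps, per the bound above."""
    return ceil((delta_u * n_infosets * sqrt(n_actions) / eps) ** 2)

# illustrative constants only
print(iterations_for_regret(delta_u=2.0, n_infosets=10**4, n_actions=3, eps=0.01))
```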
46 This result suggests an algorithm for computing equilibria via self-play, which we will refer to as vanilla CFR. [sent-110, score-0.279]
47 The idea is to traverse the game tree computing counterfactual values using Equation 4. [sent-111, score-0.442]
48 Given a strategy, these values define regret terms for each player for each of their information sets using Equation 5. [sent-112, score-0.61]
49 These regret values accumulate and determine the strategies at the next iteration using the regret-matching formula. [sent-113, score-0.37]
50 Since both players are regret minimizing, Theorem 1 applies, and so computing the average strategy profile σ̄^t gives us an approximate Nash equilibrium. [sent-114, score-0.43]
51 However, as previously mentioned, vanilla CFR requires a complete traversal of the game tree on each iteration, which prohibits its use in many large games. [sent-116, score-0.468]
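For concreteness, here is a minimal sketch of the vanilla CFR traversal just described, written against the hypothetical ExtensiveGame interface sketched in the Background section; it illustrates Equations 4 and 5 plus regret-matching, and is not the authors' implementation.

```python
from collections import defaultdict

def regret_matching(regrets, actions):
    """sigma(a|I) proportional to positive cumulative regret; uniform when none is positive."""
    pos = {a: max(regrets[a], 0.0) for a in actions}
    s = sum(pos.values())
    return {a: (pos[a] / s if s > 0 else 1.0 / len(actions)) for a in actions}

def cfr(game, h, i, pi_i, pi_minus_i, R, S):
    """Returns player i's expected value at history h; accumulates cumulative regrets R[I][a]
    (weighted by pi_minus_i, as in Eqs. 4-5) and average-strategy weights S[I][a]."""
    if game.is_terminal(h):
        return game.utility(h, i)
    if game.player_fn(h) == "c":   # chance node: weight children by the fixed chance probabilities
        return sum(p * cfr(game, h + (a,), i, pi_i, pi_minus_i * p, R, S)
                   for a, p in game.chance_probs(h).items())
    I, acts = game.infoset(h), game.actions(h)
    sigma = regret_matching(R[I], acts)
    child, v = {}, 0.0
    for a in acts:
        if game.player_fn(h) == i:
            child[a] = cfr(game, h + (a,), i, pi_i * sigma[a], pi_minus_i, R, S)
        else:
            child[a] = cfr(game, h + (a,), i, pi_i, pi_minus_i * sigma[a], R, S)
        v += sigma[a] * child[a]
    if game.player_fn(h) == i:
        for a in acts:
            R[I][a] += pi_minus_i * (child[a] - v)   # counterfactual regret of a at this history
            S[I][a] += pi_i * sigma[a]               # numerator of the average strategy
    return v

def solve(game, iterations):
    R = defaultdict(lambda: defaultdict(float))
    S = defaultdict(lambda: defaultdict(float))
    for _ in range(iterations):
        for i in game.players:
            cfr(game, (), i, 1.0, 1.0, R, S)
    return R, S   # the average strategy at I is S[I] normalised over its actions
```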
52 3 Monte Carlo CFR The key to our approach is to avoid traversing the entire game tree on each iteration while still having the immediate counterfactual regrets be unchanged in expectation. [sent-118, score-0.578]
53 On each iteration we will sample one of these blocks and only consider the terminal histories in that block. [sent-125, score-0.282]
54 Let q_j > 0 be the probability of considering block Q_j for the current iteration (where Σ_{j=1}^r q_j = 1). [sent-126, score-0.317]
55 The sampled counterfactual value when updating block j is: ṽ_i(σ, I|j) = Σ_{z∈Q_j∩Z_I} (1/q(z)) u_i(z) π_{-i}^σ(z[I]) π^σ(z[I], z) (6). Selecting a set Q along with the sampling probabilities defines a complete sample-based CFR algorithm. [sent-130, score-0.467]
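The 1/q(z) factor is an importance-sampling correction, which is why the sampled counterfactual value matches the exact one in expectation; a tiny self-contained numeric check with toy numbers (not an actual game):

```python
import random

# Each terminal history z in Z_I contributes w(z) = u_i(z) * pi_minus_i(z[I]) * pi(z[I], z) to Eq. 4.
w = {"z1": 0.20, "z2": -0.35, "z3": 0.50}
blocks = [["z1"], ["z2", "z3"]]        # a partition Q = {Q_0, Q_1} of those histories
q = [0.25, 0.75]                       # block-sampling probabilities, summing to one

exact = sum(w.values())                # the counterfactual value v_i(sigma, I)

def sampled_value():
    j = random.choices(range(len(q)), weights=q)[0]
    return sum(w[z] for z in blocks[j]) / q[j]      # Eq. 6 restricted to the sampled block

estimate = sum(sampled_value() for _ in range(200_000)) / 200_000
print(exact, estimate)                 # the Monte Carlo average approaches the exact value
```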
56 Rather than doing full game tree traversals the algorithm samples one of these blocks, and then examines only the terminal histories in that block. [sent-131, score-0.464]
57 For example, Q can consist of a single block containing all terminal histories, with q_1 = 1. [sent-134, score-0.287]
58 In this case, sampled counterfactual value is equal to counterfactual value, and we have vanilla CFR. [sent-135, score-0.706]
59 Suppose instead we choose each block to include all terminal histories with the same sequence of chance outcomes (where the probability of a chance outcome is independent of players' actions). (Footnote 2: Regret-matching selects actions with probability proportional to their positive regret.) [sent-136, score-0.643]
60 Hence q_j is the product of the probabilities in the sampled sequence of chance outcomes (which cancels with these same probabilities in the definition of counterfactual value) and we have Zinkevich and colleagues' chance-sampled CFR. [sent-141, score-0.517]
61 Sampled counterfactual value was designed to match counterfactual value in expectation. [sent-142, score-0.448]
62 We show this here, and then use this fact to prove a probabilistic bound on the algorithm’s average overall regret in the next section. [sent-143, score-0.34]
63 We sample a block and for each information set that contains a prefix of a terminal history in the block we compute the sampled immediate counterfactual regrets of each action, r̃(I, a) = ṽ_i(σ_(I→a)^t, I) − ṽ_i(σ^t, I). [sent-147, score-0.766]
64 On each iteration we sample one terminal history and only update each information set along that history. [sent-154, score-0.279]
65 The sampling probabilities q_j must specify a distribution over terminal histories. [sent-155, score-0.294]
66 The single history is then traversed forward (to compute each player's probability of playing to reach each prefix of the history, π_i^σ(h)) and backward (to compute each player's probability of playing the remaining actions of the history, π_i^σ(h, z)). [sent-160, score-0.278]
67 During the backward traversal, the sampled counterfactual regrets at each visited information set are computed (and added to the total regret). [sent-161, score-0.34]
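A sketch of that per-history update is below; the representation of the sampled history as a list of step dictionaries is an assumption for illustration, not the paper's data structure. With perfect recall each of player i's information sets appears at most once along a single history, so a flat result dictionary suffices.

```python
from math import prod

def outcome_sampling_regrets(path, u_z, i):
    """Sampled counterfactual regrets for player i from one sampled terminal history z.

    path: list of steps along z, each a dict with keys "actor" (player id or "c"),
    "infoset", "sigma" ({action: sigma(a|I)}; chance probabilities at chance steps),
    "taken" (the action on z) and "sample_prob" (probability the sampler chose it).
    Returns {infoset: {action: sampled regret}} for player i's infosets along z.
    """
    q_z = prod(s["sample_prob"] for s in path)              # q(z): probability of sampling z
    # tail[k] = pi^sigma(z[I_k], z): probability of the remaining actions from step k onward
    tail = [0.0] * (len(path) + 1)
    tail[len(path)] = 1.0
    for k in range(len(path) - 1, -1, -1):
        tail[k] = tail[k + 1] * path[k]["sigma"][path[k]["taken"]]
    regrets = {}
    pi_minus_i = 1.0                                        # pi^sigma_{-i}(z[I]) built up going forward
    for k, step in enumerate(path):
        if step["actor"] == i:
            w = u_z * pi_minus_i / q_z
            # off the sampled action, sigma_(I->a) cannot produce z, so its sampled value is 0
            r = {a: -w * tail[k] for a in step["sigma"]}
            r[step["taken"]] = w * (tail[k + 1] - tail[k])
            regrets[step["infoset"]] = r
        else:
            pi_minus_i *= step["sigma"][step["taken"]]
    return regrets
```

In practice the sample_prob entries on player i's own steps would come from an exploratory sampling policy, for instance mixing σ with the uniform policy using the δ parameter discussed next; that choice changes only q(z), not the unbiasedness of the estimate.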
68 Therefore, we can use outcome-sampling MCCFR for online regret minimization. [sent-164, score-0.307]
69 By balancing the regret caused by exploration with the regret caused by a small δ (see Section 4 for how MCCFR’s bound depends upon δ), we can bound the average overall regret as long as the number of playings T is known in advance. [sent-166, score-0.935]
70 This effectively mimics the approach taken by Exp3 for regret minimization in normal-form games [9]. [sent-167, score-0.42]
71 The block Q_τ then contains all terminal histories z consistent with τ; that is, if ha is a prefix of z with h ∈ I for some I ∈ I_{-i}, then τ(I) = a. [sent-176, score-0.287]
72 The algorithm iterates over i ∈ N, for each performing a post-order depth-first traversal of the game tree, sampling actions at each history h where P(h) ≠ i (storing these choices so the same actions are sampled at all h in the same information set). [sent-179, score-0.518]
73 For each such visited information set the sampled counterfactual regrets are computed (and added to the total regrets). [sent-181, score-0.34]
74 r̃(I, a) = (1 − σ(a|I)) Σ_{z∈Q∩Z_I} u_i(z) π_i^σ(z[I]a, z) (11). Note that the summation can be easily computed during the traversal by always maintaining a weighted sum of the utilities of all terminal histories rooted at the current history. [sent-182, score-0.383]
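A minimal sketch of this traversal for the updating player i is given below, against the same hypothetical ExtensiveGame interface as the vanilla CFR sketch; for brevity it re-samples an action at every visited history, whereas the paper fixes one sampled action per opponent information set per iteration.

```python
import random
from collections import defaultdict

def regret_matching(regrets, actions):
    pos = {a: max(regrets[a], 0.0) for a in actions}
    s = sum(pos.values())
    return {a: (pos[a] / s if s > 0 else 1.0 / len(actions)) for a in actions}

def external_sampling(game, h, i, R, S):
    """One external-sampling traversal for updating player i; returns a sampled value of h for i."""
    if game.is_terminal(h):
        return game.utility(h, i)
    if game.player_fn(h) == "c":
        probs = game.chance_probs(h)
        a = random.choices(list(probs), weights=list(probs.values()))[0]
        return external_sampling(game, h + (a,), i, R, S)
    I, acts = game.infoset(h), game.actions(h)
    sigma = regret_matching(R[I], acts)
    if game.player_fn(h) != i:
        for a in acts:                      # accumulate the non-traversing player's average strategy
            S[I][a] += sigma[a]
        a = random.choices(acts, weights=[sigma[a] for a in acts])[0]
        return external_sampling(game, h + (a,), i, R, S)
    # player i's node: explore every action, then apply the sampled regret update
    child = {a: external_sampling(game, h + (a,), i, R, S) for a in acts}
    v = sum(sigma[a] * child[a] for a in acts)
    for a in acts:
        R[I][a] += child[a] - v
    return v

# usage sketch: R = defaultdict(lambda: defaultdict(float)); S likewise; then, per iteration,
# call external_sampling(game, (), i, R, S) once for each player i.
```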
75 4 Theoretical Analysis We now present regret bounds for members of the MCCFR family, starting with an improved bound for vanilla CFR that depends more explicitly on the exact structure of the extensive game. [sent-183, score-0.639]
76 Let a_i be a subsequence of a history such that it contains only player i's actions in that history, and let A_i be the set of all such player-i action subsequences. [sent-184, score-0.898]
77 Let I_i(a_i) be the set of all information sets where player i's action sequence up to that information set is a_i. [sent-185, score-0.396]
78 Define the M-value for player i of the game to be M_i = Σ_{a_i∈A_i} √|I_i(a_i)|. [sent-186, score-0.53]
79 We can strengthen vanilla CFR’s regret bound using this constant, which also appears in the bounds for the MCCFR variants. [sent-188, score-0.544]
80 Theorem 3 When using vanilla CFR for player i, R_i^T ≤ ∆_{u,i} M_i √|A_i| / √T. [sent-189, score-0.559]
81 We now turn our attention to the MCCFR family of algorithms, for which we can provide probabilistic regret bounds. [sent-190, score-0.312]
82 Theorem 4 For any p ∈ (0, 1], when using external-sampling MCCFR, with probability at least 1 − p, average overall regret is bounded by R_i^T ≤ (1 + √2/√p) ∆_{u,i} M_i √|A_i| / √T. [sent-192, score-0.315]
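Since the two bounds differ only by the (1 + √2/√p) factor, the number of iterations they imply for a target regret differs by a constant, and the per-iteration cost is what separates the algorithms in practice; a small hedged calculation with illustrative constants (not the paper's games):

```python
from math import ceil, sqrt

def iterations_needed(delta_u, M, n_actions, eps, blowup=1.0):
    """Smallest T with blowup * Delta_{u,i} * M_i * sqrt(|A_i|) / sqrt(T) <= eps."""
    return ceil((blowup * delta_u * M * sqrt(n_actions) / eps) ** 2)

# illustrative constants only
delta_u, M, nA, eps, p = 2.0, 5000.0, 3, 0.05, 0.05
T_vanilla = iterations_needed(delta_u, M, nA, eps)
T_external = iterations_needed(delta_u, M, nA, eps, blowup=1 + sqrt(2) / sqrt(p))
# external sampling needs a constant factor more iterations, but each iteration touches far
# fewer nodes, so total work (iterations x cost per iteration) can still be much lower
print(T_vanilla, T_external, T_external / T_vanilla)   # ratio approximately (1 + sqrt(2/p))**2
```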
83 t_vc, t_os, and t_es are the average wall-clock time per iteration for vanilla CFR, outcome-sampling MCCFR, and external-sampling MCCFR. [sent-203, score-0.303]
84 Goofspiel [11] is a bidding card game where players have a hand of cards numbered 1 to N , and take turns secretly bidding on the top point-valued card in a point card stack using cards in their hands. [sent-205, score-0.401]
85 Our version is less informational: players only find out the result of each bid and not which cards were used to bid, and the player with the highest total points wins. [sent-206, score-0.459]
86 Princess and Monster is a pursuit-evasion game on a graph, with neither player ever knowing the location of the other. [sent-211, score-0.491]
87 While all of these games have imperfect information and are of roughly similar size, they are a diverse set of games, varying both in the degree (the ratio of the number of information sets to the number of histories) and nature (whether due to chance or opponent actions) of imperfect information. [sent-215, score-0.363]
88 We used outcome-sampling MCCFR, external-sampling MCCFR, and vanilla CFR to compute an approximate equilibrium in each of the four games. [sent-217, score-0.307]
89 A value of δ = 0.6 worked well across all games; this is interesting because the regret bound suggests δ should be as large as possible. [sent-221, score-0.31]
90 With vanilla CFR we used an implementation trick called pruning to dramatically reduce the work done per iteration. [sent-223, score-0.286]
91 When updating one player’s regrets, if the other player has no probability of reaching the current history, the entire subtree at that history can be pruned for the current iteration, with no effect on the resulting computation. [sent-224, score-0.424]
92 We also used vanilla CFR without pruning to see the effects of pruning in our domains. [sent-225, score-0.338]
93 Figure 1 shows the results of all four algorithms on all four domains, plotting approximation quality as a function of the number of nodes of the game tree the algorithm touched while computing. [sent-226, score-0.302]
94 Furthermore, for any fixed game (and degree of confidence that the bound holds), the algorithms' average overall regret is falling at the same rate, O(1/√T), meaning that only their short-term rather than asymptotic performance will differ. [sent-229, score-0.506]
95 Note that pruning is key to vanilla CFR being at all practical in these games. [sent-235, score-0.286]
96 Outcome-sampling performing worse than vanilla CFR in LTTT raises the question of what specific game properties might favor one algorithm over another, and whether it might be possible to incorporate additional game-specific constants into the bounds. [sent-269, score-0.566]
97 We also introduced two sampling schemes: outcome-sampling, which samples only a single history for each iteration, and external-sampling, which samples a deterministic strategy for the opponent and chance. [sent-271, score-0.269]
98 In addition to presenting a tighter bound for vanilla CFR, we presented regret bounds for both sampling variants, which showed that external sampling with high probability gives an asymptotic computational time improvement over vanilla CFR. [sent-272, score-0.871]
99 Second, using outcome-sampled MCCFR as a general online regret minimizing technique in extensive games (when the opponents’ strategy is not known or controlled) appears promising. [sent-277, score-0.562]
100 Monte carlo sampling for regret minimization in extensive games. [sent-321, score-0.449]
wordName wordTfidf (topN-words)
[('cfr', 0.593), ('player', 0.325), ('mccfr', 0.314), ('regret', 0.285), ('vanilla', 0.234), ('counterfactual', 0.224), ('game', 0.166), ('terminal', 0.14), ('qj', 0.116), ('games', 0.103), ('histories', 0.102), ('history', 0.099), ('zinkevich', 0.096), ('regrets', 0.092), ('players', 0.087), ('ui', 0.081), ('chance', 0.078), ('actions', 0.078), ('extensive', 0.075), ('nash', 0.074), ('opponent', 0.074), ('equilibrium', 0.073), ('touched', 0.07), ('strategy', 0.058), ('imperfect', 0.054), ('pruning', 0.052), ('poker', 0.051), ('pre', 0.05), ('colleagues', 0.047), ('goofspiel', 0.046), ('ii', 0.046), ('equilibria', 0.045), ('block', 0.045), ('playing', 0.042), ('alberta', 0.042), ('pro', 0.041), ('iteration', 0.04), ('fc', 0.039), ('ai', 0.039), ('sampling', 0.038), ('vi', 0.037), ('traversal', 0.035), ('bowling', 0.035), ('lanctot', 0.035), ('lttt', 0.035), ('nodes', 0.033), ('tree', 0.033), ('le', 0.032), ('minimization', 0.032), ('action', 0.032), ('zi', 0.031), ('waugh', 0.031), ('overall', 0.03), ('ri', 0.028), ('strategies', 0.028), ('martin', 0.027), ('family', 0.027), ('policy', 0.026), ('member', 0.025), ('mi', 0.025), ('outcome', 0.025), ('perfect', 0.025), ('utilities', 0.025), ('bound', 0.025), ('sampled', 0.024), ('cards', 0.024), ('immediate', 0.023), ('bid', 0.023), ('monster', 0.023), ('outweighs', 0.023), ('tes', 0.023), ('tos', 0.023), ('traversals', 0.023), ('tvc', 0.023), ('payoff', 0.022), ('online', 0.022), ('supplemental', 0.021), ('michael', 0.021), ('card', 0.02), ('hedging', 0.02), ('bidding', 0.02), ('subsuming', 0.02), ('cancels', 0.02), ('princess', 0.02), ('carmelo', 0.02), ('exploitability', 0.02), ('johanson', 0.02), ('members', 0.02), ('minimizing', 0.019), ('outcomes', 0.019), ('traverse', 0.019), ('maxh', 0.019), ('carlo', 0.019), ('monte', 0.019), ('probabilities', 0.018), ('accumulate', 0.017), ('external', 0.017), ('traversed', 0.017), ('edmonton', 0.016), ('ic', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 156 nips-2009-Monte Carlo Sampling for Regret Minimization in Extensive Games
Author: Marc Lanctot, Kevin Waugh, Martin Zinkevich, Michael Bowling
Abstract: Sequential decision-making with multiple agents and imperfect information is commonly modeled as an extensive game. One efficient method for computing Nash equilibria in large, zero-sum, imperfect information games is counterfactual regret minimization (CFR). In the domain of poker, CFR has proven effective, particularly when using a domain-specific augmentation involving chance outcome sampling. In this paper, we describe a general family of domain-independent CFR sample-based algorithms called Monte Carlo counterfactual regret minimization (MCCFR) of which the original and poker-specific versions are special cases. We start by showing that MCCFR performs the same regret updates as CFR on expectation. Then, we introduce two sampling schemes: outcome sampling and external sampling, showing that both have bounded overall regret with high probability. Thus, they can compute an approximate equilibrium using self-play. Finally, we prove a new tighter bound on the regret for the original CFR algorithm and relate this new bound to MCCFR’s bounds. We show empirically that, although the sample-based algorithms require more iterations, their lower cost per iteration can lead to dramatically faster convergence in various games. 1
2 0.32193363 232 nips-2009-Strategy Grafting in Extensive Games
Author: Kevin Waugh, Nolan Bard, Michael Bowling
Abstract: Extensive games are often used to model the interactions of multiple agents within an environment. Much recent work has focused on increasing the size of an extensive game that can be feasibly solved. Despite these improvements, many interesting games are still too large for such techniques. A common approach for computing strategies in these large games is to first employ an abstraction technique to reduce the original game to an abstract game that is of a manageable size. This abstract game is then solved and the resulting strategy is played in the original game. Most top programs in recent AAAI Computer Poker Competitions use this approach. The trend in this competition has been that strategies found in larger abstract games tend to beat strategies found in smaller abstract games. These larger abstract games have more expressive strategy spaces and therefore contain better strategies. In this paper we present a new method for computing strategies in large games. This method allows us to compute more expressive strategies without increasing the size of abstract games that we are required to solve. We demonstrate the power of the approach experimentally in both small and large games, while also providing a theoretical justification for the resulting improvement. 1
3 0.27309218 221 nips-2009-Solving Stochastic Games
Author: Liam M. Dermed, Charles L. Isbell
Abstract: Solving multi-agent reinforcement learning problems has proven difficult because of the lack of tractable algorithms. We provide the first approximation algorithm which solves stochastic games with cheap-talk to within absolute error of the optimal game-theoretic solution, in time polynomial in 1/ . Our algorithm extends Murray’s and Gordon’s (2007) modified Bellman equation which determines the set of all possible achievable utilities; this provides us a truly general framework for multi-agent learning. Further, we empirically validate our algorithm and find the computational cost to be orders of magnitude less than what the theory predicts. 1
4 0.18920895 14 nips-2009-A Parameter-free Hedging Algorithm
Author: Kamalika Chaudhuri, Yoav Freund, Daniel J. Hsu
Abstract: We study the problem of decision-theoretic online learning (DTOL). Motivated by practical applications, we focus on DTOL when the number of actions is very large. Previous algorithms for learning in this framework have a tunable learning rate parameter, and a barrier to using online-learning in practical applications is that it is not understood how to set this parameter optimally, particularly when the number of actions is large. In this paper, we offer a clean solution by proposing a novel and completely parameter-free algorithm for DTOL. We introduce a new notion of regret, which is more natural for applications with a large number of actions. We show that our algorithm achieves good performance with respect to this new notion of regret; in addition, it also achieves performance close to that of the best bounds achieved by previous algorithms with optimally-tuned parameters, according to previous notions of regret. 1
5 0.17299151 161 nips-2009-Nash Equilibria of Static Prediction Games
Author: Michael Brückner, Tobias Scheffer
Abstract: The standard assumption of identically distributed training and test data is violated when an adversary can exercise some control over the generation of the test data. In a prediction game, a learner produces a predictive model while an adversary may alter the distribution of input data. We study single-shot prediction games in which the cost functions of learner and adversary are not necessarily antagonistic. We identify conditions under which the prediction game has a unique Nash equilibrium, and derive algorithms that will find the equilibrial prediction models. In a case study, we explore properties of Nash-equilibrial prediction models for email spam filtering empirically. 1
6 0.11671203 178 nips-2009-On Stochastic and Worst-case Models for Investing
7 0.1150302 45 nips-2009-Beyond Convexity: Online Submodular Minimization
8 0.093368597 24 nips-2009-Adapting to the Shifting Intent of Search Queries
9 0.088627093 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering
10 0.074421853 48 nips-2009-Bootstrapping from Game Tree Search
11 0.069407724 220 nips-2009-Slow Learners are Fast
12 0.061074883 113 nips-2009-Improving Existing Fault Recovery Policies
13 0.055515625 53 nips-2009-Complexity of Decentralized Control: Special Cases
14 0.054504473 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
15 0.042587407 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs
16 0.042066894 157 nips-2009-Multi-Label Prediction via Compressed Sensing
17 0.038504716 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
18 0.037466604 181 nips-2009-Online Learning of Assignments
19 0.036612365 76 nips-2009-Efficient Learning using Forward-Backward Splitting
20 0.034893878 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization
topicId topicWeight
[(0, -0.108), (1, 0.116), (2, 0.144), (3, -0.187), (4, -0.109), (5, 0.106), (6, -0.048), (7, 0.006), (8, 0.038), (9, 0.016), (10, -0.309), (11, -0.35), (12, -0.196), (13, -0.066), (14, -0.04), (15, -0.144), (16, -0.058), (17, 0.062), (18, 0.043), (19, -0.027), (20, 0.04), (21, 0.048), (22, -0.013), (23, -0.004), (24, 0.075), (25, 0.035), (26, 0.054), (27, -0.003), (28, 0.054), (29, 0.004), (30, -0.044), (31, 0.052), (32, 0.042), (33, -0.05), (34, -0.089), (35, 0.037), (36, 0.044), (37, 0.006), (38, -0.065), (39, 0.017), (40, 0.039), (41, 0.018), (42, 0.027), (43, -0.075), (44, -0.012), (45, -0.027), (46, 0.038), (47, 0.038), (48, -0.048), (49, 0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.9724164 156 nips-2009-Monte Carlo Sampling for Regret Minimization in Extensive Games
Author: Marc Lanctot, Kevin Waugh, Martin Zinkevich, Michael Bowling
Abstract: Sequential decision-making with multiple agents and imperfect information is commonly modeled as an extensive game. One efficient method for computing Nash equilibria in large, zero-sum, imperfect information games is counterfactual regret minimization (CFR). In the domain of poker, CFR has proven effective, particularly when using a domain-specific augmentation involving chance outcome sampling. In this paper, we describe a general family of domain-independent CFR sample-based algorithms called Monte Carlo counterfactual regret minimization (MCCFR) of which the original and poker-specific versions are special cases. We start by showing that MCCFR performs the same regret updates as CFR on expectation. Then, we introduce two sampling schemes: outcome sampling and external sampling, showing that both have bounded overall regret with high probability. Thus, they can compute an approximate equilibrium using self-play. Finally, we prove a new tighter bound on the regret for the original CFR algorithm and relate this new bound to MCCFR’s bounds. We show empirically that, although the sample-based algorithms require more iterations, their lower cost per iteration can lead to dramatically faster convergence in various games. 1
2 0.90023428 232 nips-2009-Strategy Grafting in Extensive Games
Author: Kevin Waugh, Nolan Bard, Michael Bowling
Abstract: Extensive games are often used to model the interactions of multiple agents within an environment. Much recent work has focused on increasing the size of an extensive game that can be feasibly solved. Despite these improvements, many interesting games are still too large for such techniques. A common approach for computing strategies in these large games is to first employ an abstraction technique to reduce the original game to an abstract game that is of a manageable size. This abstract game is then solved and the resulting strategy is played in the original game. Most top programs in recent AAAI Computer Poker Competitions use this approach. The trend in this competition has been that strategies found in larger abstract games tend to beat strategies found in smaller abstract games. These larger abstract games have more expressive strategy spaces and therefore contain better strategies. In this paper we present a new method for computing strategies in large games. This method allows us to compute more expressive strategies without increasing the size of abstract games that we are required to solve. We demonstrate the power of the approach experimentally in both small and large games, while also providing a theoretical justification for the resulting improvement. 1
3 0.7758764 221 nips-2009-Solving Stochastic Games
Author: Liam M. Dermed, Charles L. Isbell
Abstract: Solving multi-agent reinforcement learning problems has proven difficult because of the lack of tractable algorithms. We provide the first approximation algorithm which solves stochastic games with cheap-talk to within absolute error of the optimal game-theoretic solution, in time polynomial in 1/ . Our algorithm extends Murray’s and Gordon’s (2007) modified Bellman equation which determines the set of all possible achievable utilities; this provides us a truly general framework for multi-agent learning. Further, we empirically validate our algorithm and find the computational cost to be orders of magnitude less than what the theory predicts. 1
4 0.70305026 161 nips-2009-Nash Equilibria of Static Prediction Games
Author: Michael Brückner, Tobias Scheffer
Abstract: The standard assumption of identically distributed training and test data is violated when an adversary can exercise some control over the generation of the test data. In a prediction game, a learner produces a predictive model while an adversary may alter the distribution of input data. We study single-shot prediction games in which the cost functions of learner and adversary are not necessarily antagonistic. We identify conditions under which the prediction game has a unique Nash equilibrium, and derive algorithms that will find the equilibrial prediction models. In a case study, we explore properties of Nash-equilibrial prediction models for email spam filtering empirically. 1
5 0.55871546 14 nips-2009-A Parameter-free Hedging Algorithm
Author: Kamalika Chaudhuri, Yoav Freund, Daniel J. Hsu
Abstract: We study the problem of decision-theoretic online learning (DTOL). Motivated by practical applications, we focus on DTOL when the number of actions is very large. Previous algorithms for learning in this framework have a tunable learning rate parameter, and a barrier to using online-learning in practical applications is that it is not understood how to set this parameter optimally, particularly when the number of actions is large. In this paper, we offer a clean solution by proposing a novel and completely parameter-free algorithm for DTOL. We introduce a new notion of regret, which is more natural for applications with a large number of actions. We show that our algorithm achieves good performance with respect to this new notion of regret; in addition, it also achieves performance close to that of the best bounds achieved by previous algorithms with optimally-tuned parameters, according to previous notions of regret. 1
6 0.41571543 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering
7 0.39512932 48 nips-2009-Bootstrapping from Game Tree Search
8 0.37466821 24 nips-2009-Adapting to the Shifting Intent of Search Queries
9 0.26139861 220 nips-2009-Slow Learners are Fast
10 0.25878063 178 nips-2009-On Stochastic and Worst-case Models for Investing
11 0.24285658 45 nips-2009-Beyond Convexity: Online Submodular Minimization
12 0.198388 249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale
13 0.19449033 76 nips-2009-Efficient Learning using Forward-Backward Splitting
14 0.19376113 113 nips-2009-Improving Existing Fault Recovery Policies
15 0.18920352 234 nips-2009-Streaming k-means approximation
16 0.18664315 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization
17 0.18552963 181 nips-2009-Online Learning of Assignments
18 0.17972948 141 nips-2009-Local Rules for Global MAP: When Do They Work ?
19 0.16148303 193 nips-2009-Potential-Based Agnostic Boosting
20 0.15406907 53 nips-2009-Complexity of Decentralized Control: Special Cases
topicId topicWeight
[(24, 0.139), (25, 0.035), (35, 0.037), (36, 0.099), (39, 0.046), (49, 0.297), (57, 0.041), (58, 0.038), (61, 0.022), (71, 0.08), (81, 0.017), (86, 0.036), (91, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.81888878 156 nips-2009-Monte Carlo Sampling for Regret Minimization in Extensive Games
Author: Marc Lanctot, Kevin Waugh, Martin Zinkevich, Michael Bowling
Abstract: Sequential decision-making with multiple agents and imperfect information is commonly modeled as an extensive game. One efficient method for computing Nash equilibria in large, zero-sum, imperfect information games is counterfactual regret minimization (CFR). In the domain of poker, CFR has proven effective, particularly when using a domain-specific augmentation involving chance outcome sampling. In this paper, we describe a general family of domain-independent CFR sample-based algorithms called Monte Carlo counterfactual regret minimization (MCCFR) of which the original and poker-specific versions are special cases. We start by showing that MCCFR performs the same regret updates as CFR on expectation. Then, we introduce two sampling schemes: outcome sampling and external sampling, showing that both have bounded overall regret with high probability. Thus, they can compute an approximate equilibrium using self-play. Finally, we prove a new tighter bound on the regret for the original CFR algorithm and relate this new bound to MCCFR’s bounds. We show empirically that, although the sample-based algorithms require more iterations, their lower cost per iteration can lead to dramatically faster convergence in various games. 1
2 0.69483924 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test
Author: Arthur Gretton, Kenji Fukumizu, Zaïd Harchaoui, Bharath K. Sriperumbudur
Abstract: A kernel embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS) has recently been proposed, which allows the comparison of two probability measures P and Q based on the distance between their respective embeddings: for a sufficiently rich RKHS, this distance is zero if and only if P and Q coincide. In using this distance as a statistic for a test of whether two samples are from different distributions, a major difficulty arises in computing the significance threshold, since the empirical statistic has as its null distribution (where P = Q) an infinite weighted sum of χ2 random variables. Prior finite sample approximations to the null distribution include using bootstrap resampling, which yields a consistent estimate but is computationally costly; and fitting a parametric model with the low order moments of the test statistic, which can work well in practice but has no consistency or accuracy guarantees. The main result of the present work is a novel estimate of the null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q, and having lower computational cost than the bootstrap. A proof of consistency of this estimate is provided. The performance of the null distribution estimate is compared with the bootstrap and parametric approaches on an artificial example, high dimensional multivariate data, and text.
3 0.55568671 232 nips-2009-Strategy Grafting in Extensive Games
Author: Kevin Waugh, Nolan Bard, Michael Bowling
Abstract: Extensive games are often used to model the interactions of multiple agents within an environment. Much recent work has focused on increasing the size of an extensive game that can be feasibly solved. Despite these improvements, many interesting games are still too large for such techniques. A common approach for computing strategies in these large games is to first employ an abstraction technique to reduce the original game to an abstract game that is of a manageable size. This abstract game is then solved and the resulting strategy is played in the original game. Most top programs in recent AAAI Computer Poker Competitions use this approach. The trend in this competition has been that strategies found in larger abstract games tend to beat strategies found in smaller abstract games. These larger abstract games have more expressive strategy spaces and therefore contain better strategies. In this paper we present a new method for computing strategies in large games. This method allows us to compute more expressive strategies without increasing the size of abstract games that we are required to solve. We demonstrate the power of the approach experimentally in both small and large games, while also providing a theoretical justification for the resulting improvement. 1
4 0.54527265 221 nips-2009-Solving Stochastic Games
Author: Liam M. Dermed, Charles L. Isbell
Abstract: Solving multi-agent reinforcement learning problems has proven difficult because of the lack of tractable algorithms. We provide the first approximation algorithm which solves stochastic games with cheap-talk to within absolute error of the optimal game-theoretic solution, in time polynomial in 1/ . Our algorithm extends Murray’s and Gordon’s (2007) modified Bellman equation which determines the set of all possible achievable utilities; this provides us a truly general framework for multi-agent learning. Further, we empirically validate our algorithm and find the computational cost to be orders of magnitude less than what the theory predicts. 1
5 0.54256159 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields
Author: Yang Wang, Gholamreza Haffari, Shaojun Wang, Greg Mori
Abstract: We propose a novel information theoretic approach for semi-supervised learning of conditional random fields that defines a training objective to combine the conditional likelihood on labeled data and the mutual information on unlabeled data. In contrast to previous minimum conditional entropy semi-supervised discriminative learning methods, our approach is grounded on a more solid foundation, the rate distortion theory in information theory. We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm to defy the combinatorial explosion of terms in the sum over label configurations. Our experimental results show the rate distortion approach outperforms standard l2 regularization, minimum conditional entropy regularization as well as maximum conditional entropy regularization on both multi-class classification and sequence labeling problems. 1
6 0.54173476 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
7 0.53586072 45 nips-2009-Beyond Convexity: Online Submodular Minimization
8 0.52627903 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
9 0.5247106 181 nips-2009-Online Learning of Assignments
10 0.52459341 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable
11 0.51684272 239 nips-2009-Submodularity Cuts and Applications
12 0.51158637 161 nips-2009-Nash Equilibria of Static Prediction Games
13 0.50997806 12 nips-2009-A Generalized Natural Actor-Critic Algorithm
14 0.50615168 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness
15 0.49616507 14 nips-2009-A Parameter-free Hedging Algorithm
16 0.49084944 178 nips-2009-On Stochastic and Worst-case Models for Investing
17 0.49024603 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
18 0.49000174 24 nips-2009-Adapting to the Shifting Intent of Search Queries
19 0.48993814 91 nips-2009-Fast, smooth and adaptive regression in metric spaces
20 0.48903641 260 nips-2009-Zero-shot Learning with Semantic Output Codes