nips nips2007 nips2007-165 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Martin Zinkevich, Michael Johanson, Michael Bowling, Carmelo Piccione
Abstract: Extensive games are a powerful model of multiagent decision-making scenarios with incomplete information. Finding a Nash equilibrium for very large instances of these games has received a great deal of recent attention. In this paper, we describe a new technique for solving large games based on regret minimization. In particular, we introduce the notion of counterfactual regret, which exploits the degree of incomplete information in an extensive game. We show how minimizing counterfactual regret minimizes overall regret, and therefore in self-play can be used to compute a Nash equilibrium. We demonstrate this technique in the domain of poker, showing we can solve abstractions of limit Texas Hold’em with as many as $10^{12}$ states, two orders of magnitude larger than previous methods.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we describe a new technique for solving large games based on regret minimization. [sent-11, score-0.636]
2 In particular, we introduce the notion of counterfactual regret, which exploits the degree of incomplete information in an extensive game. [sent-12, score-0.539]
3 We show how minimizing counterfactual regret minimizes overall regret, and therefore in self-play can be used to compute a Nash equilibrium. [sent-13, score-0.869]
4 Solution techniques for very large extensive games have received considerable attention recently, with poker becoming a common measuring stick for performance. [sent-19, score-0.649]
5 Poker games can be modeled very naturally as an extensive game, with even small variants, such as two-player, limit Texas Hold’em, being impractically large with just under $10^{18}$ game states. [sent-20, score-0.63]
6 The representation is linear in the number of game states, rather than exponential, but considerable additional technology is still needed to handle games the size of poker. [sent-22, score-0.5]
7 Abstraction, both hand-chosen [2] and automated [3], is commonly employed to reduce the game from $10^{18}$ to a tractable number of game states (e. [sent-23, score-0.692]
8 Solving larger abstractions yields better approximate Nash equilibria in the original game, making techniques for solving larger games the focus of research in this area. [sent-27, score-0.37]
9 These techniques have been shown capable of finding approximate solutions to abstractions with as many as $10^{10}$ game states [5, 6, 7], resulting in the first significant improvement in poker programs in the past four years. [sent-29, score-0.876]
10 The technique is based on regret minimization, using a new concept called counterfactual regret. [sent-31, score-0.862]
11 We show that minimizing counterfactual regret minimizes overall regret, and therefore can be used to compute a Nash equilibrium. [sent-32, score-0.869]
12 We then present an algorithm for minimizing counterfactual regret in poker. [sent-33, score-0.833]
13 We use the algorithm to solve poker abstractions with as many as $10^{12}$ game states, two orders of magnitude larger than previous methods. [sent-34, score-0.847]
14 We also show that this translates directly into an improvement in the strength of the resulting poker playing programs. [sent-35, score-0.462]
15 We begin with a formal description of extensive games followed by an overview of regret minimization and its connections to Nash equilibria. [sent-36, score-0.762]
16 The core of an extensive game is a game tree just as in perfect information games (e. [sent-39, score-1.014]
17 Each non-terminal game state has an associated player choosing actions and every terminal state has associated payoffs for each of the players. [sent-42, score-0.722]
18 The key difference is the additional constraint of information sets, which are sets of game states that the controlling player cannot distinguish and so must choose actions for all such states with the same distribution. [sent-43, score-0.781]
19 In poker, for example, the first player to act does not know which cards the other players were dealt, and so all game states immediately following the deal where the first player holds the same cards would be in the same information set. [sent-44, score-1.176]
20 Following [Osborne and Rubinstein, p. 200], a finite extensive game with imperfect information has the following components: • A finite set N of players. [sent-47, score-0.5]
21 $P(h)$ is the player who takes an action after the history $h$. [sent-52, score-0.38]
22 $\mathcal{I}_i$ is the information partition of player $i$; a set $I_i \in \mathcal{I}_i$ is an information set of player $i$. [sent-57, score-0.68]
23 • For each player $i \in N$, a utility function $u_i$ from the terminal states $Z$ to the reals $\mathbb{R}$. [sent-58, score-0.494]
24 Define $\Delta_{u,i} = \max_z u_i(z) - \min_z u_i(z)$ to be the range of utilities to player $i$. [sent-60, score-0.48]
25 If all players can recall their previous actions and the corresponding information sets, the game is said to be one of perfect recall. [sent-62, score-0.478]
26 2.1 Strategies. A strategy $\sigma_i$ of player $i$ in an extensive game is a function that assigns a distribution over $A(I_i)$ to each $I_i \in \mathcal{I}_i$, and $\Sigma_i$ is the set of strategies for player $i$. [sent-65, score-1.204]
27 Hence, $\pi_i^{\sigma}(h)$ is the probability that, if player $i$ plays according to $\sigma$, then for all histories $h'$ that are a proper prefix of $h$ with $P(h') = i$, player $i$ takes the corresponding action in $h$. [sent-72, score-0.675]
28 The overall value to player $i$ of a strategy profile is then the expected payoff of the resulting terminal node, $u_i(\sigma) = \sum_{h \in Z} u_i(h)\, \pi^{\sigma}(h)$. [sent-75, score-0.656]
29 2.2 Nash Equilibrium. The traditional solution concept of a two-player extensive game is that of a Nash equilibrium. [sent-77, score-0.485]
30 A Nash equilibrium is a strategy profile $\sigma$ where $u_1(\sigma) \ge \max_{\sigma_1' \in \Sigma_1} u_1(\sigma_1', \sigma_2)$ and $u_2(\sigma) \ge \max_{\sigma_2' \in \Sigma_2} u_2(\sigma_1, \sigma_2')$. (1) An approximation of a Nash equilibrium, or $\epsilon$-Nash equilibrium, is a strategy profile $\sigma$ where $u_1(\sigma) + \epsilon \ge \max_{\sigma_1' \in \Sigma_1} u_1(\sigma_1', \sigma_2)$ and $u_2(\sigma) + \epsilon \ge \max_{\sigma_2' \in \Sigma_2} u_2(\sigma_1, \sigma_2')$. (2) [sent-79, score-0.376]
31 Let $\sigma_i^t$ be the strategy used by player $i$ on round $t$. [sent-83, score-0.474]
32 The average overall regret of player $i$ at time $T$ is: $R_i^T = \frac{1}{T} \max_{\sigma_i^* \in \Sigma_i} \sum_{t=1}^{T} \big( u_i(\sigma_i^*, \sigma_{-i}^t) - u_i(\sigma^t) \big)$ (3) Moreover, define $\bar{\sigma}_i^T$ to be the average strategy for player $i$ from time 1 to $T$. [sent-84, score-1.341]
33 In particular, for each information set $I \in \mathcal{I}_i$ and each $a \in A(I)$, define $\bar{\sigma}_i^T(I)(a) = \frac{\sum_{t=1}^{T} \pi_i^{\sigma^t}(I)\, \sigma^t(I)(a)}{\sum_{t=1}^{T} \pi_i^{\sigma^t}(I)}$. (4) There is a well-known connection between regret and the Nash equilibrium solution concept. [sent-86, score-0.576]
34 Theorem 2. In a zero-sum game at time $T$, if both players' average overall regret is less than $\epsilon$, then $\bar{\sigma}^T$ is a $2\epsilon$ equilibrium. [sent-87, score-0.794]
35 An algorithm for selecting $\sigma_i^t$ for player $i$ is regret minimizing if player $i$'s average overall regret (regardless of the sequence $\sigma_{-i}^t$) goes to zero as $t$ goes to infinity. [sent-88, score-1.545]
36 As a result, regret minimizing algorithms in self-play can be used as a technique for computing an approximate Nash equilibrium. [sent-89, score-0.494]
37 Moreover, an algorithm's bound on the average overall regret bounds the rate of convergence of the approximation. [sent-90, score-0.467]
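To make this connection concrete, the following is a minimal, self-contained sketch (not from the paper) of regret matching in self-play on a small zero-sum normal-form game; the game matrix, names, and iteration count are illustrative. The average strategies converge toward the game's mixed Nash equilibrium, exactly the connection Theorem 2 describes.

```python
import numpy as np

# A zero-sum 2x2 game (row player's payoffs). Its unique Nash equilibrium
# mixes (0.4, 0.6) for both players, with game value 0.2 to the row player.
A = np.array([[ 2.0, -1.0],
              [-1.0,  1.0]])

def regret_matching(cum_regret):
    """Play each action in proportion to its positive cumulative regret;
    fall back to uniform when no regret is positive."""
    pos = np.maximum(cum_regret, 0.0)
    if pos.sum() > 0:
        return pos / pos.sum()
    return np.full(len(cum_regret), 1.0 / len(cum_regret))

T = 100_000
r1 = np.zeros(2)       # cumulative regrets, row player
r2 = np.zeros(2)       # cumulative regrets, column player
s1_sum = np.zeros(2)   # accumulators for the average strategies
s2_sum = np.zeros(2)

for _ in range(T):
    s1, s2 = regret_matching(r1), regret_matching(r2)
    s1_sum += s1
    s2_sum += s2
    u1 = A @ s2          # row player's expected payoff per pure action
    u2 = -(s1 @ A)       # column player's payoffs (zero-sum)
    r1 += u1 - s1 @ u1   # regret for not having played each pure action
    r2 += u2 - s2 @ u2

print("row avg strategy:", s1_sum / T)  # -> approx [0.4, 0.6]
print("col avg strategy:", s2_sum / T)  # -> approx [0.4, 0.6]
```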
38 Traditionally, regret minimization has focused on bandit problems more akin to normal-form games. [sent-91, score-0.459]
39 Although it is conceptually possible to convert any finite extensive game to an equivalent normal-form game, the exponential increase in the size of the representation makes the use of regret algorithms on the resulting game impractical. [sent-92, score-1.234]
40 Recently, Gordon has introduced the Lagrangian Hedging (LH) family of algorithms, which can be used to minimize regret in extensive games by working with the realization plan representation [5]. [sent-93, score-0.734]
41 We also propose a regret minimization procedure that exploits the compactness of the extensive game. [sent-94, score-0.609]
42 However, our technique doesn't require the costly quadratic programming optimization needed with LH, allowing it to scale more easily while achieving even tighter regret bounds. [sent-95, score-0.463]
43 3 Counterfactual Regret. The fundamental idea of our approach is to decompose overall regret into a set of additive regret terms, which can be minimized independently. [sent-96, score-0.916]
44 In particular, we introduce a new regret concept for extensive games called counterfactual regret, which is defined on an individual information set. [sent-97, score-1.151]
45 We show that overall regret is bounded by the sum of counterfactual regret, and also show how counterfactual regret can be minimized at each information set independently. [sent-98, score-1.676]
46 As we will often be most concerned about regret when it is positive, let $R_{i,\mathrm{imm}}^{T,+}(I) = \max(R_{i,\mathrm{imm}}^{T}(I), 0)$ be the positive portion of immediate counterfactual regret. [sent-103, score-0.852]
47 Since minimizing immediate counterfactual regret minimizes the overall regret, we can find an approximate Nash equilibrium provided we can minimize the immediate counterfactual regret at each information set. [sent-106, score-1.443]
48 The key feature of immediate counterfactual regret is that it can be minimized by controlling only $\sigma_i(I)$. [sent-107, score-0.849]
49 To this end, we can use Blackwell’s algorithm for approachability to minimize this regret independently on each information set. [sent-108, score-0.449]
50 $\sigma_i^{T+1}(I)(a) = \begin{cases} R_i^{T,+}(I,a) \,/\, \sum_{a' \in A(I)} R_i^{T,+}(I,a') & \text{if } \sum_{a' \in A(I)} R_i^{T,+}(I,a') > 0, \\ 1/|A(I)| & \text{otherwise.} \end{cases}$ (8) In other words, actions are selected in proportion to the amount of positive counterfactual regret for not playing that action. [sent-110, score-0.911]
51 If no action has positive counterfactual regret, then an action is selected uniformly at random. [sent-111, score-0.45]
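A minimal sketch of the action-selection rule in Equation 8 at a single information set; the information-set key, action names, and regret values below are hypothetical, and in practice the regret table is filled in by the learning procedure.

```python
import random

# Hypothetical snapshot of cumulative counterfactual regrets at one poker
# information set; during learning this table is updated every iteration.
cum_regret = {
    "QsQh|raise": {"fold": -3.2, "call": 1.4, "raise": 4.1},
}

def strategy_from_regret(info_set, actions):
    """Equation 8: mix actions in proportion to positive cumulative
    counterfactual regret; uniform (1/|A(I)|) if none is positive."""
    r = cum_regret.get(info_set, {})
    pos = {a: max(r.get(a, 0.0), 0.0) for a in actions}
    norm = sum(pos.values())
    if norm > 0:
        return {a: pos[a] / norm for a in actions}
    return {a: 1.0 / len(actions) for a in actions}

probs = strategy_from_regret("QsQh|raise", ["fold", "call", "raise"])
print(probs)  # {'fold': 0.0, 'call': 0.2545..., 'raise': 0.7454...}
print(random.choices(list(probs), weights=list(probs.values()))[0])
```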
52 In addition, the bound on the average overall regret is linear in the number of information sets. [sent-116, score-0.485]
53 Meanwhile, minimizing counterfactual regret does not require a costly quadratic program projection on each iteration. [sent-118, score-0.852]
54 4 Application to Poker. We now describe how we use counterfactual regret minimization to compute a near-equilibrium solution in the domain of poker. [sent-120, score-0.975]
55 The poker variant we focus on is heads-up limit Texas Hold’em, as it is used in the AAAI Computer Poker Competition [9]. [sent-121, score-0.364]
56 The game consists of two players (zero-sum), four rounds of cards being dealt, and four rounds of betting, and has just under $10^{18}$ game states [2]. [sent-122, score-0.905]
57 As with all previous work on this domain, we will first abstract the game and find an equilibrium of the abstracted game. [sent-123, score-0.472]
58 In the terminology of extensive games, we will merge information sets; in the terminology of poker, we will bucket card sequences. [sent-124, score-0.446]
59 Hence, the ability to solve a larger game means less abstraction is required, translating into a stronger poker-playing program. [sent-127, score-0.99]
60 4.1 Abstraction. The goal of abstraction is to reduce the number of information sets for each player to a tractable size such that the abstract game can be solved. [sent-129, score-0.867]
61 Early poker abstractions [2, 4] involved limiting the possible sequences of bets, e. [sent-130, score-0.505]
62 More recently, abstractions involving full four-round games with the full four bets per round have proven to be a significant improvement [7, 6]. [sent-133, score-0.585]
63 Our abstraction groups together observed card sequences based on a metric called hand strength squared. [sent-135, score-0.476]
64 Hand strength is the expected probability of winning, given only the cards a player has seen. [sent-136, score-0.406]
65 Hand strength squared is the expected square of the hand strength after the last card is revealed, given only the cards a player has seen. [sent-138, score-0.617]
66 Intuitively, hand strength squared is similar to hand strength but gives a bonus to card sequences whose eventual hand strength has higher variance. [sent-139, score-0.388]
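A sketch of how hand strength and hand strength squared could be estimated, assuming a 7-card evaluator is available; `hand_rank` is a hypothetical placeholder you would replace with a real evaluator, and the enumeration and rollout details are illustrative rather than the paper's implementation.

```python
import random
from itertools import combinations, product

RANKS, SUITS = "23456789TJQKA", "cdhs"
DECK = [r + s for r, s in product(RANKS, SUITS)]

def hand_rank(seven_cards):
    """Hypothetical 7-card evaluator returning a comparable rank (higher is
    better); any standard poker hand evaluator could be plugged in here."""
    raise NotImplementedError

def hand_strength(hole, board):
    """P(win) + 0.5 P(tie) versus a uniformly random opponent holding, with
    the five-card board fully dealt."""
    live = [c for c in DECK if c not in set(hole) | set(board)]
    ours = hand_rank(hole + board)
    score = n = 0
    for opp in combinations(live, 2):
        theirs = hand_rank(list(opp) + board)
        score += (ours > theirs) + 0.5 * (ours == theirs)
        n += 1
    return score / n

def hand_strength_squared(hole, board, rollouts=1000):
    """E[HS^2]: average squared hand strength over random completions of the
    board; relative to E[HS], this favors high-variance holdings."""
    live = [c for c in DECK if c not in set(hole) | set(board)]
    total = 0.0
    for _ in range(rollouts):
        runout = random.sample(live, 5 - len(board))
        total += hand_strength(hole, board + runout) ** 2
    return total / rollouts
```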
67 More importantly, we will show in Section 5 that this metric for abstraction results in stronger poker strategies. [sent-141, score-0.586]
68 The final abstraction is generated by partitioning card sequences based on the hand strength squared metric. [sent-142, score-0.476]
69 Then, all round-two card sequences that shared a round-one bucket are partitioned into ten equally sized buckets based on the metric now applied at round two. [sent-146, score-0.61]
70 Thus, a partition of card sequences in round two is a pair of numbers: its bucket in the previous round and its bucket in the current round given its bucket in the previous round. [sent-147, score-0.917]
71 This is repeated after each round, continuing to partition card sequences that agreed on the previous rounds' buckets into ten equally sized buckets based on the metric applied in that round. [sent-148, score-0.464]
72 Thus, card sequences are partitioned into bucket sequences: a bucket from {1, . [sent-149, score-0.523]
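A sketch of the nested percentile bucketing just described, under an assumed data layout (dictionaries keyed by card-sequence identifiers); all names are illustrative.

```python
from collections import defaultdict

def extend_buckets(history, metric, n_buckets=10):
    """Assign this round's bucket to each card sequence.

    history: dict mapping sequence id -> tuple of buckets from earlier rounds
             (empty tuple for round one)
    metric:  dict mapping sequence id -> hand strength squared at this round
    Returns: dict mapping sequence id -> tuple extended by this round's bucket.
    """
    groups = defaultdict(list)
    for seq, past in history.items():
        groups[past].append(seq)

    extended = {}
    for past, group in groups.items():
        group.sort(key=lambda s: metric[s])  # order by the metric
        for i, seq in enumerate(group):
            bucket = min(i * n_buckets // len(group), n_buckets - 1)
            extended[seq] = past + (bucket + 1,)  # buckets numbered 1..10
    return extended

# Round one: every sequence shares the empty history, so this is a plain
# ten-way percentile split; later rounds split within each bucket history.
```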
73 So although this represents a significant abstraction of the original game, it is two orders of magnitude larger than previously solved abstractions. [sent-159, score-0.607]
74 4.2 Minimizing Counterfactual Regret. Now that we have specified an abstraction, we can use counterfactual regret minimization to compute an approximate equilibrium for this game. [sent-161, score-0.975]
75 The basic procedure involves having two players repeatedly play the game using the counterfactual regret minimizing strategy from Equation 8. [sent-162, score-1.326]
76 In the full version, we show that the bound for poker is actually independent of the size of the card abstraction. [sent-166, score-0.519]
77 For our experiments, we actually use a variation of this basic procedure, which exploits the fact that our abstraction has a small number of information sets relative to the number of game states. [sent-167, score-0.579]
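As a concrete, self-contained illustration of this self-play procedure, the sketch below runs a chance-sampled variant of counterfactual regret minimization on Kuhn poker, a standard three-card toy game, rather than on the paper's Hold'em abstraction. Each iteration samples one deal (the chance event), walks only the betting tree, applies Equation 8 at every information set, and accumulates the average strategy of Equation 4; all names are illustrative.

```python
import random

regret_sum = {}    # info_set -> [regret_pass, regret_bet]
strategy_sum = {}  # info_set -> [weight_pass, weight_bet]

def get_strategy(info_set, reach_weight):
    """Equation 8 (regret matching) plus accumulation of the average strategy."""
    r = regret_sum.setdefault(info_set, [0.0, 0.0])
    pos = [max(x, 0.0) for x in r]
    norm = sum(pos)
    strat = [x / norm for x in pos] if norm > 0 else [0.5, 0.5]
    s = strategy_sum.setdefault(info_set, [0.0, 0.0])
    for a in range(2):
        s[a] += reach_weight * strat[a]
    return strat

def cfr(cards, history, p0, p1):
    """Expected utility of `history` for the player about to act; p0/p1 are
    the players' probabilities of reaching this history."""
    player = len(history) % 2
    # Terminal payoffs: 'p' = pass/fold/check, 'b' = bet/call.
    if len(history) > 1:
        higher = cards[player] > cards[1 - player]
        if history[-1] == "p":
            if history == "pp":                 # check-check: showdown for 1
                return 1.0 if higher else -1.0
            return 1.0                          # opponent folded to a bet
        if history[-2:] == "bb":                # bet-call: showdown for 2
            return 2.0 if higher else -2.0
    info_set = str(cards[player]) + history
    strat = get_strategy(info_set, p0 if player == 0 else p1)
    util = [0.0, 0.0]
    node_util = 0.0
    for a, move in enumerate("pb"):
        if player == 0:
            util[a] = -cfr(cards, history + move, p0 * strat[a], p1)
        else:
            util[a] = -cfr(cards, history + move, p0, p1 * strat[a])
        node_util += strat[a] * util[a]
    # Counterfactual regret is weighted by the opponent's reach probability.
    opp_reach = p1 if player == 0 else p0
    for a in range(2):
        regret_sum[info_set][a] += opp_reach * (util[a] - node_util)
    return node_util

def train(iterations=200_000):
    cards = [1, 2, 3]
    total = 0.0
    for _ in range(iterations):
        random.shuffle(cards)  # sample the chance event (the deal) once
        total += cfr(cards, "", 1.0, 1.0)
    return total / iterations  # -> approx -1/18 for the first player

if __name__ == "__main__":
    print("avg game value:", train())
    for info_set in sorted(strategy_sum):
        s = strategy_sum[info_set]
        print(info_set, [round(x / sum(s), 3) for x in s])
```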
78 5 Experimental Results. Before discussing the results, it is useful to consider how one evaluates the strength of a near-equilibrium poker strategy. [sent-181, score-0.531]
79 In a symmetric, zero-sum game like heads-up poker (see footnote 4), a perfect equilibrium has zero exploitability, while an $\epsilon$-Nash equilibrium has exploitability $\epsilon$. [sent-183, score-0.752]
80 To provide some intuition for these numbers, a player that always folds will lose 750 mb/h while a player that is 10 mb/h stronger than another would require over one million hands to be 95% certain to have won overall. [sent-185, score-0.701]
81 For strategies in a reasonably sized abstraction it is possible to compute their exploitability within their own abstract game. [sent-187, score-0.425]
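In the abstract game this is a best-response computation over the game tree; as a minimal self-contained illustration of the quantity itself, here is exploitability for a fixed strategy in a zero-sum matrix game (a far simpler setting than the poker computation; the example game is illustrative).

```python
import numpy as np

# Row player's payoffs in rock-paper-scissors; the game value is 0 and the
# uniform strategy is the (unexploitable) equilibrium.
A = np.array([[ 0.0, -1.0,  1.0],
              [ 1.0,  0.0, -1.0],
              [-1.0,  1.0,  0.0]])

def exploitability(row_strategy, payoffs, game_value=0.0):
    """How much a best-responding column player wins beyond what it would
    earn against an equilibrium (its value at equilibrium is -game_value)."""
    best_response_value = np.max(-(row_strategy @ payoffs))
    return float(best_response_value - (-game_value))

print(exploitability(np.array([1/3, 1/3, 1/3]), A))    # 0.0
print(exploitability(np.array([0.5, 0.25, 0.25]), A))  # 0.25: over-plays rock
```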
82 It is therefore common to compare the performance of the strategy in the full game against a battery of known strong poker-playing programs. [sent-190, score-0.841]
83 We used the sampled counterfactual regret minimization procedure to find an approximate equilibrium for our abstract game as described in the previous section. [sent-192, score-1.302]
84 The resulting strategy’s exploitability within its own abstract game is 2. [sent-194, score-0.461]
85 Notice that the algorithm visits only 18,496 game states per iteration. [sent-197, score-0.365]
86 After 200 million iterations each game state has been visited less than 2. [sent-198, score-0.393]
87 These abstractions used fewer buckets per round to partition the card sequences. [sent-202, score-0.444]
88 In addition to ten buckets, we also solved eight, six, and five. (Footnote 3: A regret analysis of this variant in poker is included in the full version.) [sent-203, score-0.846]
89 (Footnote 4: A single hand of poker is not a symmetric game, as the order of betting is strategically significant.) [sent-206, score-0.765]
90 For example, the program computed with the five-bucket approximation (CFR5) is about 250 times smaller, with just under $10^{10}$ game states. [sent-219, score-0.496]
91 The y-axis is exploitability, while the x-axis is the number of iterations normalized by the number of information sets in the particular abstraction being plotted. [sent-224, score-0.389]
92 5.2 Performance in Full Texas Hold’em. We have noted that the ability to solve larger games means less abstraction is necessary, resulting in an overall stronger poker-playing program. [sent-229, score-0.891]
93 We have played our four near-equilibrium bots with various abstraction sizes against each other and two other known strong programs: PsOpti4 and S2298. [sent-230, score-0.443]
94 In terms of the size of the abstract game, PsOpti4 is the smallest, consisting of a small number of merged three-round games. [sent-238, score-0.407]
95 S2298 restricts the number of bets per round to 3 and uses a five-bucket-per-round card abstraction based on hand strength, resulting in an abstraction slightly smaller than CFR5. [sent-239, score-0.963]
96 6 Conclusion. We introduced a new regret concept for extensive games called counterfactual regret. [sent-252, score-1.133]
97 We showed that minimizing counterfactual regret minimizes overall regret, and presented a general and poker-specific algorithm for efficiently minimizing counterfactual regret. [sent-253, score-1.702]
98 We demonstrated the technique in the domain of poker, showing that it can compute an approximate equilibrium for abstractions with as many as $10^{12}$ states, two orders of magnitude larger than previous methods. [sent-254, score-0.383]
99 We also showed that the resulting poker-playing program outperforms other strong programs, including all of the competitors from the bankroll portion of the 2006 AAAI Computer Poker Competition. [sent-255, score-0.485]
100 A competitive Texas Hold’em poker player via automated abstraction and real-time equilibrium computation. [sent-279, score-1.078]
wordName wordTfidf (topN-words)
[('regret', 0.431), ('counterfactual', 0.371), ('poker', 0.346), ('game', 0.327), ('player', 0.308), ('abstraction', 0.214), ('games', 0.173), ('bucket', 0.15), ('card', 0.148), ('equilibrium', 0.145), ('nash', 0.139), ('extensive', 0.13), ('exploitability', 0.115), ('abstractions', 0.108), ('strategy', 0.086), ('ui', 0.086), ('round', 0.08), ('buckets', 0.08), ('betting', 0.069), ('bots', 0.066), ('texas', 0.065), ('players', 0.061), ('ii', 0.06), ('ri', 0.059), ('bets', 0.058), ('cards', 0.058), ('playing', 0.057), ('actions', 0.052), ('sequences', 0.051), ('sized', 0.051), ('equilibria', 0.049), ('history', 0.045), ('strategies', 0.045), ('iterations', 0.042), ('strength', 0.04), ('states', 0.038), ('overall', 0.036), ('aaai', 0.036), ('hands', 0.035), ('fc', 0.035), ('terminal', 0.035), ('gilpin', 0.035), ('competition', 0.034), ('pre', 0.033), ('technique', 0.032), ('bowling', 0.032), ('histories', 0.032), ('minimizing', 0.031), ('rounds', 0.029), ('immediate', 0.029), ('minimization', 0.028), ('zinkevich', 0.028), ('partition', 0.028), ('concept', 0.028), ('hold', 0.027), ('action', 0.027), ('pro', 0.027), ('utility', 0.027), ('hedging', 0.027), ('nonterminal', 0.027), ('stronger', 0.026), ('ten', 0.026), ('le', 0.026), ('dealt', 0.025), ('winning', 0.025), ('imperfect', 0.025), ('full', 0.025), ('million', 0.024), ('partitioned', 0.024), ('orders', 0.024), ('lh', 0.023), ('bankroll', 0.023), ('exploitable', 0.023), ('hyperborean', 0.023), ('johanson', 0.023), ('winnings', 0.023), ('hand', 0.023), ('em', 0.022), ('member', 0.022), ('magnitude', 0.022), ('portion', 0.021), ('matches', 0.021), ('duplicate', 0.021), ('perfect', 0.02), ('larger', 0.02), ('programs', 0.02), ('exploits', 0.02), ('multiagent', 0.02), ('parallelization', 0.02), ('play', 0.019), ('resulting', 0.019), ('program', 0.019), ('days', 0.019), ('core', 0.019), ('edmonton', 0.019), ('chance', 0.018), ('variant', 0.018), ('information', 0.018), ('minimized', 0.018), ('four', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 165 nips-2007-Regret Minimization in Games with Incomplete Information
2 0.52367371 55 nips-2007-Computing Robust Counter-Strategies
Author: Michael Johanson, Martin Zinkevich, Michael Bowling
Abstract: Adaptation to other initially unknown agents often requires computing an effective counter-strategy. In the Bayesian paradigm, one must find a good counterstrategy to the inferred posterior of the other agents’ behavior. In the experts paradigm, one may want to choose experts that are good counter-strategies to the other agents’ expected behavior. In this paper we introduce a technique for computing robust counter-strategies for adaptation in multiagent scenarios under a variety of paradigms. The strategies can take advantage of a suspected tendency in the decisions of the other agents, while bounding the worst-case performance when the tendency is not observed. The technique involves solving a modified game, and therefore can make use of recently developed algorithms for solving very large extensive games. We demonstrate the effectiveness of the technique in two-player Texas Hold’em. We show that the computed poker strategies are substantially more robust than best response counter-strategies, while still exploiting a suspected tendency. We also compose the generated strategies in an experts algorithm showing a dramatic improvement in performance over using simple best responses. 1
3 0.44213694 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria
Author: Elad Hazan, Satyen Kale
Abstract: We study the relation between notions of game-theoretic equilibria which are based on stability under a set of deviations, and empirical equilibria which are reached by rational players. Rational players are modeled by players using no regret algorithms, which guarantee that their payoff in the long run is close to the maximum they could hope to achieve by consistently deviating from the algorithm’s suggested action. We show that for a given set of deviations over the strategy set of a player, it is possible to efficiently approximate fixed points of a given deviation if and only if there exist efficient no regret algorithms resistant to the deviations. Further, we show that if all players use a no regret algorithm, then the empirical distribution of their plays converges to an equilibrium. 1
4 0.22260118 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
Author: John Langford, Tong Zhang
Abstract: We present Epoch-Greedy, an algorithm for contextual multi-armed bandits (also known as bandits with side information). Epoch-Greedy has the following properties: 1. No knowledge of a time horizon $T$ is necessary. 2. The regret incurred by Epoch-Greedy is controlled by a sample complexity bound for a hypothesis class. 3. The regret scales as $O(T^{2/3} S^{1/3})$ or better (sometimes, much better). Here $S$ is the complexity term in a sample complexity bound for standard supervised learning.
5 0.20130266 199 nips-2007-The Price of Bandit Information for Online Optimization
Author: Varsha Dani, Sham M. Kakade, Thomas P. Hayes
Abstract: In the online linear optimization problem, a learner must choose, in each round, a decision from a set $D \subset \mathbb{R}^n$ in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full information case, the upper bound on the regret is $O^*(\sqrt{nT})$, where $n$ is the ambient dimension and $T$ is the time horizon. For the bandit case, we present an algorithm which achieves $O^*(n^{3/2}\sqrt{T})$ regret — all previous (nontrivial) bounds here were $O(\mathrm{poly}(n)\, T^{2/3})$ or worse. It is striking that the convergence rate for the bandit setting is only a factor of $n$ worse than in the full information case — in stark contrast to the $K$-arm bandit setting, where the gap in the dependence on $K$ is exponential ($\sqrt{TK}$ vs. $\sqrt{T \log K}$). We also present lower bounds showing that this gap is at least $\sqrt{n}$, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems.
6 0.17553049 21 nips-2007-Adaptive Online Gradient Descent
7 0.15652271 208 nips-2007-TrueSkill Through Time: Revisiting the History of Chess
8 0.13227229 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
9 0.1163986 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
10 0.11170844 57 nips-2007-Congruence between model and human attention reveals unique signatures of critical visual events
11 0.047295071 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
12 0.040590908 52 nips-2007-Competition Adds Complexity
13 0.037926868 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling
14 0.036829785 162 nips-2007-Random Sampling of States in Dynamic Programming
15 0.031878941 100 nips-2007-Hippocampal Contributions to Control: The Third Way
16 0.029077701 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
17 0.028636295 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
18 0.028533423 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
19 0.028160073 41 nips-2007-COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking
20 0.027152285 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
topicId topicWeight
[(0, -0.13), (1, -0.27), (2, 0.04), (3, 0.046), (4, 0.572), (5, -0.05), (6, -0.168), (7, 0.146), (8, 0.043), (9, 0.175), (10, -0.211), (11, -0.204), (12, 0.08), (13, -0.072), (14, 0.1), (15, 0.015), (16, -0.055), (17, 0.07), (18, -0.018), (19, -0.022), (20, 0.133), (21, 0.017), (22, 0.008), (23, 0.121), (24, -0.01), (25, 0.046), (26, 0.045), (27, 0.11), (28, 0.008), (29, -0.105), (30, -0.013), (31, 0.041), (32, -0.044), (33, -0.012), (34, -0.011), (35, 0.007), (36, -0.022), (37, 0.011), (38, 0.018), (39, -0.034), (40, 0.013), (41, 0.006), (42, -0.021), (43, 0.019), (44, 0.056), (45, 0.02), (46, -0.037), (47, 0.031), (48, -0.035), (49, -0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.98589575 165 nips-2007-Regret Minimization in Games with Incomplete Information
2 0.88587409 55 nips-2007-Computing Robust Counter-Strategies
3 0.80716956 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria
4 0.52382547 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
5 0.49174485 57 nips-2007-Congruence between model and human attention reveals unique signatures of critical visual events
Author: Robert Peters, Laurent Itti
Abstract: Current computational models of bottom-up and top-down components of attention are predictive of eye movements across a range of stimuli and of simple, fixed visual tasks (such as visual search for a target among distractors). However, to date there exists no computational framework which can reliably mimic human gaze behavior in more complex environments and tasks, such as driving a vehicle through traffic. Here, we develop a hybrid computational/behavioral framework, combining simple models for bottom-up salience and top-down relevance, and looking for changes in the predictive power of these components at different critical event times during 4.7 hours (500,000 video frames) of observers playing car racing and flight combat video games. This approach is motivated by our observation that the predictive strengths of the salience and relevance models exhibit reliable temporal signatures during critical event windows in the task sequence—for example, when the game player directly engages an enemy plane in a flight combat game, the predictive strength of the salience model increases significantly, while that of the relevance model decreases significantly. Our new framework combines these temporal signatures to implement several event detectors. Critically, we find that an event detector based on fused behavioral and stimulus information (in the form of the model’s predictive strength) is much stronger than detectors based on behavioral information alone (eye position) or image information alone (model prediction maps). This approach to event detection, based on eye tracking combined with computational models applied to the visual input, may have useful applications as a less-invasive alternative to other event detection approaches based on neural signatures derived from EEG or fMRI recordings. 1
6 0.43844664 208 nips-2007-TrueSkill Through Time: Revisiting the History of Chess
7 0.43687886 52 nips-2007-Competition Adds Complexity
8 0.34680378 199 nips-2007-The Price of Bandit Information for Online Optimization
9 0.26742011 21 nips-2007-Adaptive Online Gradient Descent
10 0.22788909 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
11 0.21170884 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
12 0.20985234 171 nips-2007-Scan Strategies for Meteorological Radars
13 0.13986139 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
14 0.12673078 174 nips-2007-Selecting Observations against Adversarial Objectives
15 0.11791665 89 nips-2007-Feature Selection Methods for Improving Protein Structure Prediction with Rosetta
16 0.11739152 100 nips-2007-Hippocampal Contributions to Control: The Third Way
17 0.10553996 78 nips-2007-Efficient Principled Learning of Thin Junction Trees
18 0.099128649 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling
19 0.099050581 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
20 0.098255679 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents
topicId topicWeight
[(5, 0.02), (13, 0.022), (16, 0.031), (21, 0.041), (29, 0.225), (31, 0.037), (34, 0.027), (35, 0.02), (47, 0.059), (49, 0.013), (68, 0.237), (83, 0.082), (85, 0.018), (87, 0.016), (90, 0.053)]
simIndex simValue paperId paperTitle
same-paper 1 0.75362885 165 nips-2007-Regret Minimization in Games with Incomplete Information
2 0.67557031 55 nips-2007-Computing Robust Counter-Strategies
3 0.59939981 169 nips-2007-Retrieved context and the discovery of semantic structure
Author: Vinayak Rao, Marc Howard
Abstract: Semantic memory refers to our knowledge of facts and relationships between concepts. A successful semantic memory depends on inferring relationships between items that are not explicitly taught. Recent mathematical modeling of episodic memory argues that episodic recall relies on retrieval of a gradually-changing representation of temporal context. We show that retrieved context enables the development of a global memory space that reflects relationships between all items that have been previously learned. When newly-learned information is integrated into this structure, it is placed in some relationship to all other items, even if that relationship has not been explicitly learned. We demonstrate this effect for global semantic structures shaped topologically as a ring, and as a two-dimensional sheet. We also examined the utility of this learning algorithm for learning a more realistic semantic space by training it on a large pool of synonym pairs. Retrieved context enabled the model to “infer” relationships between synonym pairs that had not yet been presented. 1
4 0.57006437 37 nips-2007-Blind channel identification for speech dereverberation using l1-norm sparse learning
Author: Yuanqing Lin, Jingdong Chen, Youngmoo Kim, Daniel D. Lee
Abstract: Speech dereverberation remains an open problem after more than three decades of research. The most challenging step in speech dereverberation is blind channel identification (BCI). Although many BCI approaches have been developed, their performance is still far from satisfactory for practical applications. The main difficulty in BCI lies in finding an appropriate acoustic model, which not only can effectively resolve solution degeneracies due to the lack of knowledge of the source, but also robustly models real acoustic environments. This paper proposes a sparse acoustic room impulse response (RIR) model for BCI, that is, an acoustic RIR can be modeled by a sparse FIR filter. Under this model, we show how to formulate the BCI of a single-input multiple-output (SIMO) system into an $\ell_1$-norm regularized least squares (LS) problem, which is convex and can be solved efficiently with guaranteed global convergence. The sparseness of solutions is controlled by $\ell_1$-norm regularization parameters. We propose a sparse learning scheme that infers the optimal $\ell_1$-norm regularization parameters directly from microphone observations under a Bayesian framework. Our results show that the proposed approach is effective and robust, and it yields source estimates in real acoustic environments with high fidelity to anechoic chamber measurements.
5 0.51668155 64 nips-2007-Cooled and Relaxed Survey Propagation for MRFs
Author: Hai L. Chieu, Wee S. Lee, Yee W. Teh
Abstract: We describe a new algorithm, Relaxed Survey Propagation (RSP), for finding MAP configurations in Markov random fields. We compare its performance with state-of-the-art algorithms including the max-product belief propagation, its sequential tree-reweighted variant, residual (sum-product) belief propagation, and tree-structured expectation propagation. We show that it outperforms all approaches for Ising models with mixed couplings, as well as on a web person disambiguation task formulated as a supervised clustering problem. 1
6 0.4656229 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
7 0.45485911 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
8 0.37436959 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
9 0.37407932 208 nips-2007-TrueSkill Through Time: Revisiting the History of Chess
10 0.36758474 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria
11 0.36705476 63 nips-2007-Convex Relaxations of Latent Variable Training
12 0.36549556 174 nips-2007-Selecting Observations against Adversarial Objectives
13 0.36513931 156 nips-2007-Predictive Matrix-Variate t Models
14 0.36407349 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
15 0.36305389 128 nips-2007-Message Passing for Max-weight Independent Set
16 0.36236599 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
17 0.36101389 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
18 0.36042643 57 nips-2007-Congruence between model and human attention reveals unique signatures of critical visual events
19 0.36037889 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
20 0.35907269 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model