nips nips2012 nips2012-109 knowledge-graph by maker-knowledge-mining

109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions


Source: pdf

Author: Neil Burch, Marc Lanctot, Duane Szafron, Richard G. Gibson

Abstract: Counterfactual Regret Minimization (CFR) is a popular, iterative algorithm for computing strategies in extensive-form games. The Monte Carlo CFR (MCCFR) variants reduce the per iteration time cost of CFR by traversing a smaller, sampled portion of the tree. The previous most effective instances of MCCFR can still be very slow in games with many player actions since they sample every action for a given player. In this paper, we present a new MCCFR algorithm, Average Strategy Sampling (AS), that samples a subset of the player’s actions according to the player’s average strategy. Our new algorithm is inspired by a new, tighter bound on the number of iterations required by CFR to converge to a given solution quality. In addition, we prove a similar, tighter bound for AS and other popular MCCFR variants. Finally, we validate our work by demonstrating that AS converges faster than previous MCCFR algorithms in both no-limit poker and Bluff. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The previous most effective instances of MCCFR can still be very slow in games with many player actions since they sample every action for a given player. [sent-4, score-0.945]

2 1 Introduction: An extensive-form game is a common formalism used to model sequential decision making problems containing multiple agents, imperfect information, and chance events. [sent-9, score-0.321]

3 Monte Carlo CFR (MCCFR) [9] can be used to reduce the traversal time per iteration by considering only a sampled portion of the game tree. [sent-16, score-0.294]

4 For example, Chance Sampling (CS) [12] is an instance of MCCFR that only traverses the portion of the game tree corresponding to a single, sampled sequence of chance’s actions. [sent-17, score-0.317]

5 However, in games where a player has many possible actions, such as no-limit poker, iterations of CS are still very time consuming. [sent-18, score-0.732]

6 This is because CS considers all possible player actions, even if many actions are poor or contribute little to the algorithm’s computation. [sent-19, score-0.674]

7 Our main contribution in this paper is a new MCCFR algorithm that samples player actions and is suitable for games involving many player choices. [sent-20, score-1.406]

8 By using a player’s average strategy to sample actions, convergence time is significantly reduced in large games with many player actions. [sent-23, score-0.822]

9 2 Background: A finite extensive game contains a game tree with nodes corresponding to histories of actions h ∈ H and edges corresponding to actions a ∈ A(h) available to player P(h) ∈ N ∪ {c} (where N is the set of players and c denotes chance). [sent-25, score-1.568]

10 Each terminal history z ∈ Z has associated utilities ui (z) for each player i. [sent-27, score-0.677]

11 We define ∆_i = max_{z,z′∈Z} [u_i(z) − u_i(z′)] to be the range of utilities for player i. [sent-28, score-0.673]

12 Non-terminal histories are partitioned into information sets I ∈ Ii representing the different game states that player i cannot distinguish between. [sent-29, score-0.856]

13 For example, in poker, player i does not see the private cards dealt to the opponents, and thus all histories differing only in the private cards of the opponents are in the same information set for player i. [sent-30, score-1.345]

14 We define |A_i| = max_{I∈I_i} |A(I)| to be the maximum number of actions available to player i at any information set. [sent-32, score-0.674]

15 A (behavioral) strategy for player i, σi ∈ Σi , is a function that maps each information set I ∈ Ii to a probability distribution over A(I). [sent-34, score-0.607]

16 Let ui (σ) denote the expected utility for player i, given that all players play according to σ. [sent-39, score-0.753]

17 Let π^σ(h) be the probability of history h occurring if all players choose actions according to σ. [sent-41, score-0.278]

18 We can decompose π^σ(h) = ∏_{i∈N∪{c}} π_i^σ(h), where π_i^σ(h) is the contribution to this probability from player i when playing according to σ_i (or from chance when i = c). [sent-42, score-0.626]

19 Let π^σ_{−i}(h) be the product of all players’ contributions (including chance) except that of player i. [sent-43, score-0.547]

20 Furthermore, for I ∈ I_i, the probability of player i playing to reach I is π_i^σ(I) = π_i^σ(h) for any h ∈ I, which is well-defined due to perfect recall. [sent-45, score-0.547]

21 A best response to σ−i is a strategy that maximizes player i’s expected payoff against σ−i . [sent-46, score-0.629]

22 The best response value for player i is the value of that strategy, b_i(σ_{−i}) = max_{σ_i∈Σ_i} u_i(σ_i, σ_{−i}). [sent-47, score-0.671]

23 A strategy profile σ is an ε-Nash equilibrium if no player can unilaterally deviate from σ and gain more than ε. [sent-48, score-0.657]

24 In this case, the exploitability of σ, e(σ) = (b1 (σ2 )+b2 (σ1 ))/2, measures how much σ loses to a worst case opponent when players alternate positions. [sent-52, score-0.463]
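
The exploitability computation in sentence 24 is just an average of the two best-response values. A minimal sketch, assuming a hypothetical best_response_value(i, sigma_opponent) routine is available for the game at hand (it is not part of this paper):

```python
def exploitability(sigma1, sigma2, best_response_value):
    """e(sigma) = (b1(sigma2) + b2(sigma1)) / 2 for a two-player zero-sum game.

    best_response_value is a hypothetical helper returning b_i(sigma_{-i}).
    An exact Nash equilibrium has exploitability 0; an eps-Nash equilibrium
    has exploitability at most eps in the zero-sum case.
    """
    b1 = best_response_value(1, sigma2)  # value of player 1's best response to sigma2
    b2 = best_response_value(2, sigma1)  # value of player 2's best response to sigma1
    return (b1 + b2) / 2.0
```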

25 CFR has also been shown to work well in games with more than two players [1, 3]. [sent-55, score-0.303]

26 On each iteration t, the base algorithm, “vanilla” CFR, traverses the entire game tree once per player, computing the expected utility for player i at each information set I ∈ I_i under the current profile σ^t, assuming player i plays to reach I. [sent-56, score-1.379]

27 This expectation is the counterfactual value for player i, v_i(I, σ) = Σ_{z∈Z_I} u_i(z) π^σ_{−i}(z[I]) π^σ(z[I], z), where Z_I is the set of terminal histories passing through I and z[I] is that history along z contained in I. [sent-57, score-0.973]

28 For each action a ∈ A(I), these values determine the counterfactual regret at iteration t, r_i^t(I, a) = v_i(I, σ^t_{(I→a)}) − v_i(I, σ^t), where σ^t_{(I→a)} is the profile σ^t except that at I, action a is always taken. [sent-58, score-0.698]

29 The regret r_i^t(I, a) measures how much player i would rather play action a at I than play σ^t. [sent-59, score-0.953]

30 The original CFR analysis shows that player i’s regret is bounded by the sum of the positive parts of the cumulative counterfactual regrets R_i^{T,+}(I, a): Theorem 1 (Zinkevich et al. [sent-63, score-1.068]

31 R_i^T ≤ Σ_{I∈I_i} max_{a∈A(I)} R_i^{T,+}(I, a). Regret matching minimizes the average of the cumulative counterfactual regrets, and so player i’s average regret is also minimized by Theorem 1. [sent-65, score-1.063]
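
Sentence 31 refers to regret matching without spelling it out; a minimal sketch of the standard update (our own illustration, with one information set's cumulative regrets passed in as a dict) is:

```python
def regret_matching(cumulative_regret):
    """Current strategy at one information set from cumulative counterfactual regrets.

    Plays each action in proportion to the positive part of its cumulative regret,
    R^{T,+}(I, a); if no regret is positive, falls back to the uniform distribution.
    """
    positive = {a: max(r, 0.0) for a, r in cumulative_regret.items()}
    total = sum(positive.values())
    if total > 0.0:
        return {a: r / total for a, r in positive.items()}
    return {a: 1.0 / len(positive) for a in positive}
```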

32 For each player i, let B_i be the partition of I_i such that two information sets I, I′ are in the same part B ∈ B_i if and only if player i’s sequence of actions leading to I is the same as the sequence of actions leading to I′. [sent-66, score-1.348]

33 Next, define the M-value of the game to player i to be M_i = Σ_{B∈B_i} √|B|. [sent-68, score-0.789]

34 The best known bound on player i’s average regret is: Theorem 2 (Lanctot et al. [sent-69, score-0.774]

35 [9]) When using vanilla CFR, average regret is bounded by R_i^T / T ≤ ∆_i M_i √|A_i| / √T. [sent-70, score-0.282]
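
If the reconstruction of the Theorem 2 bound above is right, solving it for T gives a rough iteration count for a target average regret ε; the rearrangement below is our own and is not stated in the extracted text.

```latex
\[
  \frac{R_i^T}{T} \le \frac{\Delta_i M_i \sqrt{|A_i|}}{\sqrt{T}}
  \quad\Longrightarrow\quad
  \frac{R_i^T}{T} \le \epsilon
  \;\text{ whenever }\;
  T \ge \left(\frac{\Delta_i M_i \sqrt{|A_i|}}{\epsilon}\right)^{2}.
\]
```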

36 For example, Chance Sampling (CS) [12] is an instance of MCCFR that partitions Z into blocks such that two histories are in the same block if and only if no two chance actions differ. [sent-78, score-0.273]

37 In general, the sampled counterfactual value for player i is ṽ_i(I, σ) = Σ_{z∈Z_I∩Q_j} u_i(z) π^σ_{−i}(z[I]) π^σ(z[I], z) / q(z), where q(z) = Σ_{k: z∈Q_k} q_k is the probability that z was sampled. [sent-81, score-0.896]

38 Define the sampled counterfactual regret for action a at I to be r̃_i^t(I, a) = ṽ_i(I, σ^t_{(I→a)}) − ṽ_i(I, σ^t). [sent-83, score-0.643]
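
The division by q(z) in sentence 37 is an importance-sampling correction: under the stated sampling scheme the sampled counterfactual value matches the true one in expectation over the sampled block, a property shown by Lanctot et al. [9]. Written out in the notation reconstructed above:

```latex
\[
  \mathbb{E}\!\left[\tilde v_i(I,\sigma)\right]
  \;=\; \sum_{z \in Z_I} u_i(z)\, \pi^{\sigma}_{-i}(z[I])\, \pi^{\sigma}(z[I], z)
  \;=\; v_i(I,\sigma).
\]
```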

39 CS has been shown to significantly reduce computing time in poker games [11, Appendix A. [sent-85, score-0.303]

40 ES takes CS one step further by considering only a single action for not only chance, but also for the opponents, where opponent actions are sampled according to the current profile σ^t_{−i}. [sent-89, score-0.349]

41 Since both algorithms generate blocks by sampling actions independently, we can decompose q(z) = ∏_{i∈N∪{c}} q_i(z) so that q_i(z) is the probability contributed to q(z) by sampling player i’s actions. [sent-93, score-0.812]

42 While Zinkevich et al. [12] bound a player’s regret by a sum of cumulative counterfactual regrets (Theorem 1), we can actually equate a player’s regret to a weighted sum of cumulative counterfactual regrets. [sent-101, score-0.718]

43 For a strategy σ_i ∈ Σ_i and an information set I ∈ I_i, define R_i^T(I, σ_i) = Σ_{a∈A(I)} σ_i(I, a) R_i^T(I, a). In addition, let σ_i^* ∈ Σ_i be a player i strategy such that … [sent-102, score-0.667]

44 For vanilla CFR, we can simply replace M_i in Theorem 2 with M_i(σ_i^*): Theorem 5 When using vanilla CFR, average regret is bounded by R_i^T / T ≤ ∆_i M_i(σ_i^*) √|A_i| / √T. [sent-110, score-0.337]

45 Theorem 4 states that player i’s regret is equal to the weighted sum of player i’s counterfactual regrets at each I ∈ I_i, where the weights are equal to player i’s probability of reaching I under σ_i^*. [sent-116, score-2.105]

46 Since our goal is to minimize average regret, this means that we only need to minimize the average cumulative counterfactual regret at each I ∈ I_i that σ_i^* plays to reach. [sent-117, score-0.536]
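
Reading sentences 42-46 together, Theorem 4 appears to state the following equality; the exact notation here is our reconstruction rather than a quote from the paper.

```latex
\[
  R_i^T \;=\; \sum_{I \in \mathcal{I}_i} \pi_i^{\sigma_i^*}(I)\; R_i^T\!\left(I, \sigma_i^*\right).
\]
```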

47 4 Average Strategy Sampling Leveraging the theory developed in the previous section, we now introduce a new MCCFR sampling algorithm that can minimize average regret at a faster rate than CS, ES, and OS. [sent-120, score-0.295]

48 This means that player i’s average strategy, σ̄_i, converges to a best response of σ_{−i}. [sent-125, score-0.599]

49 Our new sampling algorithm, Average Strategy Sampling (AS), selects actions for player i according to the cumulative profile and three predefined parameters. [sent-127, score-0.774]

50 AS can be seen as a sampling scheme between OS and ES where a subset of player i’s actions are sampled at each information set I, as opposed to sampling one action (OS) or sampling every action (ES). [sent-128, score-1.006]

51 As in ES, at opponent and chance nodes, a single action is sampled on-policy according to the current opponent profile σ^t_{−i} and the fixed chance probabilities σ_c, respectively. [sent-130, score-0.485]

52 Since player i’s average strategy σ̄_i^T is not a good approximation of σ_i^* for small T, we include β to avoid making ill-informed choices early on. [sent-135, score-0.637]

53 Firstly, if we have reached a terminal node, we return the utility scaled by 1/q (line 5), where q = q_i(z) is the probability of sampling z contributed by player i’s actions. [sent-140, score-0.65]

54 Secondly, when at a chance node, we sample a single action according to σc and recurse down that action (line 6). [sent-141, score-0.287]

55 Thirdly, at an opponent’s choice node (lines 8 to 11), we again sample a single action and recurse, this time according to the opponent’s current strategy obtained via regret matching (equation (1)). [sent-142, score-0.343]

56 The final case in Algorithm 1 handles choice nodes for player i (lines 7 to 17). [sent-145, score-0.601]

57 If we do sample a, then we recurse to obtain the sampled counterfactual value ṽ(a) = ṽ_i(I, σ^t_{(I→a)}) (line 14). [sent-147, score-0.296]

58 Finally, we update the regrets at I (line 16) and return the sampled counterfactual value at I, Σ_{a∈A(I)} σ(I, a) ṽ(a) = ṽ_i(I, σ^t). [sent-148, score-0.325]
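
Sentences 50-58 describe Algorithm 1 in prose; the sketch below reconstructs the player-i case only. The sampling rule ρ(a) = max(ε, (β + τ·s(I,a)) / (β + Σ_b s(I,b))), the data structures, and the helpers walk_tree and child are our assumptions, not the authors' code; incrementing the cumulative profile s and the 1/q scaling at terminal nodes (line 5) are assumed to happen elsewhere in the traversal.

```python
import random

def as_player_node(I, actions, q, cum_regret, cum_profile, walk_tree, child,
                   eps, tau, beta):
    """Hedged sketch of AS at one of player i's choice nodes (lines 7-17 in the text).

    q is the probability of player i's sampled actions so far; terminal utilities are
    scaled by 1/q at the leaves, so unsampled actions simply contribute 0 here.
    """
    # Current strategy at I via regret matching (equation (1) in the paper).
    positive = {a: max(r, 0.0) for a, r in cum_regret[I].items()}
    total = sum(positive.values())
    sigma = ({a: r / total for a, r in positive.items()} if total > 0.0
             else {a: 1.0 / len(actions) for a in actions})

    s = cum_profile[I]                        # cumulative (average-strategy) weights at I
    denom = beta + sum(s.values())
    v = {a: 0.0 for a in actions}
    for a in actions:
        rho = min(1.0, max(eps, (beta + tau * s[a]) / denom))
        if random.random() < rho:             # sample a subset of player i's actions
            v[a] = walk_tree(child(I, a), q * rho)
    v_I = sum(sigma[a] * v[a] for a in actions)    # sampled counterfactual value at I
    for a in actions:
        cum_regret[I][a] += v[a] - v_I             # regret update (line 16)
    return v_I
```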

59 However, for CS and ES, δ = 1 since all of player i’s actions are sampled, whereas δ ≤ 1 for OS and AS. [sent-152, score-0.674]

60 While this suggests that fewer iterations of CS or ES are required to achieve the same regret bound compared to OS and AS, iterations for OS and AS are faster as they traverse less of the game tree. [sent-153, score-0.464]

61 While AS can be applied to any extensive game, the aim of AS is to provide faster convergence rates in games involving many player actions. [sent-156, score-0.778]

62 The two-player poker game we consider here, which we call 2-NL Hold’em(k), is inspired by no-limit Texas Hold’em. [sent-159, score-0.36]

63 To begin play, the player denoted as the dealer posts a small blind of one chip and the other player posts a big blind of two chips. [sent-162, score-1.126]

64 Each player is then dealt two private cards from a standard 52-card deck and the first betting round begins. [sent-163, score-0.746]

65 During each betting round, players can either fold (forfeit the game), call (match the previous bet), or raise by any number of chips in their remaining stack (increase the previous bet), as long as the raise is at least as big as the previous bet. [sent-164, score-0.278]

66 If a player has no more chips left after a call or a raise, that player is said to be all-in. [sent-166, score-1.133]

67 At the end of the second betting round, if neither player folded, then the player with the highest ranked five-card poker hand wins all of the chips played. [sent-167, score-1.326]

68 Note that the number of player actions in 2-NL Hold’em(k) at one information set is at most the starting stack size, k. [sent-168, score-0.674]
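
To make the "at most the starting stack size k" count concrete, here is a hypothetical enumeration of a player's options (our own illustration, not the authors' implementation; blind sizes and edge cases such as all-in calls are simplified):

```python
def nl_holdem_actions(remaining_stack, previous_bet):
    """Options in 2-NL Hold'em(k) at a betting decision: fold, call, or raise by any
    whole number of chips up to the remaining stack, where each raise must be at
    least as large as the previous bet. The action count grows roughly linearly in k."""
    actions = ["fold", "call"]
    for amount in range(max(previous_bet, 1), remaining_stack + 1):
        actions.append(("raise", amount))
    return actions

# e.g. 2-NL Hold'em(30) after the blinds: on the order of 30 distinct choices.
print(len(nl_holdem_actions(remaining_stack=28, previous_bet=2)))
```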

69 Bluff(D1, D2) [7], also known as Liar’s Dice, Perudo, and Dudo, is a two-player dice-bidding game played with six-sided dice over a number of rounds. [sent-171, score-0.317]

70 Then, players alternate by bidding a quantity q of a face value f of all dice in play until one player claims that the other is bluffing (i.e., calls bluff). [sent-174, score-0.765]

71 To place a new bid, a player must increase q or f of the current bid. [sent-177, score-0.547]

72 The player calling bluff wins the round if the opponent’s last bid is incorrect, and loses otherwise. [sent-179, score-0.823]

73 The losing player removes one of their dice from the game and a new round begins. [sent-180, score-0.905]

74 Once a player has no more dice left, that player loses the game and receives a utility of −1, while the winning player earns +1 utility. [sent-181, score-1.981]

75 The maximum number of player actions at an information set is 6(D1 + D2 ) + 1 as increasing Di allows both players to bid higher quantities q. [sent-182, score-0.835]
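
The 6(D1 + D2) + 1 count in sentence 75 comes from the bid grammar; a small illustration (our own, ignoring which actions are legal given the current bid):

```python
def bluff_action_set(d1, d2):
    """All possible actions in Bluff(D1, D2): bids (quantity q, face f) with
    1 <= q <= D1 + D2 and f in {1, ..., 6}, plus the single "bluff" call."""
    bids = [(q, f) for q in range(1, d1 + d2 + 1) for f in range(1, 7)]
    return bids + ["bluff"]

assert len(bluff_action_set(2, 2)) == 6 * (2 + 2) + 1  # 25 actions in Bluff(2, 2)
```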

76 In poker, a common approach is to create an abstract game by merging similar card dealings together into a single chance action or “bucket” [4]. [sent-186, score-0.431]

77 To keep the size of our games manageable, we employ a five-bucket abstraction that reduces the branching factor at each chance node down to five, where dealings are grouped according to expected hand strength squared as described by Zinkevich et al. [sent-187, score-0.309]

78 Figure 1a shows the exploitability in the five-bucket abstract game, measured in milli-big-blinds per game (mbb/g), of the profile produced by AS after 10^12 nodes visited. [sent-192, score-0.513]

79 Figure 1b shows the abstract game exploitability over the number of nodes visited by AS in 2-NL Hold’em(30), where again each data point is averaged over five runs. [sent-199, score-0.591]

80 [Figure 1 plot area; axis ticks and curve labels removed. Panel (a) plots abstract game exploitability (mbb/g) against β; panel (b) plots exploitability against nodes visited with curves for τ = 10^1 through 10^6. Panel parameters: (a) τ = 1000; (b) ε = 0.05, β = 10^6.] [sent-215]

81 Figure 1: (a) Abstract game exploitability of AS profiles for τ = 1000 after 10^12 nodes visited in 2-NL Hold’em(30). [sent-216]

82 (b) Log-log plot of abstract game exploitability over the number of nodes visited by AS with ε = 0.05 and β = 10^6. [sent-217]

83 [9], our OS implementation is ε-greedy so that the current player i samples a single action at random with probability ε = 0. [sent-224]

84 For each game, we ran each of CS, ES, OS, and AS five times, measured the abstract game exploitability at a number of checkpoints, and averaged the results. [sent-231, score-0.459]

85 Figure 2a displays the results for 2-NL Hold’em(36), a game with approximately 68 million information sets and 5 billion histories (nodes). [sent-232, score-0.309]

86 In addition, Figure 2b shows the average exploitability in each of the eleven games after approximately 3.16 × 10^12 nodes visited. [sent-234]

87 Since the maximal utility grows with the starting stack k (i.e., ∆_i becomes larger), we “normalized” exploitability across each game by dividing the units on the y-axis by k. [sent-239]

88 While there is little difference between the algorithms for the smaller 20 and 22 chip games, we see a significant benefit to using AS over CS and ES for the larger games that contain many player actions. [sent-240, score-0.764]

89 We have provided new, tighter bounds on the average regret when using vanilla CFR or one of several different MCCFR sampling algorithms. [sent-248, score-0.359]

90 These bounds were derived by showing that a player’s regret is equal to a weighted sum of the player’s cumulative counterfactual regrets (Theorem 4), where the weights are given by a best response to the opponents’ previous sequence of strategies. [sent-249, score-0.543]

91 [Figure 2 y-axis labels (plot area removed): “Abstract game exploitability (mbb/g)” for panel (a) and “Abstract game exploitability (mbb/g) / k” for panel (b).] [sent-252]

92 Figure 2: (a) Log-log plot of abstract game exploitability over the number of nodes visited by CS, ES, OS, and AS in 2-NL Hold’em(36). [sent-262]

93 (b) Abstract game exploitability after 3.16 × 10^12 nodes visited, over the game size for 2-NL Hold’em(k) with even-sized starting stacks k between 20 and 40 chips. [sent-265]

94 convergence rates in games containing many player actions. [sent-274, score-0.732]

95 For future work, we would like to apply AS to games with many player actions and with more than two players. [sent-276, score-0.859]

96 All of our theory still applies, except that player i’s average strategy is no longer guaranteed to converge to σ_i^*. [sent-277, score-0.637]

97 Using counterfactual regret minimization to create competitive multiplayer poker agents. [sent-282, score-0.56]

98 A competitive Texas Hold’em poker player via automated abstraction and real-time equilibrium computation. [sent-291, score-0.736]

99 Monte Carlo sampling for regret minimization in extensive games. [sent-306, score-0.283]

100 Monte Carlo sampling for regret minimization in extensive games. [sent-309, score-0.283]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('player', 0.547), ('mccfr', 0.253), ('game', 0.242), ('os', 0.24), ('cfr', 0.236), ('exploitability', 0.217), ('counterfactual', 0.202), ('regret', 0.197), ('games', 0.185), ('bluff', 0.169), ('cs', 0.161), ('actions', 0.127), ('pro', 0.124), ('poker', 0.118), ('players', 0.118), ('opponent', 0.105), ('lanctot', 0.096), ('action', 0.086), ('le', 0.079), ('chance', 0.079), ('visited', 0.078), ('betting', 0.075), ('dice', 0.075), ('ri', 0.073), ('walktree', 0.072), ('em', 0.072), ('nash', 0.069), ('histories', 0.067), ('regrets', 0.065), ('zinkevich', 0.065), ('ui', 0.063), ('strategy', 0.06), ('cumulative', 0.057), ('vanilla', 0.055), ('nodes', 0.054), ('hold', 0.054), ('equilibrium', 0.05), ('exploitable', 0.048), ('mi', 0.045), ('alberta', 0.044), ('bid', 0.043), ('sampling', 0.043), ('round', 0.041), ('bi', 0.039), ('duane', 0.039), ('opponents', 0.039), ('cards', 0.039), ('chips', 0.039), ('recurse', 0.036), ('tighter', 0.034), ('terminal', 0.034), ('history', 0.033), ('es', 0.032), ('chip', 0.032), ('sampled', 0.031), ('ii', 0.031), ('st', 0.03), ('average', 0.03), ('gibson', 0.029), ('qj', 0.029), ('vi', 0.027), ('strategies', 0.027), ('qk', 0.026), ('qi', 0.026), ('faster', 0.025), ('play', 0.025), ('stacks', 0.024), ('marc', 0.024), ('dealings', 0.024), ('szafron', 0.024), ('firstly', 0.024), ('raise', 0.023), ('private', 0.023), ('loses', 0.023), ('michael', 0.023), ('tree', 0.023), ('minimization', 0.022), ('response', 0.022), ('portion', 0.021), ('multiplayer', 0.021), ('burch', 0.021), ('carmelo', 0.021), ('dealt', 0.021), ('gilpin', 0.021), ('johanson', 0.021), ('waugh', 0.021), ('ve', 0.021), ('carlo', 0.021), ('abstraction', 0.021), ('ha', 0.021), ('extensive', 0.021), ('plays', 0.02), ('theorem', 0.02), ('monte', 0.02), ('equilibria', 0.02), ('minz', 0.02), ('bonus', 0.02), ('bet', 0.02), ('eleven', 0.02), ('tuomas', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions

Author: Neil Burch, Marc Lanctot, Duane Szafron, Richard G. Gibson

Abstract: Counterfactual Regret Minimization (CFR) is a popular, iterative algorithm for computing strategies in extensive-form games. The Monte Carlo CFR (MCCFR) variants reduce the per iteration time cost of CFR by traversing a smaller, sampled portion of the tree. The previous most effective instances of MCCFR can still be very slow in games with many player actions since they sample every action for a given player. In this paper, we present a new MCCFR algorithm, Average Strategy Sampling (AS), that samples a subset of the player’s actions according to the player’s average strategy. Our new algorithm is inspired by a new, tighter bound on the number of iterations required by CFR to converge to a given solution quality. In addition, we prove a similar, tighter bound for AS and other popular MCCFR variants. Finally, we validate our work by demonstrating that AS converges faster than previous MCCFR algorithms in both no-limit poker and Bluff. 1

2 0.26929462 348 nips-2012-Tractable Objectives for Robust Policy Optimization

Author: Katherine Chen, Michael Bowling

Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1

3 0.10968255 273 nips-2012-Predicting Action Content On-Line and in Real Time before Action Onset – an Intracranial Human Study

Author: Uri Maoz, Shengxuan Ye, Ian Ross, Adam Mamelak, Christof Koch

Abstract: The ability to predict action content from neural signals in real time before the action occurs has been long sought in the neuroscientific study of decision-making, agency and volition. On-line real-time (ORT) prediction is important for understanding the relation between neural correlates of decision-making and conscious, voluntary action as well as for brain-machine interfaces. Here, epilepsy patients, implanted with intracranial depth microelectrodes or subdural grid electrodes for clinical purposes, participated in a “matching-pennies” game against an opponent. In each trial, subjects were given a 5 s countdown, after which they had to raise their left or right hand immediately as the “go” signal appeared on a computer screen. They won a fixed amount of money if they raised a different hand than their opponent and lost that amount otherwise. The question we here studied was the extent to which neural precursors of the subjects’ decisions can be detected in intracranial local field potentials (LFP) prior to the onset of the action. We found that combined low-frequency (0.1–5 Hz) LFP signals from 10 electrodes were predictive of the intended left-/right-hand movements before the onset of the go signal. Our ORT system predicted which hand the patient would raise 0.5 s before the go signal with 68±3% accuracy in two patients. Based on these results, we constructed an ORT system that tracked up to 30 electrodes simultaneously, and tested it on retrospective data from 7 patients. On average, we could predict the correct hand choice in 83% of the trials, which rose to 92% if we let the system drop 3/10 of the trials on which it was less confident. Our system demonstrates— for the first time—the feasibility of accurately predicting a binary action on single trials in real time for patients with intracranial recordings, well before the action occurs. 1 1

4 0.10227117 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

Author: Brendan Mcmahan, Matthew Streeter

Abstract: Some of the most compelling applications of online convex optimization, including online prediction and classification, are unconstrained: the natural feasible set is R^n. Existing algorithms fail to achieve sub-linear regret in this setting unless constraints on the comparator point x̊ are known in advance. We present algorithms that, without such prior knowledge, offer near-optimal regret bounds with respect to any choice of x̊. In particular, regret with respect to x̊ = 0 is constant. We then prove lower bounds showing that our guarantees are near-optimal in this setting. 1

5 0.098694146 295 nips-2012-Risk-Aversion in Multi-armed Bandits

Author: Amir Sani, Alessandro Lazaric, Rémi Munos

Abstract: Stochastic multi–armed bandits solve the Exploration–Exploitation dilemma and ultimately maximize the expected reward. Nonetheless, in many practical problems, maximizing the expected reward is not the most desirable objective. In this paper, we introduce a novel setting based on the principle of risk–aversion where the objective is to compete against the arm with the best risk–return trade–off. This setting proves to be more difficult than the standard multi-arm bandit setting due in part to an exploration risk which introduces a regret associated to the variability of an algorithm. Using variance as a measure of risk, we define two algorithms, investigate their theoretical guarantees, and report preliminary empirical results. 1

6 0.088897631 45 nips-2012-Approximating Equilibria in Sequential Auctions with Incomplete Information and Multi-Unit Demand

7 0.088073879 313 nips-2012-Sketch-Based Linear Value Function Approximation

8 0.083363302 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

9 0.08177951 165 nips-2012-Iterative ranking from pair-wise comparisons

10 0.073814429 343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability

11 0.07376381 293 nips-2012-Relax and Randomize : From Value to Algorithms

12 0.070026457 102 nips-2012-Distributed Non-Stochastic Experts

13 0.069859006 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

14 0.06675294 344 nips-2012-Timely Object Recognition

15 0.066690467 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

16 0.063132338 304 nips-2012-Selecting Diverse Features via Spectral Regularization

17 0.063098326 194 nips-2012-Learning to Discover Social Circles in Ego Networks

18 0.060742982 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

19 0.05643677 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

20 0.053083681 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.115), (1, -0.134), (2, 0.019), (3, -0.007), (4, -0.0), (5, -0.007), (6, -0.011), (7, 0.019), (8, -0.031), (9, 0.055), (10, 0.035), (11, 0.033), (12, -0.08), (13, -0.049), (14, -0.079), (15, 0.068), (16, -0.039), (17, -0.008), (18, -0.069), (19, -0.059), (20, -0.001), (21, -0.004), (22, -0.01), (23, -0.039), (24, 0.105), (25, 0.013), (26, 0.029), (27, 0.02), (28, 0.017), (29, 0.15), (30, -0.073), (31, -0.049), (32, -0.133), (33, 0.116), (34, -0.05), (35, 0.083), (36, 0.004), (37, 0.038), (38, -0.058), (39, 0.105), (40, 0.073), (41, -0.124), (42, -0.009), (43, -0.095), (44, -0.181), (45, 0.054), (46, -0.074), (47, -0.105), (48, 0.065), (49, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96086329 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions

Author: Neil Burch, Marc Lanctot, Duane Szafron, Richard G. Gibson

Abstract: Counterfactual Regret Minimization (CFR) is a popular, iterative algorithm for computing strategies in extensive-form games. The Monte Carlo CFR (MCCFR) variants reduce the per iteration time cost of CFR by traversing a smaller, sampled portion of the tree. The previous most effective instances of MCCFR can still be very slow in games with many player actions since they sample every action for a given player. In this paper, we present a new MCCFR algorithm, Average Strategy Sampling (AS), that samples a subset of the player’s actions according to the player’s average strategy. Our new algorithm is inspired by a new, tighter bound on the number of iterations required by CFR to converge to a given solution quality. In addition, we prove a similar, tighter bound for AS and other popular MCCFR variants. Finally, we validate our work by demonstrating that AS converges faster than previous MCCFR algorithms in both no-limit poker and Bluff. 1

2 0.58993351 45 nips-2012-Approximating Equilibria in Sequential Auctions with Incomplete Information and Multi-Unit Demand

Author: Amy Greenwald, Jiacui Li, Eric Sodomka

Abstract: In many large economic markets, goods are sold through sequential auctions. Examples include eBay, online ad auctions, wireless spectrum auctions, and the Dutch flower auctions. In this paper, we combine methods from game theory and decision theory to search for approximate equilibria in sequential auction domains, in which bidders do not know their opponents’ values for goods, bidders only partially observe the actions of their opponents’, and bidders demand multiple goods. We restrict attention to two-phased strategies: first predict (i.e., learn); second, optimize. We use best-reply dynamics [4] for prediction (i.e., to predict other bidders’ strategies), and then assuming fixed other-bidder strategies, we estimate and solve the ensuing Markov decision processes (MDP) [18] for optimization. We exploit auction properties to represent the MDP in a more compact state space, and we use Monte Carlo simulation to make estimating the MDP tractable. We show how equilibria found using our search procedure compare to known equilibria for simpler auction domains, and we approximate an equilibrium for a more complex auction domain where analytical solutions are unknown. 1

3 0.57208604 348 nips-2012-Tractable Objectives for Robust Policy Optimization

Author: Katherine Chen, Michael Bowling

Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1

4 0.52029479 102 nips-2012-Distributed Non-Stochastic Experts

Author: Varun Kanade, Zhenming Liu, Bozidar Radunovic

Abstract: We consider the online distributed non-stochastic experts problem, where the distributed system consists of one coordinator node that is connected to k sites, and the sites are required to communicate with each other via the coordinator. At each time-step t, one of the k site nodes has to pick an expert from the set {1, . . . , n}, and the same site receives information about payoffs of all experts for that round. The goal of the distributed system is to minimize regret at time horizon T, while simultaneously keeping communication to a minimum. The two extreme solutions to this problem are: (i) Full communication: This essentially simulates the non-distributed setting to obtain the optimal O(√(log(n)T)) regret bound at the cost of T communication. (ii) No communication: Each site runs an independent copy – the regret is O(√(log(n)kT)) and the communication is 0. This paper shows the difficulty of simultaneously achieving regret asymptotically better than √(kT) and communication better than T. We give a novel algorithm that for an oblivious adversary achieves a non-trivial trade-off: regret O(√(k^(5(1+ε)/6) T)) and communication O(T/k^ε), for any value of ε ∈ (0, 1/5). We also consider a variant of the model, where the coordinator picks the expert. In this model, we show that the label-efficient forecaster of Cesa-Bianchi et al. (2005) already gives us a strategy that is near optimal in the regret vs communication trade-off. 1

5 0.50612628 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

Author: Teodor M. Moldovan, Pieter Abbeel

Abstract: The expected return is a widely used objective in decision making under uncertainty. Many algorithms, such as value iteration, have been proposed to optimize it. In risk-aware settings, however, the expected return is often not an appropriate objective to optimize. We propose a new optimization objective for risk-aware planning and show that it has desirable theoretical properties. We also draw connections to previously proposed objectives for risk-aware planing: minmax, exponential utility, percentile and mean minus variance. Our method applies to an extended class of Markov decision processes: we allow costs to be stochastic as long as they are bounded. Additionally, we present an efficient algorithm for optimizing the proposed objective. Synthetic and real-world experiments illustrate the effectiveness of our method, at scale. 1

6 0.47288388 273 nips-2012-Predicting Action Content On-Line and in Real Time before Action Onset – an Intracranial Human Study

7 0.46074545 295 nips-2012-Risk-Aversion in Multi-armed Bandits

8 0.44534111 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes

9 0.44430298 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

10 0.44190165 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

11 0.42881697 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

12 0.40682113 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

13 0.3979049 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

14 0.34183139 61 nips-2012-Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

15 0.3412044 50 nips-2012-Bandit Algorithms boost Brain Computer Interfaces for motor-task selection of a brain-controlled button

16 0.33533403 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

17 0.32223293 343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability

18 0.32044914 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

19 0.30843708 161 nips-2012-Interpreting prediction markets: a stochastic approach

20 0.30659521 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.026), (21, 0.011), (27, 0.275), (38, 0.086), (42, 0.019), (45, 0.016), (46, 0.014), (53, 0.017), (54, 0.052), (55, 0.017), (59, 0.064), (69, 0.012), (74, 0.033), (76, 0.106), (80, 0.076), (92, 0.065)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.81725836 39 nips-2012-Analog readout for optical reservoir computers

Author: Anteo Smerieri, François Duport, Yvon Paquot, Benjamin Schrauwen, Marc Haelterman, Serge Massar

Abstract: Reservoir computing is a new, powerful and flexible machine learning technique that is easily implemented in hardware. Recently, by using a timemultiplexed architecture, hardware reservoir computers have reached performance comparable to digital implementations. Operating speeds allowing for real time information operation have been reached using optoelectronic systems. At present the main performance bottleneck is the readout layer which uses slow, digital postprocessing. We have designed an analog readout suitable for time-multiplexed optoelectronic reservoir computers, capable of working in real time. The readout has been built and tested experimentally on a standard benchmark task. Its performance is better than non-reservoir methods, with ample room for further improvement. The present work thereby overcomes one of the major limitations for the future development of hardware reservoir computers. 1

same-paper 2 0.78270531 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions

Author: Neil Burch, Marc Lanctot, Duane Szafron, Richard G. Gibson

Abstract: Counterfactual Regret Minimization (CFR) is a popular, iterative algorithm for computing strategies in extensive-form games. The Monte Carlo CFR (MCCFR) variants reduce the per iteration time cost of CFR by traversing a smaller, sampled portion of the tree. The previous most effective instances of MCCFR can still be very slow in games with many player actions since they sample every action for a given player. In this paper, we present a new MCCFR algorithm, Average Strategy Sampling (AS), that samples a subset of the player’s actions according to the player’s average strategy. Our new algorithm is inspired by a new, tighter bound on the number of iterations required by CFR to converge to a given solution quality. In addition, we prove a similar, tighter bound for AS and other popular MCCFR variants. Finally, we validate our work by demonstrating that AS converges faster than previous MCCFR algorithms in both no-limit poker and Bluff. 1

3 0.68014848 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

Author: Liva Ralaivola

Abstract: This paper provides the first —to the best of our knowledge— analysis of online learning algorithms for multiclass problems when the confusion matrix is taken as a performance measure. The work builds upon recent and elegant results on noncommutative concentration inequalities, i.e. concentration inequalities that apply to matrices, and, more precisely, to matrix martingales. We do establish generalization bounds for online learning algorithms and show how the theoretical study motivates the proposition of a new confusion-friendly learning procedure. This learning algorithm, called COPA (for COnfusion Passive-Aggressive) is a passive-aggressive learning algorithm; it is shown that the update equations for COPA can be computed analytically and, henceforth, there is no need to recourse to any optimization package to implement it. 1

4 0.62695324 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

Author: Xianghang Liu, James Petterson, Tibério S. Caetano

Abstract: We present a new formulation for binary classification. Instead of relying on convex losses and regularizers such as in SVMs, logistic regression and boosting, or instead non-convex but continuous formulations such as those encountered in neural networks and deep belief networks, our framework entails a non-convex but discrete formulation, where estimation amounts to finding a MAP configuration in a graphical model whose potential functions are low-dimensional discrete surrogates for the misclassification loss. We argue that such a discrete formulation can naturally account for a number of issues that are typically encountered in either the convex or the continuous non-convex approaches, or both. By reducing the learning problem to a MAP inference problem, we can immediately translate the guarantees available for many inference settings to the learning problem itself. We empirically demonstrate in a number of experiments that this approach is promising in dealing with issues such as severe label noise, while still having global optimality guarantees. Due to the discrete nature of the formulation, it also allows for direct regularization through cardinality-based penalties, such as the 0 pseudo-norm, thus providing the ability to perform feature selection and trade-off interpretability and predictability in a principled manner. We also outline a number of open problems arising from the formulation. 1

5 0.61374182 32 nips-2012-Active Comparison of Prediction Models

Author: Christoph Sawade, Niels Landwehr, Tobias Scheffer

Abstract: We address the problem of comparing the risks of two given predictive models—for instance, a baseline model and a challenger—as confidently as possible on a fixed labeling budget. This problem occurs whenever models cannot be compared on held-out training data, possibly because the training data are unavailable or do not reflect the desired test distribution. In this case, new test instances have to be drawn and labeled at a cost. We devise an active comparison method that selects instances according to an instrumental sampling distribution. We derive the sampling distribution that maximizes the power of a statistical test applied to the observed empirical risks, and thereby minimizes the likelihood of choosing the inferior model. Empirically, we investigate model selection problems on several classification and regression tasks and study the accuracy of the resulting p-values. 1

6 0.56273335 348 nips-2012-Tractable Objectives for Robust Policy Optimization

7 0.54688644 253 nips-2012-On Triangular versus Edge Representations --- Towards Scalable Modeling of Networks

8 0.53151315 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

9 0.52652144 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification

10 0.52552408 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

11 0.52427089 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

12 0.52321136 292 nips-2012-Regularized Off-Policy TD-Learning

13 0.52308172 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

14 0.52085942 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

15 0.52014917 197 nips-2012-Learning with Recursive Perceptual Representations

16 0.51790929 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

17 0.51748055 294 nips-2012-Repulsive Mixtures

18 0.51725912 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

19 0.51577544 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

20 0.51438123 65 nips-2012-Cardinality Restricted Boltzmann Machines