nips nips2012 nips2012-348 knowledge-graph by maker-knowledge-mining

348 nips-2012-Tractable Objectives for Robust Policy Optimization


Source: pdf

Author: Katherine Chen, Michael Bowling

Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract. Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. [sent-4, score-0.375]

2 When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. [sent-5, score-0.634]

3 We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. [sent-9, score-1.048]

4 We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. [sent-10, score-0.355]

5 Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. [sent-11, score-0.446]

6 , rewards and transition probabilities) learned from data, and then find an optimal policy: a sequence of actions that would maximize expected cumulative reward in that MDP. [sent-15, score-0.414]

7 However, optimal policies are sensitive to the estimated reward and transition parameters. [sent-16, score-0.448]

8 The policy that maximizes expected utility under a single estimated model, or even averaged over a distribution of models, may still result in poor outcomes for a substantial minority of patients. [sent-20, score-0.436]

9 What is called for is a policy that is more robust to the uncertainties of individual patients. [sent-21, score-0.392]

10 There are two main approaches for finding robust policies in MDPs with parameter uncertainty. [sent-22, score-0.319]

11 The first assumes rewards and transitions belong to a known and compact uncertainty set, which also includes a single nominal parameter setting that is thought most likely to occur [19]. [sent-23, score-0.331]

12 Robustness, in this context, is a policy’s performance under worst-case parameter realizations from the set and is something one must trade-off against how well a policy performs under the nominal parameters. [sent-24, score-0.428]

13 In many cases, the robust policies found are overly conservative because they do not take into account how likely it is for an agent to encounter worst-case parameters. [sent-25, score-0.293]

14 The second approach takes a Bayesian perspective on parameter uncertainty, where a prior distribution over the parameter values is assumed to be given, with a goal to optimize the performance for a particular percentile [4]. [sent-26, score-0.595]

15 In fact, percentile optimization with general parameter uncertainty is NP-hard [3]. [sent-30, score-0.777]

16 We introduce a generalization of percentile optimization with objectives defined by a measure over percentiles instead of a single percentile. [sent-33, score-0.922]

17 This family of objectives subsumes tractable objectives such as optimizing for expected value, worst-case, or Conditional Value-at-Risk; as well as intractable objectives such as optimizing for a single specific percentile (percentile optimization or Value-at-Risk). [sent-34, score-1.062]

18 We then introduce a particular family of percentile measures, which can be efficiently approximated. [sent-35, score-0.575]

19 We show this by framing the problem as a two-player, zero-sum, extensive-form game, and then employing a form of counterfactual regret minimization to find near-optimal policies in time polynomial in the number of states and actions in the MDP. [sent-36, score-0.481]

20 We give a further generalization of this family by proving a general, but sufficient, condition under which percentile measures admit efficient optimization. [sent-37, score-0.68]

21 Finally, we empirically demonstrate our algorithm on a synthetic uncertain MDP setting inspired by finding robust policies for diabetes management. [sent-38, score-0.401]

22 In section 3, we show that many of the objectives described here are special cases of percentile measures. [sent-40, score-0.656]

23 For a fixed MDP M, the objective is to compute a policy π that maximizes expected cumulative reward, V_M^π = E[ Σ_{t=0}^{H} R(s_t, a_t) | M, s_0 ∼ P(s_0), π ] (1). For a fixed MDP, the set of Markov random policies (in fact, Markov deterministic policies) contains a maximizing policy. [sent-54, score-0.663]

24 This is called the optimal policy for the fixed MDP: π* = argmax_{π ∈ Π^MR} V_M^π. [sent-55, score-0.332]
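
For concreteness, the following is a minimal sketch of finite-horizon value iteration for a single fixed MDP, matching equation (1) and the argmax in sentence 24. The tabular arrays R and P and the function name are illustrative assumptions, not code from the paper.

```python
import numpy as np

def optimal_policy_fixed_mdp(R, P, H):
    """Finite-horizon value iteration for one fixed MDP.

    R: (S, A) array, R[s, a] = immediate reward.
    P: (S, A, S) array, P[s, a, s'] = transition probability.
    H: horizon; decisions are taken at t = 0, ..., H as in equation (1).
    Returns (V, pi): V[t, s] is the optimal value-to-go and pi[t, s] the
    optimal (Markov deterministic) action at timestep t in state s.
    """
    S, A = R.shape
    V = np.zeros((H + 2, S))             # value after the final decision is 0
    pi = np.zeros((H + 1, S), dtype=int)
    for t in reversed(range(H + 1)):
        Q = R + P @ V[t + 1]             # Q[s, a] = reward + E[value of next state]
        pi[t] = Q.argmax(axis=1)
        V[t] = Q.max(axis=1)
    return V, pi                         # V[0] averaged over P(s0) gives the objective in (1)
```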

25 The form of this uncertainty and associated optimization objectives has been the topic of a number of papers. [sent-61, score-0.321]

26 One formulation for parameter uncertainty assumes that the parameters are taken from uncertainty sets R ∈ R and P ∈ P [12]. [sent-63, score-0.356]

27 In the robust MDP approach the desired policy maximizes performance in the worst-case parameters of the uncertainty sets: π* = argmax_π min_{R∈R, P∈P} V_M^π (2). The robust MDP objective has been criticized for being overly-conservative as it focuses entirely on the worst-case [19]. [sent-64, score-0.688]

28 Xu and Mannor [20] propose a further alternative by placing parameter realizations into nested uncertainty sets, each associated with a probability of drawing a parameter realization from the set. [sent-68, score-0.274]

29 They then propose a distributional robustness approach, which maximizes the expected performance over the worst-case distribution of parameters that satisfies the probability bounds on uncertainty sets. [sent-69, score-0.275]

30 A natural alternative is to look at percentile optimization [4]. [sent-76, score-0.586]

31 For a fixed η, the objective is to seek a policy that will maximize the performance on η percent of parameter realizations. [sent-77, score-0.358]

32 Formally, this results in the following optimization: π* = argmax_π max_{y ∈ ℝ} y subject to P_M[V_M^π ≥ y] ≥ η (3). The optimal policy π* guarantees the optimal value y* is achieved with probability η given the distribution over parameters P(R, P). [sent-78, score-0.361]
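
As one concrete reading of objective (3), the sketch below estimates by Monte Carlo the largest value y that a fixed policy attains on at least an η fraction of parameter realizations; sample_mdp and evaluate_policy are assumed helper functions (drawing (R, P) from P(R, P) and computing V_M^π as in equation (1)), not part of the paper.

```python
import numpy as np

def estimate_percentile_value(policy, sample_mdp, evaluate_policy, eta, num_samples=1000):
    """Monte Carlo estimate of max{ y : P_M[V_M^pi >= y] >= eta } from equation (3)
    for a *fixed* policy (the hard part, optimizing over policies, is what the
    paper notes is NP-hard in general)."""
    values = np.array([evaluate_policy(policy, sample_mdp()) for _ in range(num_samples)])
    # y is the empirical (1 - eta)-quantile: at least an eta fraction of sampled
    # MDPs give the policy a value of y or more.
    return np.quantile(values, 1.0 - eta)
```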

33 Delage and Mannor showed that for general reward and/or transition uncertainty, percentile optimization is NP-hard (even for a small fixed horizon) [3]. [sent-79, score-0.801]

34 , given enough observations), optimizing an approximation of the expected performance over the parameters approximately optimizes for percentile performance [4]. [sent-83, score-0.593]

35 Value-at-Risk is equivalent to percentile optimization and is intractable for general forms of parameter uncertainty. [sent-86, score-0.612]

36 In section 3 we show that CVaR is also encompassed by percentile measures. [sent-90, score-0.543]
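
For reference, the standard definitions of VaR and CVaR in terms of percentiles of V_M^π are sketched below; conventions for η vary between papers, so this block is a plausible reading of sentences 35-36 rather than the paper's exact notation.

```latex
% Value-at-Risk at confidence level eta: the same quantity as objective (3).
\mathrm{VaR}_{\eta}(\pi) \;=\; \max\bigl\{\, y \;:\; \Pr_{M \sim \mathcal{P}(M)}\bigl[\, V_M^{\pi} \ge y \,\bigr] \ge \eta \,\bigr\}

% Conditional Value-at-Risk: the expected value over the unfavourable tail,
% i.e. the MDPs on which the policy does no better than its VaR.
\mathrm{CVaR}_{\eta}(\pi) \;=\; \mathbb{E}_{M}\bigl[\, V_M^{\pi} \;\bigm|\; V_M^{\pi} \le \mathrm{VaR}_{\eta}(\pi) \,\bigr]
```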

37 A common requirement, for example, is that the uncertainty between states is uncoupled or independent; or that reward and transition uncertainty themselves are uncoupled or independent. [sent-93, score-0.661]

38 The Delage and Mannor work on percentile optimization [4] makes the more natural assumption that the uncertain parameters are stationary, but in turn requires very specific choices for the uncertainty distributions themselves. [sent-98, score-0.803]

39 Figure 1: Examples of percentile measures (panel (c): k-of-N). [sent-117, score-0.543]

40 We begin by delineating a family of objectives for robust policy optimization, which generalizes the concept of percentile optimization. [sent-119, score-1.08]

41 While percentile optimization is already known to be NP-hard, in section 4, we will restrict our focus to a subclass of our family that does admit efficient algorithms. [sent-120, score-0.674]

42 Rather than seeking to maximize one specific percentile of MDPs, our family of objectives maximizes an integral of a policy’s performance over all percentiles η ∈ [0, 1] of MDPs M as weighted by a percentile measure µ. [sent-121, score-1.496]
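
Sentence 42 can be written compactly as follows; here V_π(η) denotes the η-percentile of V_M^π under M ∼ P(M), and the notation is a restatement of the text rather than a quote of the paper's own formula.

```latex
% The mu-robust objective: integrate the policy's percentile performance
% against a percentile measure mu on [0, 1].
\pi^{*} \;=\; \operatorname*{argmax}_{\pi} \int_{0}^{1} V_{\pi}(\eta)\, \mathrm{d}\mu(\eta)

% Special cases (sentences 43-46): mu uniform on [0,1] gives the expected value
% over P(M); mu = delta_0 gives the worst case; mu = delta_eta with eta > 0 gives
% VaR / percentile optimization.
```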

43 In fact, our distribution measures framework encompasses optimization objectives for the expected, robust, and percentile MDP problems as well as for VaR and CVaR. [sent-124, score-0.778]

44 , a uniform density over the unit interval), all percentiles are equally weighted and the µ-robust policy will optimize the expected cumulative reward V_M^π over the distribution P(M). [sent-127, score-0.762]

45 [10], where they concluded that the common approach of computing an optimal policy for the expected MDP, i.e. [sent-130, score-0.357]

46 For µ = δ_0.1, where δ_η is the Dirac delta at η, the optimization problem becomes identical to the VaR and percentile optimization problems where η = 0.1. [sent-134, score-0.629]

47 The measures for the 10th, 25th, and 40th percentiles are shown in figure 1a. [sent-136, score-0.275]

48 4 k-of-N Measures. There is little reason to restrict ourselves to percentile measures that put uniform weight on all percentiles, or Dirac deltas on the worst-case or specific percentiles. [sent-139, score-0.622]

49 One can imagine creating other density functions over percentiles, and not all of these percentile measures will necessarily be intractable like percentile optimization. [sent-140, score-1.213]

50 In this section we introduce a subclass of percentile measures, called k-of-N measures, and go on to show that we can efficiently approximate µ-robust policies for this entire subclass. [sent-141, score-0.806]

51 We start by imagining a sampling scheme for evaluating the robustness of a fixed policy π. [sent-142, score-0.375]

52 For each MDP we can evaluate the policy π and then rank the MDPs based on how much expected cumulative reward π attains on each. [sent-144, score-0.518]

53 If we choose to evaluate our policy based on the very worst of these MDPs, that is, the k = 1 of the N = 1000 MDPs, then we get a loose estimate of the percentile value of π in the neighborhood of the 1/1000th percentile for the distribution P(M). [sent-145, score-1.418]

54 We see that as N increases, the policy puts more weight on optimizing for lower percentiles of MDPs. [sent-149, score-0.553]
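
The sampling scheme in sentences 51-54 can be sketched directly; sample_mdp and evaluate_policy are assumed helpers, and a single call gives one noisy draw of the k-of-N evaluation of a fixed policy.

```python
import numpy as np

def k_of_n_value(policy, sample_mdp, evaluate_policy, k, N):
    """One Monte Carlo draw of the k-of-N evaluation of a fixed policy:
    sample N MDPs from P(M), evaluate the policy on each, and average its
    value over the k least favourable samples."""
    values = np.array([evaluate_policy(policy, sample_mdp()) for _ in range(N)])
    worst_k = np.sort(values)[:k]      # the k MDPs on which the policy does worst
    return worst_k.mean()              # average over repeated draws to reduce variance
```

With k = N this recovers the expected-value objective, and with k = 1 and large N it approaches the worst case, matching the smooth transition described in sentence 55.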

55 Thus we can smoothly transition from finding policies that perform well in expectation (no robustness) to policies that care almost only about worst-case performance (overly conservative robustness). [sent-150, score-0.551]

56 The densities themselves act as approximate step-functions whose weight falls off in the neighborhood of the percentile η = k/N . [sent-154, score-0.543]

57 4.1 k-of-N Game. Our sampling description of the k-of-N measure can be reframed as a two-player, zero-sum, extensive-form game with imperfect information, as shown in Figure 2. [sent-161, score-0.281]

58 Each node in the tree represents a game state or history labeled with the player whose turn it is to act, with each branch being a possible action. [sent-162, score-0.37]

59 In our game formulation, chance, denoted as player c, first selects N MDPs according to P(M). [sent-163, score-0.37]

60 The adversary, denoted as player 2, has only one decision in the game, which is to select a subset of k of the N MDPs sampled by chance.

61 Such histories are partitioned into one set, termed an information set, and the player’s policy must be identical for all histories in an information set. (Figure 2: k-of-N game tree.) [sent-169, score-0.575]

62 The decision maker now alternates turns with chance, observing states sampled by chance according to the chosen MDP’s transition function, but not ever observing the chosen MDP itself, i. [sent-170, score-0.3]

63 , histories with the same sequence of sampled states and chosen actions belong to the same information set for player 1. [sent-172, score-0.367]

64 After the horizon has been reached, the utility of the leaf node is just the sum of the immediate rewards of the decision maker’s actions according to the chosen MDP’s reward function. [sent-173, score-0.434]

65 The decision maker’s behavioral strategy in the game maps information sets of the game to a distribution over actions. [sent-174, score-0.454]

66 Since the only information is the observed state-action sequence, the strategy can be viewed as a policy in ΠHR (or possibly ΠM R , as we will discuss below). [sent-175, score-0.382]
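
A minimal data structure for the decision maker's behavioral strategy in sentences 65-66, keyed by the observed state-action history (the player's information set); the class and its defaults are illustrative assumptions, not code from the paper.

```python
from collections import defaultdict
import numpy as np

class TabularBehavioralStrategy:
    """Maps an information set -- here the observed state-action history,
    stored as a tuple -- to a probability distribution over actions.
    Unvisited information sets default to the uniform distribution."""

    def __init__(self, num_actions):
        self.num_actions = num_actions
        self.table = defaultdict(lambda: np.full(num_actions, 1.0 / num_actions))

    def action_probs(self, history):
        # history: tuple such as (s0, a0, s1, a1, ..., s_t)
        return self.table[tuple(history)]

    def sample_action(self, history, rng):
        return rng.choice(self.num_actions, p=self.action_probs(history))
```

Restricting the key to (timestep, current state) instead of the full history gives the Markovian policy class Π^MR discussed later for the reward-uncertainty case.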

67 Because the k-of-N game is zero-sum, a Nash equilibrium policy in the game is one that maximizes its expected utility against its best-response adversary. [sent-176, score-0.822]

68 The best-response adversary for any policy is the one that chooses the k least favorable MDPs for that policy. [sent-177, score-0.393]

69 Hence, a Nash equilibrium policy for the k-of-N game is a µk-of-N -robust policy. [sent-179, score-0.547]

70 Furthermore, an ε-Nash equilibrium policy is a 2ε-approximation of a µk-of-N-robust policy. [sent-180, score-0.376]

71 While player one’s strategy is tractable (the size of a policy in the underlying MDP), player two’s strategy involves decisions at infinitely many information sets (one for each sampled set of N MDPs). [sent-188, score-0.86]

72 , T}, with probability (1 − p), player 1’s strategy on iteration T* is part of an ε-Nash equilibrium with ε = (1 + 2/√p) · 2H∆|I1|√|A1| / √(pT), where H∆ is the maximum difference in total reward over H steps, and |I1| is the number of information sets for player 1. [sent-202, score-0.622]

73 The bestresponse for this subtree of the game involves simply evaluating the player’s current MDP policy on the N MDPs and choosing the least-favorable k. [sent-210, score-0.503]

74 The player’s regrets are then updated using the transitions and rewards for the selected MDP, resulting in a new policy for the next iteration. [sent-212, score-0.429]
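
Sentence 74's regret update is, at each information set, the standard regret-matching rule used by CFR-style algorithms; the sketch below shows that generic rule for a single information set and is not code taken from the paper.

```python
import numpy as np

def regret_matching_step(cumulative_regret, action_utilities, strategy):
    """One regret-matching update at a single information set.

    cumulative_regret: (A,) running sums of per-action regret.
    action_utilities:  (A,) counterfactual utility of each action this iteration.
    strategy:          (A,) current action probabilities.
    Returns (updated_cumulative_regret, next_strategy)."""
    expected_utility = strategy @ action_utilities
    cumulative_regret = cumulative_regret + (action_utilities - expected_utility)
    positive = np.maximum(cumulative_regret, 0.0)
    if positive.sum() > 0.0:
        next_strategy = positive / positive.sum()
    else:
        next_strategy = np.full_like(strategy, 1.0 / len(strategy))
    return cumulative_regret, next_strategy
```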

75 In finite horizon MDPs with no parameter uncertainty, an optimal policy exists in the space of Markovian policies (ΠM R ) — policies that depend only on the number of timesteps remaining and the current state, but not on the history of past states and actions. [sent-216, score-0.968]

76 The sequence of past states and actions provide information about the uncertain transition parameters, which is informative for future transitions. [sent-218, score-0.295]

77 For this case, optimal policies are not in general Markovian policies as they will depend upon the entire history of states and actions (ΠHR ). [sent-219, score-0.598]

78 , decision points) in an optimal policy is |I1| = |S|((|S||A|)^H − 1)/(|S||A| − 1), and so polynomial in the number of states and actions for any fixed horizon, but exponential in the horizon itself. [sent-222, score-0.618]
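
Plugging small, purely illustrative numbers into the count from sentence 78 shows the polynomial-in-|S|,|A| but exponential-in-H growth.

```python
def num_information_sets(S, A, H):
    """|I1| = |S| * ((|S||A|)^H - 1) / (|S||A| - 1), as in sentence 78."""
    return S * ((S * A) ** H - 1) // (S * A - 1)

# Illustrative sizes only (not the paper's experimental settings):
print(num_information_sets(S=10, A=3, H=4))   # 279310
print(num_information_sets(S=10, A=3, H=8))   # 226241379310 -- exponential in H
```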

79 Under reward uncertainty (where rewards are not observed by the agent while acting), the sequence of past states and actions is not informative, and so Markovian policies again suffice. [sent-228, score-0.753]

80 However, such an information-set structure for the player results in a game with imperfect recall. (Footnote 1: Markovian policies are also sufficient under a non-stationary uncertainty model, where the transition parameters are resampled independently on repeated visits to states; see the end of Section 2.) [sent-231, score-0.992]

81 Therefore, we can construct the extensive-form game with the player restricted to Markovian policies and still solve it with CFR-BR. [sent-237, score-0.603]

82 The total time complexity is O((H∆/ε)^2 |I1|^3 |A|^3 N log N / p^3), where |I1| ∈ O(|S|H) for arbitrary reward uncertainty and |I1| ∈ O(|S|^(H+1) |A|^H) for arbitrary transition and reward uncertainty. [sent-243, score-0.51]

83 The choice of T by Theorem 1 guarantees the policy is an ε/2-Nash approximation, which in turn guarantees the policy is within ε of optimal in the worst case, and so is an ε-approximation to a µk-of-N-robust policy. [sent-246, score-0.664]

84 Each iteration requires N policy evaluations each requiring O(|I1 ||A|) time; these are then sorted in O(N log N ) time; and finally the regret update in O(|I1 ||A|) time. [sent-247, score-0.388]

85 5 Non-Increasing Measures. We have defined a family of percentile measures, µk-of-N, that represent optimization objectives that differ in how much weight they place on different percentiles and can be solved efficiently. [sent-249, score-0.927]

86 A µ-robust policy can be approximated with high probability in time polynomial in {|A|, |S|, ∆, L, m, 1/ε, 1/p} for (i) arbitrary reward uncertainty, with time also polynomial in the horizon, or (ii) arbitrary transition and reward uncertainty with a fixed horizon. [sent-254, score-0.596]

87 , worst-case, expectation-maximization, and CVaR) satisfy the property that the weight placed on a particular percentile is never smaller than the weight placed on any larger percentile. [sent-258, score-0.543]

88 Percentile measures (η > 0), though, do not: they place infinitely more weight on the η percentile than on any of the percentiles less than η. [sent-260, score-0.818]
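
The distinction in sentences 85-88 can be made concrete with example densities over percentiles; the CVaR normalization shown is one common choice and may not match the paper's exact convention.

```latex
% Non-increasing percentile measures (weight never grows with eta):
%   expectation:            d\mu(\eta) = 1 \, d\eta                (uniform on [0,1])
%   worst case:             \mu = \delta_{0}                       (all mass at eta = 0)
%   CVaR at level \alpha:   d\mu(\eta) = \tfrac{1}{\alpha}\,\mathbf{1}[\eta \le \alpha]\, d\eta
% Not non-increasing:
%   VaR / percentile optimization at \eta > 0:  \mu = \delta_{\eta}
```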

89 Our results aim to demonstrate two things: first, that CFR-BR can find k-of-N policies for MDP problems with general uncertainty in rewards and transitions; and second, that optimizing for different percentile measures creates policies that differ accordingly. [sent-263, score-1.345]

90 Figure 3 panel titles and legend: Percentile Measure Densities; Comparison to 1-of-1 policy; legend entries 1-of-1, 1-of-5, 10-of-50, mean MDP. [sent-265, score-0.332]

91 Figure 3: Evaluation of k-of-N percentile measures on the diabetes management task (x-axis: Percentile, 0–100). [sent-267, score-0.709]

92 A good treatment policy keeps blood glucose in the moderate range all day. [sent-271, score-0.469]

93 The Dirichlet parameter vector is the product of a fixed set of per-state parameters with an MDP-wide multiplicative factor q ∼ Unif[1, 5] to simulate variation in patient sensitivity to insulin, and results in transition uncertainty between states that is not independent. [sent-273, score-0.366]
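
A sketch of the sampling scheme described in sentence 93; the per-(state, action) base parameters are hypothetical placeholders, since the paper's actual values are not reproduced here.

```python
import numpy as np

def sample_transition_model(base_alphas, rng):
    """Sample one uncertain MDP's transition probabilities: a single MDP-wide
    factor q ~ Unif[1, 5] scales fixed per-(state, action) Dirichlet parameters,
    so the transition uncertainty is coupled across states rather than independent.

    base_alphas: (S, A, S) array of positive Dirichlet base parameters.
    Returns P with P[s, a] a sampled next-state distribution."""
    q = rng.uniform(1.0, 5.0)                     # shared across the whole MDP
    S, A, _ = base_alphas.shape
    P = np.empty_like(base_alphas, dtype=float)
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(q * base_alphas[s, a])
    return P
```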

94 We used CFR-BR to find optimal policies for the 1-of-1, 1-of-5, and 10-of-50 percentile measures. [sent-275, score-0.776]

95 We also computed the policy that optimizes V_E(M)^π, that is, the optimal policy for the mean MDP. [sent-277, score-0.664]

96 To highlight the differences between these policies, we show the performance of the policies relative to the 1-of-1-robust policy over the full range of percentiles in Figure 3(right). [sent-279, score-0.761]

97 From the difference plot, we see that the optimal policy for the mean MDP, although optimal for the mean MDP’s specific parameters, does not perform well over the uncertainty distribution (as noted in [10]). [sent-280, score-0.497]

98 Because the 10-of-50 policy has a sharper drop-off in density at the 20th percentile compared to the 1-of-5 policy, we see that 10-of-50 policies give up more performance in higher percentile MDPs for a bit more performance in the lowest 20 percentile MDPs compared to the 1-of-5 policy. [sent-283, score-2.242]

99 7 Conclusion. This is the first work we are aware of to do robust policy optimization with general parameter uncertainty. [sent-284, score-0.461]

100 We believe this approach will be useful for adaptive treatment strategy optimization, where small sample sizes cause real parameter uncertainty and the short time horizons make even transition uncertainty tractable. [sent-286, score-0.578]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('percentile', 0.543), ('policy', 0.332), ('mdp', 0.276), ('policies', 0.233), ('mdps', 0.202), ('player', 0.199), ('percentiles', 0.196), ('game', 0.171), ('uncertainty', 0.165), ('reward', 0.13), ('objectives', 0.113), ('nash', 0.098), ('transition', 0.085), ('cfr', 0.084), ('imperfect', 0.083), ('games', 0.082), ('measures', 0.079), ('actions', 0.076), ('cvar', 0.076), ('rewards', 0.067), ('mannor', 0.067), ('shie', 0.065), ('horizon', 0.062), ('decision', 0.062), ('adversary', 0.061), ('johanson', 0.06), ('schizophrenia', 0.06), ('robust', 0.06), ('vm', 0.059), ('treatment', 0.059), ('regret', 0.056), ('diabetes', 0.056), ('states', 0.056), ('markovian', 0.052), ('zinkevich', 0.052), ('maker', 0.052), ('uncertain', 0.052), ('catie', 0.051), ('stroup', 0.051), ('clinical', 0.051), ('strategy', 0.05), ('density', 0.048), ('susan', 0.045), ('burch', 0.045), ('waugh', 0.045), ('glucose', 0.045), ('chance', 0.045), ('equilibrium', 0.044), ('robustness', 0.043), ('nominal', 0.043), ('optimization', 0.043), ('michael', 0.042), ('huan', 0.042), ('alberta', 0.042), ('maximizes', 0.042), ('delage', 0.039), ('bowling', 0.039), ('hr', 0.039), ('utility', 0.037), ('histories', 0.036), ('supplemental', 0.036), ('patient', 0.034), ('baking', 0.034), ('insulin', 0.034), ('meal', 0.034), ('swartz', 0.034), ('blood', 0.033), ('risk', 0.033), ('family', 0.032), ('management', 0.031), ('cumulative', 0.031), ('markov', 0.03), ('transitions', 0.03), ('unacceptable', 0.03), ('counterfactual', 0.03), ('lanctot', 0.03), ('mcevoy', 0.03), ('erick', 0.03), ('distributionally', 0.03), ('uncoupled', 0.03), ('subclass', 0.03), ('realization', 0.03), ('polynomial', 0.03), ('tractable', 0.03), ('argmax', 0.029), ('ats', 0.028), ('horizons', 0.028), ('martin', 0.028), ('var', 0.027), ('realizations', 0.027), ('measure', 0.027), ('kevin', 0.026), ('past', 0.026), ('parameter', 0.026), ('admit', 0.026), ('optimizing', 0.025), ('expected', 0.025), ('coherent', 0.025), ('interval', 0.024), ('strategies', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 348 nips-2012-Tractable Objectives for Robust Policy Optimization

Author: Katherine Chen, Michael Bowling

Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1

2 0.30997759 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes

Author: Takayuki Osogami

Abstract: We uncover relations between robust MDPs and risk-sensitive MDPs. The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. We also show that a risk-sensitive MDP of minimizing an iterated risk measure that is composed of certain coherent risk measures is equivalent to a robust MDP of minimizing the worst-case expectation when the possible deviations of uncertain parameters from their nominal values are characterized with a concave function. 1

3 0.3045271 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

Author: Teodor M. Moldovan, Pieter Abbeel

Abstract: The expected return is a widely used objective in decision making under uncertainty. Many algorithms, such as value iteration, have been proposed to optimize it. In risk-aware settings, however, the expected return is often not an appropriate objective to optimize. We propose a new optimization objective for risk-aware planning and show that it has desirable theoretical properties. We also draw connections to previously proposed objectives for risk-aware planing: minmax, exponential utility, percentile and mean minus variance. Our method applies to an extended class of Markov decision processes: we allow costs to be stochastic as long as they are bounded. Additionally, we present an efficient algorithm for optimizing the proposed objective. Synthetic and real-world experiments illustrate the effectiveness of our method, at scale. 1

4 0.27145544 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

5 0.26929462 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions

Author: Neil Burch, Marc Lanctot, Duane Szafron, Richard G. Gibson

Abstract: Counterfactual Regret Minimization (CFR) is a popular, iterative algorithm for computing strategies in extensive-form games. The Monte Carlo CFR (MCCFR) variants reduce the per iteration time cost of CFR by traversing a smaller, sampled portion of the tree. The previous most effective instances of MCCFR can still be very slow in games with many player actions since they sample every action for a given player. In this paper, we present a new MCCFR algorithm, Average Strategy Sampling (AS), that samples a subset of the player’s actions according to the player’s average strategy. Our new algorithm is inspired by a new, tighter bound on the number of iterations required by CFR to converge to a given solution quality. In addition, we prove a similar, tighter bound for AS and other popular MCCFR variants. Finally, we validate our work by demonstrating that AS converges faster than previous MCCFR algorithms in both no-limit poker and Bluff. 1

6 0.26597294 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

7 0.22018558 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

8 0.21629518 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

9 0.21215992 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

10 0.16561876 344 nips-2012-Timely Object Recognition

11 0.16442432 160 nips-2012-Imitation Learning by Coaching

12 0.16383962 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

13 0.16268642 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

14 0.15915753 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

15 0.15668905 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

16 0.14424184 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

17 0.14356001 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

18 0.13728209 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

19 0.12661156 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

20 0.11964826 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.188), (1, -0.484), (2, -0.012), (3, -0.071), (4, -0.065), (5, 0.025), (6, -0.0), (7, 0.007), (8, 0.016), (9, 0.005), (10, 0.017), (11, -0.01), (12, -0.047), (13, 0.022), (14, -0.011), (15, -0.059), (16, 0.042), (17, -0.005), (18, 0.004), (19, -0.01), (20, -0.04), (21, -0.016), (22, 0.045), (23, 0.011), (24, 0.062), (25, -0.008), (26, 0.024), (27, 0.072), (28, -0.08), (29, 0.149), (30, -0.074), (31, -0.093), (32, -0.19), (33, 0.082), (34, 0.024), (35, 0.108), (36, -0.033), (37, 0.131), (38, 0.031), (39, 0.11), (40, -0.019), (41, 0.012), (42, 0.013), (43, -0.08), (44, -0.098), (45, 0.013), (46, -0.072), (47, 0.018), (48, 0.036), (49, -0.057)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97092426 348 nips-2012-Tractable Objectives for Robust Policy Optimization

Author: Katherine Chen, Michael Bowling

Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1

2 0.84580338 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

Author: Teodor M. Moldovan, Pieter Abbeel

Abstract: The expected return is a widely used objective in decision making under uncertainty. Many algorithms, such as value iteration, have been proposed to optimize it. In risk-aware settings, however, the expected return is often not an appropriate objective to optimize. We propose a new optimization objective for risk-aware planning and show that it has desirable theoretical properties. We also draw connections to previously proposed objectives for risk-aware planing: minmax, exponential utility, percentile and mean minus variance. Our method applies to an extended class of Markov decision processes: we allow costs to be stochastic as long as they are bounded. Additionally, we present an efficient algorithm for optimizing the proposed objective. Synthetic and real-world experiments illustrate the effectiveness of our method, at scale. 1

3 0.83901799 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes

Author: Takayuki Osogami

Abstract: We uncover relations between robust MDPs and risk-sensitive MDPs. The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. We also show that a risk-sensitive MDP of minimizing an iterated risk measure that is composed of certain coherent risk measures is equivalent to a robust MDP of minimizing the worst-case expectation when the possible deviations of uncertain parameters from their nominal values are characterized with a concave function. 1

4 0.75895447 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Author: Bruno Scherrer, Boris Lesner

Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one can compute stationary policies that are 2γ/(1−γ)²-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to 2γ/(1−γ)-optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1

5 0.7277751 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

Author: Felipe Trevizan, Manuela Veloso

Abstract: Probabilistic planning captures the uncertainty of plan execution by probabilistically modeling the effects of actions in the environment, and therefore the probability of reaching different states from a given state and action. In order to compute a solution for a probabilistic planning problem, planners need to manage the uncertainty associated with the different paths from the initial state to a goal state. Several approaches to manage uncertainty were proposed, e.g., consider all paths at once, perform determinization of actions, and sampling. In this paper, we introduce trajectory-based short-sighted Stochastic Shortest Path Problems (SSPs), a novel approach to manage uncertainty for probabilistic planning problems in which states reachable with low probability are substituted by artificial goals that heuristically estimate their cost to reach a goal state. We also extend the theoretical results of Short-Sighted Probabilistic Planner (SSiPP) [1] by proving that SSiPP always finishes and is asymptotically optimal under sufficient conditions on the structure of short-sighted SSPs. We empirically compare SSiPP using trajectory-based short-sighted SSPs with the winners of the previous probabilistic planning competitions and other state-of-the-art planners in the triangle tireworld problems. Trajectory-based SSiPP outperforms all the competitors and is the only planner able to scale up to problem number 60, a problem in which the optimal solution contains approximately 10^70 states. 1

6 0.69967675 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

7 0.68728775 45 nips-2012-Approximating Equilibria in Sequential Auctions with Incomplete Information and Multi-Unit Demand

8 0.67171824 38 nips-2012-Algorithms for Learning Markov Field Policies

9 0.64088488 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

10 0.6376279 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

11 0.62655503 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions

12 0.61505044 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

13 0.61335903 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

14 0.6028474 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

15 0.59782284 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

16 0.57589513 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

17 0.57544982 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

18 0.55280823 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

19 0.54581279 160 nips-2012-Imitation Learning by Coaching

20 0.51842612 51 nips-2012-Bayesian Hierarchical Reinforcement Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.031), (21, 0.022), (27, 0.053), (38, 0.099), (42, 0.045), (53, 0.014), (54, 0.083), (55, 0.031), (59, 0.016), (69, 0.173), (71, 0.034), (74, 0.059), (76, 0.128), (80, 0.08), (92, 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82554054 348 nips-2012-Tractable Objectives for Robust Policy Optimization

Author: Katherine Chen, Michael Bowling

Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1

2 0.75005955 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

Author: James Lloyd, Peter Orbanz, Zoubin Ghahramani, Daniel M. Roy

Abstract: A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases. 1

3 0.74438161 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

Author: Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, Anatole V. Lilienfeld, Klaus-Robert Müller

Abstract: The accurate prediction of molecular energetics in chemical compound space is a crucial ingredient for rational compound design. The inherently graph-like, non-vectorial nature of molecular data gives rise to a unique and difficult machine learning problem. In this paper, we adopt a learning-from-scratch approach where quantum-mechanical molecular energies are predicted directly from the raw molecular geometry. The study suggests a benefit from setting flexible priors and enforcing invariance stochastically rather than structurally. Our results improve the state-of-the-art by a factor of almost three, bringing statistical methods one step closer to chemical accuracy. 1

4 0.74425393 344 nips-2012-Timely Object Recognition

Author: Sergey Karayev, Tobias Baumgartner, Mario Fritz, Trevor Darrell

Abstract: In a large visual multi-class detection framework, the timeliness of results can be crucial. Our method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time. Toward this goal, we formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which detector to deploy next. In contrast to previous work, our method significantly diverges from the predominant greedy strategies, and is able to learn to take actions with deferred values. We evaluate our method with a novel timeliness measure, computed as the area under an Average Precision vs. Time curve. Experiments are conducted on the PASCAL VOC object detection dataset. If execution is stopped when only half the detectors have been run, our method obtains 66% better AP than a random ordering, and 14% better performance than an intelligent baseline. On the timeliness measure, our method obtains at least 11% better performance. Our method is easily extensible, as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning. 1

5 0.74052984 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

Author: Paul Vernaza, Drew Bagnell

Abstract: Maximum entropy (MaxEnt) modeling is a popular choice for sequence analysis in applications such as natural language processing, where the sequences are embedded in discrete, tractably-sized spaces. We consider the problem of applying MaxEnt to distributions over paths in continuous spaces of high dimensionality— a problem for which inference is generally intractable. Our main contribution is to show that this intractability can be avoided as long as the constrained features possess a certain kind of low dimensional structure. In this case, we show that the associated partition function is symmetric and that this symmetry can be exploited to compute the partition function efficiently in a compressed form. Empirical results are given showing an application of our method to learning models of high-dimensional human motion capture data. 1

6 0.73366421 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions

7 0.73107457 39 nips-2012-Analog readout for optical reservoir computers

8 0.72707009 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

9 0.72549427 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

10 0.72505432 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

11 0.72483754 38 nips-2012-Algorithms for Learning Markov Field Policies

12 0.72114789 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

13 0.71942455 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

14 0.71811253 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

15 0.71543324 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

16 0.71534401 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

17 0.7126531 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

18 0.71265215 160 nips-2012-Imitation Learning by Coaching

19 0.7122187 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

20 0.70952803 198 nips-2012-Learning with Target Prior