nips nips2012 nips2012-297 knowledge-graph by maker-knowledge-mining

297 nips-2012-Robustness and risk-sensitivity in Markov decision processes


Source: pdf

Author: Takayuki Osogami

Abstract: We uncover relations between robust MDPs and risk-sensitive MDPs. The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. We also show that a risk-sensitive MDP of minimizing an iterated risk measure that is composed of certain coherent risk measures is equivalent to a robust MDP of minimizing the worst-case expectation when the possible deviations of uncertain parameters from their nominal values are characterized with a concave function. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Robustness and risk-sensitivity in Markov decision processes Takayuki Osogami IBM Research - Tokyo 5-6-52 Toyosu, Koto-ku, Tokyo, Japan osogami@jp. [sent-1, score-0.095]

2 Abstract We uncover relations between robust MDPs and risk-sensitive MDPs. [sent-3, score-0.17]

3 The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. [sent-4, score-0.437]

4 The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. [sent-5, score-0.505]

5 We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. [sent-6, score-0.554]

6 1 Introduction Robustness against uncertainties and sensitivity to risk are major issues that have been addressed in the recent development of the Markov decision process (MDP). [sent-8, score-0.297]

7 The robust MDP [3, 4, 10, 11, 12, 20, 21] deals with uncertainties in parameters; that is, some of the parameters of the MDP are not known exactly. [sent-9, score-0.178]

8 The objective of a robust MDP is to minimize a function for the worst case when the values of its parameters vary within a predefined set called an uncertainty set. [sent-10, score-0.375]

9 The standard objective function is the expected cumulative cost [11]. [sent-11, score-0.284]

10 When the uncertainty set is trivial, the robust MDP is reduced to the standard MDP [17]. [sent-12, score-0.233]

11 The objective of a risk-sensitive MDP is to minimize the value of a risk measure, such as the expected exponential utility [5, 7, 8, 15, 18], of the cumulative cost. [sent-14, score-0.486]

12 When the risk measure is expectation, the risk-sensitive MDP is reduced to the standard MDP. [sent-15, score-0.218]

13 The robust MDP and the risk-sensitive MDP have been developed independently. [sent-16, score-0.133]

14 Such unveiled relations will provide insights into the two models of MDPs. [sent-18, score-0.087]

15 For example, it is not always clear what it means to minimize the value of a risk measure or to minimize the worst case expected cumulative cost under an uncertainty set. [sent-19, score-0.64]

16 In particular, the iterated risk measure studied in [13, 14, 19] is defined recursively, which prevents an intuitive understanding of its meaning. [sent-20, score-0.402]

17 The unveiled relation to a robust MDP can allow us to understand what it means to minimize the value of an iterated risk measure in terms of uncertainties. [sent-21, score-0.642]

18 In addition, the optimal policy for a robust MDP is often found to be too conservative [3, 4, 10, 21], or it becomes intractable to find the optimal policy, particularly when the transition probabilities have uncertainties [3, 4, 10]. [sent-22, score-0.499]

19 The unveiled relations to a risk-sensitive MDP, for which the optimal policy can be found efficiently, can allow us to find the optimal robust policy efficiently while avoiding an overly conservative policy. [sent-23, score-0.596]

20 More specifically, the iterated risk measure of the risk-sensitive MDP is defined recursively with a class of coherent risk measures [9], and it evaluates the riskiness of the sum of the values of a coherent risk measure of the immediate costs. [sent-27, score-1.12]

21 The uncertainty set of the robust MDP is characterized by the use of a representation of the coherent risk measure. [sent-28, score-0.519]

22 More specifically, the expected exponential utility evaluates the riskiness of the sum of the values of an entropic risk measure [6] of the immediate costs. [sent-31, score-0.428]

23 The penalty function measures the deviation of the values of the probability mass functions from their nominal values using the Kullback-Leibler divergence. [sent-32, score-0.246]

24 2 Robust representations of iterated coherent risk measures Throughout this paper, we consider Markov decision processes over a finite horizon, so that there are N decision epochs. [sent-34, score-0.651]

25 We assume that a nominal transition probability, p0(s′|s, a), is associated with the transition from each state s ∈ Sn to each state s′ ∈ Sn+1 given that the action a ∈ A(s) is taken at s, for n = 0, . . . [sent-40, score-0.326]

26 For a robust MDP, the corresponding true transition probability, p(s′|s, a), has the uncertainty that will be specified in the sequel. [sent-44, score-0.288]

27 We assume that C(s, a) has a nominal probability distribution, but the true probability distribution for a robust MDP has the uncertainty that will be specified in the sequel. [sent-46, score-0.4]

28 1 Special case of the iterated conditional tail expectation We start by studying a robust MDP where the uncertainty is specified by the factor, α, such that 0 < α < 1, which determines the possible deviation from the nominal value. [sent-51, score-0.606]

29 Specifically, for each pair of s ∈ Sn and a ∈ A(s), the true transition probabilities are in the following uncertainty set: 0 ≤ p(s′|s, a) ≤ (1/α) p0(s′|s, a), ∀s′ ∈ Sn+1, and Σ_{s′∈Sn+1} p(s′|s, a) = 1. (1) [sent-52, score-0.155]

30 Here, we assume that the cost C(s, a) is deterministic and has no uncertainty. [sent-54, score-0.104]

31 Our key finding is that there is a risk-sensitive MDP that is equivalent to the robust MDP having the objective (2). [sent-56, score-0.239]

32 Specifically, consider the risk-sensitive MDP, where the transition probability is given by p0 , and the cost C(s, a) is deterministic given s and a. [sent-57, score-0.18]

33 In (4), ICTE^(N−i) denotes the ICTE of C̃(π) conditioned on the state at the i-th decision epoch. [sent-63, score-0.108]

34 CTEα is also known as conditional value at risk or average value at risk and is formally defined as follows for a random variable Y: CTEα[Y] ≡ ((1−β) E[Y | Y > Vα] + (β−α) Vα) / (1−α), (6) where Vα ≡ min{y | FY(y) ≥ α}, β ≡ FY(Vα), and FY is the cumulative distribution function of Y. [sent-66, score-0.475]

35 For a continuous Y, or more generally whenever there is no probability mass at Vα, we have CTEα[Y] = E[Y | Y > Vα]. [sent-67, score-0.09]
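As a concrete illustration of definition (6), here is a minimal sketch of CTEα for a finite discrete distribution; the function name cte and the (values, probs) input format are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def cte(values, probs, alpha):
    """Conditional tail expectation (CVaR) at level alpha for a discrete
    random variable taking `values` with probabilities `probs`, following (6):
    the average of the worst (1 - alpha) fraction of outcomes."""
    values, probs = np.asarray(values, float), np.asarray(probs, float)
    order = np.argsort(values)               # sort outcomes in increasing order
    v, p = values[order], probs[order]
    cdf = np.cumsum(p)
    k = int(np.searchsorted(cdf, alpha))     # V_alpha = v[k] is the alpha-quantile
    beta = cdf[k]                            # F_Y(V_alpha)
    tail = float((v[k + 1:] * p[k + 1:]).sum())   # (1 - beta) * E[Y | Y > V_alpha]
    return (tail + (beta - alpha) * v[k]) / (1.0 - alpha)

# Uniform distribution on {1, 2, 3, 4}: CTE_0.5 averages the worst half {3, 4}.
print(cte([1, 2, 3, 4], [0.25] * 4, 0.5))   # -> 3.5
```

With α = 0.5 on this four-point distribution, Vα = 2, β = 0.5, and the formula reduces to the average of the two largest outcomes.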

36 Specifically, let Ci^π be the cost incurred at the i-th epoch with policy π, so that C̃(π) = C0^π + · · · + C_{N−1}^π. [sent-69, score-0.203]

37 What (9) suggests is that the ICTE of the cumulative cost given Si can be represented by the cost already accumulated, C0^π + · · · + C_{i−1}^π, plus the maximum possible expected value of the sum of the cost incurred at Si and the ICTE of the cumulative cost to be incurred from the (i + 1)st epoch. [sent-71, score-0.593]

38 When the immediate cost from a state is deterministic given that state and the action from that state, the risk-sensitive MDP with the objective (3) is equivalent to the robust MDP with the objective (2). [sent-74, score-0.537]

39 Throughout, we say that a risk-sensitive MDP is equivalent to a robust MDP if the two MDPs have a common state space, and, regardless of the values of the parameters of the MDPs, the optimal action for one MDP coincides with that for the other for every state. [sent-75, score-0.253]

40 2 Relation between cost uncertainty and risk-sensitivity In addition to the transition probabilities, we now assume that the probability distribution of cost has uncertainty. [sent-77, score-0.314]

41 Because the uncertainty sets, (1) and (10), are both convex, the existing technique [11] can still be used to efficiently find the optimal policy with respect to min_π max_{p∈Up, f∈Uf} E_{p,f}[C̃(π)]. (11) (A footnote notes the translation invariance CTEα[Y + b] = CTEα[Y] + b for a random Y and a deterministic b.) [sent-79, score-0.362]

42 Again, our key finding is that there is a risk-sensitive MDP that is equivalent to the robust MDP having the objective (11). [sent-83, score-0.239]

43 To define the objective of the equivalent risk-sensitive MDP, let D(s, a) ≡ CTEα[C(s, a)] and let D̃(π) be the cumulative value of D(s, a) along the sequence of (s, a) with a policy π. [sent-84, score-0.337]

44 Then the objective of the equivalent risk-sensitive MDP is given by min_π ICTE^(N)[D̃(π)]. (12) [sent-85, score-0.144]

45 The risk-sensitive MDP with the objective (12) is equivalent to the robust MDP with the objective (11). [sent-88, score-0.306]

46 3 General case of coherent risk measures The robust MDPs considered in Section 2. [sent-90, score-0.43]

47 We now introduce a broader class of robust MDPs and equivalent risk-sensitive MDPs. [sent-93, score-0.172]

48 To define the broader class of robust MDPs, we study the uncertainty set of (1) and (10) in more detail. [sent-94, score-0.233]

49 Given a random variable that takes value vi with nominal probability pi for i = 1, . . . , m, [sent-95, score-0.166]

50 a step of finding the optimal robust policy calculates the maximum possible expected value: max_q q1v1 + · · · + qmvm subject to 0 ≤ qi ≤ (1/α) pi, ∀i = 1, . . . , m, and Σi qi = 1. (13) [sent-98, score-0.421]

51 Relaxing the constraints in (13), we obtain the following optimization problem, whose optimal solution is still given by (14): max_q q1v1 + · · · + qmvm subject to constraints of the form Σi qi ≤ g(Σi pi) over index sets. (15) [sent-115, score-0.131]
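The inner maximization in (13)-(14) has a simple structure: with 0 ≤ qi ≤ pi/α and Σi qi = 1, the maximizer piles as much probability mass as the cap allows onto the largest values, which is precisely how the conditional tail expectation arises. The sketch below is illustrative; the function and variable names are not from the paper.

```python
import numpy as np

def worst_case_expectation(values, nominal_probs, alpha):
    """Maximize sum_i q_i * v_i subject to 0 <= q_i <= p_i / alpha and
    sum_i q_i = 1, by greedily filling the caps from the largest value down."""
    v = np.asarray(values, float)
    p = np.asarray(nominal_probs, float)
    q = np.zeros_like(p)
    remaining = 1.0
    for i in np.argsort(-v):                 # largest values first
        q[i] = min(p[i] / alpha, remaining)  # respect the cap p_i / alpha
        remaining -= q[i]
        if remaining <= 0.0:
            break
    return float(q @ v), q

# Uniform distribution on {1, 2, 3, 4} with alpha = 0.5: each cap is 0.5,
# so all mass goes on the two largest outcomes and the value matches CTE_0.5.
val, q = worst_case_expectation([1, 2, 3, 4], [0.25] * 4, 0.5)
print(val, q)   # -> 3.5 [0.  0.  0.5 0.5]
```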

52 Otherwise (specifically, 1−pm < α ≤ 1), the uncertainty set can become too small as qi ≤ pi /α, ∀i. [sent-127, score-0.149]

53 With an appropriate g, we can consider a sufficiently large uncertainty set for the pessimistic cases (e. [sent-130, score-0.1]

54 To formally define the uncertainty set for p(s′|s, a), s ∈ Sn and a ∈ A(s), with the concave g, let Qp/p0(·) denote the quantile function of a random variable that takes value p(s′|s, a)/p0(s′|s, a) with probability p0(s′|s, a) for s′ ∈ Sn+1. [sent-135, score-0.175]

55 Then p(s′|s, a) and f(x|s, a) are in the uncertainty set iff we have, for 0 < t < 1, that ∫_{1−t}^{1} Qp/p0(u) du ≤ g(t) and ∫_{1−t}^{1} Qf/f0(u) du ≤ g(t). (16) [sent-137, score-0.1]

56 Now (7) suggests that expectation with respect to the q illustrated in Figure 1(a) is the CTE with parameter α with respect to the corresponding p. [sent-138, score-0.118]

57 It can be shown that the expectation with respect to the q illustrated in Figure 1(b) is a coherent risk measure, CRM, of the following form [9]: CRMH[Y] = ∫_0^1 CTEα[Y] dH(α), (17) for a nondecreasing function H such that H(0) = 0 and H(1) = 1, where Y denotes a generic random variable. [sent-139, score-0.357]
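When H is a step function with finitely many jumps, (17) reduces to a weighted mixture of CTEs and can be evaluated directly. The sketch below assumes such an H and repeats a small CTE helper so that it is self-contained; names and data formats are illustrative.

```python
import numpy as np

def cte(values, probs, alpha):
    """Conditional tail expectation at level alpha, as in (6)."""
    order = np.argsort(values)
    v, p = np.asarray(values, float)[order], np.asarray(probs, float)[order]
    cdf = np.cumsum(p)
    k = int(np.searchsorted(cdf, alpha))
    tail = float((v[k + 1:] * p[k + 1:]).sum())
    return (tail + (cdf[k] - alpha) * v[k]) / (1.0 - alpha)

def crm(values, probs, levels, weights):
    """CRM_H[Y] = sum_j w_j * CTE_{alpha_j}[Y], i.e. (17) when H jumps by w_j
    at alpha_j (the w_j are nonnegative and sum to one)."""
    return sum(w * cte(values, probs, a) for a, w in zip(levels, weights))

# H puts weight 0.5 at alpha = 0 (CTE_0 is the plain expectation)
# and weight 0.5 at alpha = 0.5.
print(crm([1, 2, 3, 4], [0.25] * 4, levels=[0.0, 0.5], weights=[0.5, 0.5]))  # -> 3.0
```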

58 Let K(s, a) ≡ CRMH[C(s, a)] and let K̃(π) be the cumulative value of K(s, a) along the sequence of (s, a) with a policy π. [sent-142, score-0.121]

59 We define an iterated coherent risk measure (ICRM) of K̃(π) as follows: ICRM_H^(N−i+1)[K̃(π)] ≡ CRMH[ ICRM_H^(N−i)[K̃(π) | Si] ], for i = 1, . . . (18) [sent-143, score-0.491]

60 This risk-sensitive MDP, whose objective is (19), is equivalent to the robust MDP with the objective (11) if dg(t)/dt = ∫_t^1 (1/s) dH(s) for 0 < t < 1. [sent-150, score-0.263]

61 The expectation with respect to the q illustrated in Figure 1(c) can be represented by r1 CRMα1 [·] + r2 CRMα2 [·] with respect to the corresponding p. [sent-152, score-0.118]

62 Notice that Bellman’s optimality equations are satisfied both for the robust MDP and for the risk-sensitive MDP under consideration. [sent-156, score-0.193]

63 For the robust MDP, Bellman’s optimality equations are established in [11]. [sent-157, score-0.193]

64 For our risk-sensitive MDP, note that the coherent risk measure satisfies strong monotonicity, translation-invariance, and positive homogeneity that are used to establish Bellman’s optimality equations in [13]. [sent-158, score-0.389]

65 A difference between the risk-sensitive MDP in [13] and our risk-sensitive MDP is that the former minimizes the value of an iterated risk measure for C̃, while the latter minimizes the value of an iterated risk measure (specifically, ICRM_H^(0)) for K̃. [sent-159, score-0.804]

66 The equivalence between our risk-sensitive MDP and our robust MDP can thus be established by showing that the two sets of Bellman’s optimality equations are equivalent. [sent-161, score-0.218]

67 For s ∈ Sn, Bellman’s optimality equation for our robust MDP is v(s) = min_{a∈A(s)} max_{p∈Up, f∈Uf} [ Σ_{x∈X(s,a)} x f(x|s, a) + Σ_{s′∈Sn+1} v(s′) p(s′|s, a) ], (21) where v(s) denotes the value function representing the worst-case expected cumulative cost from s. [sent-162, score-0.456]

68 By (30) and (31), we have v(s) = min_{a∈A(s)} ( CRMH[C0(s, a)] + CRMH[V(s, a)] ), (32) where the first CRMH is with respect to f0(·|s, a); the second is with respect to p0(·|s, a). [sent-186, score-0.092]

69 Because f0(·|s, a) is independent of the state at n + 1, the translation invariance of CRMH implies v(s) = min_{a∈A(s)} CRMH[ CRMH[C0(s, a)] + V(s, a) ], (33) where the inner CRMH is with respect to f0(·|s, a); the outer is with respect to p0(·|s, a). [sent-187, score-0.175]
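The recursion (32) can be carried out by backward dynamic programming: at each state, apply the risk measure to the immediate-cost distribution and to the distribution over next-stage values, and minimize over actions. The sketch below uses CTEα in place of a general CRMH (the special case where H is a point mass at α); the stage/transition/cost data layout is an illustrative assumption, not the paper's notation.

```python
import numpy as np

def cte(values, probs, alpha):
    """Conditional tail expectation at level alpha, as in (6)."""
    order = np.argsort(values)
    v, p = np.asarray(values, float)[order], np.asarray(probs, float)[order]
    cdf = np.cumsum(p)
    k = int(np.searchsorted(cdf, alpha))
    tail = float((v[k + 1:] * p[k + 1:]).sum())
    return (tail + (cdf[k] - alpha) * v[k]) / (1.0 - alpha)

def risk_sensitive_value_iteration(stages, trans, cost, alpha):
    """Backward recursion in the spirit of (32):
        v(s) = min_a ( CTE_alpha[C0(s, a)] + CTE_alpha[v(S_next)] ).
    stages : list of state lists, one per decision epoch, plus terminal states.
    trans  : trans[(s, a)] = {s_next: p0(s_next | s, a)}   (nominal transitions)
    cost   : cost[(s, a)]  = {c: f0(c | s, a)}             (nominal cost law)
    """
    v = {s: 0.0 for s in stages[-1]}          # terminal values are zero
    policy = {}
    for states in reversed(stages[:-1]):      # backward over decision epochs
        for s in states:
            best_a, best_val = None, np.inf
            for (ss, a) in trans:
                if ss != s:
                    continue
                c_vals, c_probs = zip(*cost[(s, a)].items())
                next_states, p_probs = zip(*trans[(s, a)].items())
                val = cte(c_vals, c_probs, alpha) \
                    + cte([v[s2] for s2 in next_states], p_probs, alpha)
                if val < best_val:
                    best_a, best_val = a, val
            v[s], policy[s] = best_val, best_a
        # transitions only reach the next stage, so in-place updates are safe
    return v, policy
```

Replacing cte with the crm mixture from the previous sketch would give the general CRMH form of the recursion.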

70 3 Robust representations of expected exponential utilities In this section, we study risk-sensitive MDPs whose objectives are defined with expected exponential utilities. [sent-189, score-0.098]

71 We will see that there are robust MDPs that are equivalent to these risk-sensitive MDPs. [sent-190, score-0.172]

72 We start with the standard risk-sensitive MDP [5, 7, 8, 15, 18] whose objective is to minimize E[exp(γ C̃(π))] for γ > 0. [sent-191, score-0.097]

73 Because γ > 0, minimizing E[exp(γ C̃(π))] is equivalent to minimizing an entropic risk measure (ERM) [6, 13]: ERMγ[C̃(π)] ≡ (1/γ) ln E[exp(γ C̃(π))]. [sent-192, score-0.323]

74 It is now evident that the risk-sensitive MDP with the objective of minimizing E[exp(γ C̃(π))] is equivalent to a “robust” MDP with the objective of minimizing Eq[C̃(π)] − (1/γ) KL(q||q0) for the worst choice of q ∈ P(q0), where q0 denotes the probability mass function for C̃(π). [sent-199, score-0.39]
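The identity behind this equivalence can be checked numerically on a small discrete distribution: the worst-case q is the exponentially tilted version of q0, and the KL-penalized expectation under it reproduces ERMγ. The 1/γ scaling of the penalty follows the standard dual form of the entropic risk measure; the script below is only an illustrative sketch.

```python
import numpy as np

def erm(values, probs, gamma):
    """Entropic risk measure: (1/gamma) * ln E[exp(gamma * Y)]."""
    v, p = np.asarray(values, float), np.asarray(probs, float)
    return float(np.log(np.sum(p * np.exp(gamma * v))) / gamma)

def kl(q, q0):
    """Kullback-Leibler divergence KL(q || q0) for discrete distributions."""
    q, q0 = np.asarray(q, float), np.asarray(q0, float)
    return float(np.sum(q * np.log(q / q0)))

values = np.array([0.0, 1.0, 10.0])   # possible cumulative costs
q0 = np.array([0.5, 0.4, 0.1])        # nominal distribution
gamma = 0.3

# The maximizing q in the robust representation is the tilted distribution.
q_star = q0 * np.exp(gamma * values)
q_star /= q_star.sum()

lhs = erm(values, q0, gamma)
rhs = float(q_star @ values) - kl(q_star, q0) / gamma
print(lhs, rhs)                        # the two values agree up to rounding
```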

75 Here, the uncertainty is in the distribution of the cumulative cost, and it is nontrivial how this uncertainty is related to the uncertainty in the parameters, p and f , of the MDP. [sent-200, score-0.421]

76 Our goal is to explicitly relate the risk-sensitive MDP of minimizing E[exp(γ C̃(π))] to uncertainties in the parameters of the MDP. [sent-201, score-0.086]

77 For a moment, we assume that C(s, a) has no uncertainty and is deterministic given s and a, which will be relaxed later. [sent-202, score-0.135]

78 Let p0^π(si+1|si) be the nominal transition probability from si ∈ Si to si+1 ∈ Si+1 for i = 0, . . . [sent-204, score-0.343]

79 By the translation invariance and the recursiveness of ERM [13], we have ERMγ[C̃(π)] = C0^π + ERMγ[ C1^π + ERMγ[ Σ_{i=2}^{N−1} Ci^π | S1 ] ], (35) where the inner ERM is with respect to pπ(·|S1); the outer is with respect to pπ(·|s0). [sent-208, score-0.104]

80 Let f0 (·|s) be the nominal probability mass function for the immediate cost from a state s under a policy π. [sent-219, score-0.463]

81 Consider the risk-sensitive MDP with the following objective: min_π ERMγ[L̃(π)], (38) where L̃(π) is the cumulative value of L(s, a) ≡ ERMγ[C(s, a)] along the sequence of (s, a) with a policy π. [sent-220, score-0.269]
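Because ERMγ is translation invariant and recursive, the objective (38) lends itself to the same backward recursion as before, with ERMγ in place of the expectation at every stage. The sketch below is a generic exponential-utility recursion under the nominal model; the data layout mirrors the earlier value-iteration sketch and is an assumption rather than the paper's notation.

```python
import numpy as np

def erm(values, probs, gamma):
    """Entropic risk measure: (1/gamma) * ln E[exp(gamma * Y)]."""
    v, p = np.asarray(values, float), np.asarray(probs, float)
    return float(np.log(np.sum(p * np.exp(gamma * v))) / gamma)

def erm_value_iteration(stages, trans, cost, gamma):
    """Backward recursion with ERM_gamma replacing the expectation:
        v(s) = min_a ( ERM_gamma[C(s, a)] + ERM_gamma[v(S_next)] ),
    where the first ERM is over the nominal cost law f0(.|s, a) and the
    second over the nominal transition law p0(.|s, a)."""
    v = {s: 0.0 for s in stages[-1]}          # terminal values are zero
    policy = {}
    for states in reversed(stages[:-1]):      # backward over decision epochs
        for s in states:
            best_a, best_val = None, np.inf
            for (ss, a) in trans:
                if ss != s:
                    continue
                c_vals, c_probs = zip(*cost[(s, a)].items())
                next_states, p_probs = zip(*trans[(s, a)].items())
                val = erm(c_vals, c_probs, gamma) \
                    + erm([v[s2] for s2 in next_states], p_probs, gamma)
                if val < best_val:
                    best_a, best_val = a, val
            v[s], policy[s] = best_val, best_a
    return v, policy
```

Relative to the CTE-based sketch earlier, only the risk functional changes; the dynamic-programming structure is identical.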

82 The risk-sensitive MDP with the objective (38) is equivalent to the robust MDP with the following objective, where f^π ∈ P(f0^π) is defined analogously to p^π ∈ P(p0^π) in Theorem 4: min_π max_{p^π∈P(p0^π), f^π∈P(f0^π)} { E_{p^π,f^π}[ Σ_{i=0}^{N−1} Ci^π ] − (1/γ) E_{p^π}[ Σ_{i=0}^{N−2} KL(p^π(·|Si)||p0^π(·|Si)) + Σ_{i=0}^{N−1} KL(f^π(·|Si)||f0^π(·|Si)) ] }. (39) [sent-223, score-0.315]

83 4 Conclusion We have shown relations between risk-sensitive MDPs and robust MDPs. [sent-224, score-0.17]

84 Because ERM is also an iterated risk measure [13], the objectives of the risk-sensitive MDPs studied in this paper are all with respect to some iterated risk measures. [sent-225, score-0.79]

85 The significance of iterated risk measures is discussed intensively in [13]: they can represent preferences that cannot be represented by the standard expected exponential utility and yet allow efficient optimization and consistent decision making. [sent-226, score-0.558]

86 While the prior work [13, 14, 19] minimizes the iterated risk measure of the cumulative cost (C̃(π) in Section 2), our study of the relation to a robust MDP suggests that one might want to minimize the iterated risk measure of the sum of the values of risk measures for immediate costs (e. [sent-227, score-1.428]

87 3 or L(π) in Section 3), because the latter is related to the robustness against uncertainty in cost. [sent-230, score-0.126]

88 This means that the optimal policy for the robust MDP studied in this paper can be found quite efficiently. [sent-232, score-0.266]

89 In particular, the robust MDP in Theorem 5 might not seem to allow an efficient optimization without the knowledge of the relation to the corresponding risk-sensitive MDP, whose optimal policy is readily available with dynamic programming. [sent-233, score-0.313]

90 Overall, the relation to a robust MDP can provide strong motivation for the corresponding risk-sensitive MDP and vice versa. [sent-234, score-0.16]

91 For simplicity, the uncertainty sets in Section 2 are characterized by a single parameter, α, or a single function, g, but it is trivial to extend our results to the cases where the uncertainty sets are defined differently depending on the particular states, actions, and other elements of the MDP. [sent-235, score-0.22]

92 In such cases, the objective of the corresponding risk-sensitive MDP is composed of various risk measures. [sent-236, score-0.244]

93 The uncertainty set in Section 3 depends only on the support of the nominal probability mass function. [sent-237, score-0.315]

94 The penalty for the deviation from the nominal value can be adjusted with a single parameter, γ, but it is also trivial to extend our results to cases where this parameter varies depending on the particular elements of the MDP. [sent-238, score-0.125]

95 In such cases, the objective of the corresponding risk-sensitive MDP is an iterated risk measure composed of ERM having varying parameters. [sent-239, score-0.469]

96 It would also be an interesting direction to extend our results to convex risk measures, which allow robust representations. [sent-240, score-0.31]

97 Robust control of Markov decision processes with uncertain transition matrices. [sent-304, score-0.193]

98 Robustness in Markov decision problems with uncertain transition matrices. [sent-310, score-0.173]

99 Iterated risk measures for risk-sensitive Markov decision processes with discounted cost. [sent-318, score-0.303]

100 On terminating Markov decision processes with a risk-averse objective function. [sent-328, score-0.162]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mdp', 0.58), ('icte', 0.316), ('crmh', 0.3), ('cte', 0.283), ('erm', 0.216), ('iterated', 0.184), ('risk', 0.177), ('si', 0.142), ('uf', 0.133), ('robust', 0.133), ('nominal', 0.125), ('cumulative', 0.121), ('sn', 0.119), ('policy', 0.11), ('uncertainty', 0.1), ('mdps', 0.09), ('coherent', 0.089), ('ri', 0.085), ('icrmh', 0.083), ('dh', 0.081), ('decision', 0.075), ('cost', 0.069), ('mass', 0.069), ('objective', 0.067), ('bellman', 0.061), ('transition', 0.055), ('qm', 0.054), ('ci', 0.053), ('osogami', 0.05), ('unveiled', 0.05), ('worst', 0.045), ('uncertainties', 0.045), ('uncertain', 0.043), ('utility', 0.042), ('measure', 0.041), ('minimizing', 0.041), ('crm', 0.041), ('expectation', 0.041), ('equivalent', 0.039), ('ep', 0.039), ('optimality', 0.039), ('min', 0.038), ('kl', 0.038), ('analogously', 0.038), ('relations', 0.037), ('immediate', 0.036), ('deterministic', 0.035), ('markov', 0.034), ('state', 0.033), ('riskiness', 0.033), ('concave', 0.032), ('eq', 0.032), ('measures', 0.031), ('dg', 0.03), ('tokyo', 0.03), ('minimize', 0.03), ('hoboken', 0.029), ('fy', 0.029), ('max', 0.029), ('qi', 0.029), ('relation', 0.027), ('expected', 0.027), ('respect', 0.027), ('nilim', 0.027), ('equality', 0.027), ('robustness', 0.026), ('outer', 0.026), ('cally', 0.026), ('evaluates', 0.025), ('delage', 0.025), ('entropic', 0.025), ('equivalence', 0.025), ('action', 0.025), ('vm', 0.025), ('theorem', 0.024), ('exchanging', 0.024), ('incurred', 0.024), ('dt', 0.024), ('translation', 0.024), ('qf', 0.023), ('optimal', 0.023), ('tail', 0.023), ('illustrated', 0.023), ('editors', 0.022), ('induction', 0.022), ('quantile', 0.022), ('maximizer', 0.022), ('establish', 0.022), ('exponential', 0.022), ('percentile', 0.022), ('equations', 0.021), ('programming', 0.021), ('probability', 0.021), ('processes', 0.02), ('recursively', 0.02), ('dynamic', 0.02), ('characterized', 0.02), ('pi', 0.02), ('edition', 0.02), ('optimistic', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes

Author: Takayuki Osogami

Abstract: We uncover relations between robust MDPs and risk-sensitive MDPs. The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. We also show that a risk-sensitive MDP of minimizing an iterated risk measure that is composed of certain coherent risk measures is equivalent to a robust MDP of minimizing the worst-case expectation when the possible deviations of uncertain parameters from their nominal values are characterized with a concave function. 1

2 0.30997759 348 nips-2012-Tractable Objectives for Robust Policy Optimization

Author: Katherine Chen, Michael Bowling

Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1

3 0.16336142 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

Author: Teodor M. Moldovan, Pieter Abbeel

Abstract: The expected return is a widely used objective in decision making under uncertainty. Many algorithms, such as value iteration, have been proposed to optimize it. In risk-aware settings, however, the expected return is often not an appropriate objective to optimize. We propose a new optimization objective for risk-aware planning and show that it has desirable theoretical properties. We also draw connections to previously proposed objectives for risk-aware planing: minmax, exponential utility, percentile and mean minus variance. Our method applies to an extended class of Markov decision processes: we allow costs to be stochastic as long as they are bounded. Additionally, we present an efficient algorithm for optimizing the proposed objective. Synthetic and real-world experiments illustrate the effectiveness of our method, at scale. 1

4 0.1389816 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

Author: Arthur Guez, David Silver, Peter Dayan

Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayesoptimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration. 1

5 0.1296289 45 nips-2012-Approximating Equilibria in Sequential Auctions with Incomplete Information and Multi-Unit Demand

Author: Amy Greenwald, Jiacui Li, Eric Sodomka

Abstract: In many large economic markets, goods are sold through sequential auctions. Examples include eBay, online ad auctions, wireless spectrum auctions, and the Dutch flower auctions. In this paper, we combine methods from game theory and decision theory to search for approximate equilibria in sequential auction domains, in which bidders do not know their opponents’ values for goods, bidders only partially observe the actions of their opponents’, and bidders demand multiple goods. We restrict attention to two-phased strategies: first predict (i.e., learn); second, optimize. We use best-reply dynamics [4] for prediction (i.e., to predict other bidders’ strategies), and then assuming fixed other-bidder strategies, we estimate and solve the ensuing Markov decision processes (MDP) [18] for optimization. We exploit auction properties to represent the MDP in a more compact state space, and we use Monte Carlo simulation to make estimating the MDP tractable. We show how equilibria found using our search procedure compare to known equilibria for simpler auction domains, and we approximate an equilibrium for a more complex auction domain where analytical solutions are unknown. 1

6 0.12596634 38 nips-2012-Algorithms for Learning Markov Field Policies

7 0.1242004 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

8 0.10410042 271 nips-2012-Pointwise Tracking the Optimal Regression Function

9 0.10114289 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

10 0.10014346 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

11 0.098806739 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

12 0.097190753 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

13 0.091284126 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

14 0.085464075 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

15 0.081887521 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

16 0.081508353 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task

17 0.081488729 292 nips-2012-Regularized Off-Policy TD-Learning

18 0.079854392 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

19 0.074637882 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

20 0.074378513 16 nips-2012-A Polynomial-time Form of Robust Regression


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.145), (1, -0.252), (2, 0.028), (3, -0.057), (4, -0.014), (5, 0.018), (6, -0.004), (7, 0.042), (8, -0.0), (9, -0.005), (10, 0.028), (11, 0.004), (12, -0.045), (13, 0.005), (14, -0.011), (15, -0.09), (16, 0.016), (17, -0.017), (18, 0.036), (19, 0.015), (20, -0.036), (21, 0.007), (22, 0.052), (23, 0.064), (24, -0.01), (25, -0.014), (26, 0.074), (27, 0.082), (28, -0.098), (29, 0.174), (30, -0.045), (31, -0.071), (32, -0.2), (33, 0.069), (34, 0.019), (35, 0.077), (36, -0.101), (37, 0.139), (38, 0.089), (39, 0.054), (40, -0.058), (41, 0.082), (42, 0.052), (43, -0.076), (44, -0.044), (45, -0.045), (46, -0.096), (47, 0.013), (48, -0.11), (49, -0.002)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97111851 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes

Author: Takayuki Osogami

Abstract: We uncover relations between robust MDPs and risk-sensitive MDPs. The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. We also show that a risk-sensitive MDP of minimizing an iterated risk measure that is composed of certain coherent risk measures is equivalent to a robust MDP of minimizing the worst-case expectation when the possible deviations of uncertain parameters from their nominal values are characterized with a concave function. 1

2 0.83819854 348 nips-2012-Tractable Objectives for Robust Policy Optimization

Author: Katherine Chen, Michael Bowling

Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1

3 0.82913303 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

Author: Teodor M. Moldovan, Pieter Abbeel

Abstract: The expected return is a widely used objective in decision making under uncertainty. Many algorithms, such as value iteration, have been proposed to optimize it. In risk-aware settings, however, the expected return is often not an appropriate objective to optimize. We propose a new optimization objective for risk-aware planning and show that it has desirable theoretical properties. We also draw connections to previously proposed objectives for risk-aware planing: minmax, exponential utility, percentile and mean minus variance. Our method applies to an extended class of Markov decision processes: we allow costs to be stochastic as long as they are bounded. Additionally, we present an efficient algorithm for optimizing the proposed objective. Synthetic and real-world experiments illustrate the effectiveness of our method, at scale. 1

4 0.73177516 45 nips-2012-Approximating Equilibria in Sequential Auctions with Incomplete Information and Multi-Unit Demand

Author: Amy Greenwald, Jiacui Li, Eric Sodomka

Abstract: In many large economic markets, goods are sold through sequential auctions. Examples include eBay, online ad auctions, wireless spectrum auctions, and the Dutch flower auctions. In this paper, we combine methods from game theory and decision theory to search for approximate equilibria in sequential auction domains, in which bidders do not know their opponents’ values for goods, bidders only partially observe the actions of their opponents’, and bidders demand multiple goods. We restrict attention to two-phased strategies: first predict (i.e., learn); second, optimize. We use best-reply dynamics [4] for prediction (i.e., to predict other bidders’ strategies), and then assuming fixed other-bidder strategies, we estimate and solve the ensuing Markov decision processes (MDP) [18] for optimization. We exploit auction properties to represent the MDP in a more compact state space, and we use Monte Carlo simulation to make estimating the MDP tractable. We show how equilibria found using our search procedure compare to known equilibria for simpler auction domains, and we approximate an equilibrium for a more complex auction domain where analytical solutions are unknown. 1

5 0.65807289 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

Author: Doina Precup, Joelle Pineau, Andre S. Barreto

Abstract: Kernel-based stochastic factorization (KBSF) is an algorithm for solving reinforcement learning tasks with continuous state spaces which builds a Markov decision process (MDP) based on a set of sample transitions. What sets KBSF apart from other kernel-based approaches is the fact that the size of its MDP is independent of the number of transitions, which makes it possible to control the trade-off between the quality of the resulting approximation and the associated computational cost. However, KBSF’s memory usage grows linearly with the number of transitions, precluding its application in scenarios where a large amount of data must be processed. In this paper we show that it is possible to construct KBSF’s MDP in a fully incremental way, thus freeing the space complexity of this algorithm from its dependence on the number of sample transitions. The incremental version of KBSF is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve continuous MDPs in both off-line and on-line regimes. We present theoretical results showing that KBSF can approximate the value function that would be computed by conventional kernel-based learning with arbitrary precision. We empirically demonstrate the effectiveness of the proposed algorithm in the challenging threepole balancing task, in which the ability to process a large number of transitions is crucial for success. 1

6 0.5884738 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

7 0.58413428 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

8 0.54765403 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

9 0.54696107 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

10 0.54672199 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

11 0.54179043 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

12 0.48545188 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

13 0.46361384 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions

14 0.43955794 38 nips-2012-Algorithms for Learning Markov Field Policies

15 0.4387385 358 nips-2012-Value Pursuit Iteration

16 0.43723729 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

17 0.38314936 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task

18 0.37601992 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

19 0.3726503 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

20 0.36964712 161 nips-2012-Interpreting prediction markets: a stochastic approach


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.014), (21, 0.021), (35, 0.01), (38, 0.121), (39, 0.014), (42, 0.035), (54, 0.046), (55, 0.017), (71, 0.294), (74, 0.041), (76, 0.134), (80, 0.096), (92, 0.049)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77677613 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes

Author: Takayuki Osogami

Abstract: We uncover relations between robust MDPs and risk-sensitive MDPs. The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. We also show that a risk-sensitive MDP of minimizing an iterated risk measure that is composed of certain coherent risk measures is equivalent to a robust MDP of minimizing the worst-case expectation when the possible deviations of uncertain parameters from their nominal values are characterized with a concave function. 1

2 0.71137673 198 nips-2012-Learning with Target Prior

Author: Zuoguan Wang, Siwei Lyu, Gerwin Schalk, Qiang Ji

Abstract: In the conventional approaches for supervised parametric learning, relations between data and target variables are provided through training sets consisting of pairs of corresponded data and target variables. In this work, we describe a new learning scheme for parametric learning, in which the target variables y can be modeled with a prior model p(y) and the relations between data and target variables are estimated with p(y) and a set of uncorresponded data X in training. We term this method as learning with target priors (LTP). Specifically, LTP learning seeks parameter θ that maximizes the log likelihood of fθ (X) on a uncorresponded training set with regards to p(y). Compared to the conventional (semi)supervised learning approach, LTP can make efficient use of prior knowledge of the target variables in the form of probabilistic distributions, and thus removes/reduces the reliance on training data in learning. Compared to the Bayesian approach, the learned parametric regressor in LTP can be more efficiently implemented and deployed in tasks where running efficiency is critical. We demonstrate the effectiveness of the proposed approach on parametric regression tasks for BCI signal decoding and pose estimation from video. 1

3 0.67643815 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

Author: Emanuele Coviello, Gert R. Lanckriet, Antoni B. Chan

Abstract: In this paper, we derive a novel algorithm to cluster hidden Markov models (HMMs) according to their probability distributions. We propose a variational hierarchical EM algorithm that i) clusters a given collection of HMMs into groups of HMMs that are similar, in terms of the distributions they represent, and ii) characterizes each group by a “cluster center”, i.e., a novel HMM that is representative for the group. We illustrate the benefits of the proposed algorithm on hierarchical clustering of motion capture sequences as well as on automatic music tagging. 1

4 0.6580137 179 nips-2012-Learning Manifolds with K-Means and K-Flats

Author: Guillermo Canas, Tomaso Poggio, Lorenzo Rosasco

Abstract: We study the problem of estimating a manifold from random samples. In particular, we consider piecewise constant and piecewise linear estimators induced by k-means and k-flats, and analyze their performance. We extend previous results for k-means in two separate directions. First, we provide new results for k-means reconstruction on manifolds and, secondly, we prove reconstruction bounds for higher-order approximation (k-flats), for which no known results were previously available. While the results for k-means are novel, some of the technical tools are well-established in the literature. In the case of k-flats, both the results and the mathematical tools are new. 1

5 0.62230015 348 nips-2012-Tractable Objectives for Robust Policy Optimization

Author: Katherine Chen, Michael Bowling

Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1

6 0.59703869 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

7 0.59449893 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics

8 0.59396547 292 nips-2012-Regularized Off-Policy TD-Learning

9 0.59393299 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

10 0.59355921 38 nips-2012-Algorithms for Learning Markov Field Policies

11 0.59305406 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

12 0.59213424 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

13 0.59183747 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

14 0.59041029 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

15 0.58973467 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

16 0.58947796 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

17 0.5890224 197 nips-2012-Learning with Recursive Perceptual Representations

18 0.58848321 160 nips-2012-Imitation Learning by Coaching

19 0.58847767 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

20 0.58781189 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models