nips nips2012 nips2012-297 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Takayuki Osogami
Abstract: We uncover relations between robust MDPs and risk-sensitive MDPs. The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. We also show that a risk-sensitive MDP of minimizing an iterated risk measure that is composed of certain coherent risk measures is equivalent to a robust MDP of minimizing the worst-case expectation when the possible deviations of uncertain parameters from their nominal values are characterized with a concave function. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Robustness and risk-sensitivity in Markov decision processes Takayuki Osogami IBM Research - Tokyo 5-6-52 Toyosu, Koto-ku, Tokyo, Japan osogami@jp.ibm.com [sent-1, score-0.095]
2 Abstract We uncover relations between robust MDPs and risk-sensitive MDPs. [sent-3, score-0.17]
3 The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. [sent-4, score-0.437]
4 The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. [sent-5, score-0.505]
5 We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. [sent-6, score-0.554]
6 1 Introduction Robustness against uncertainties and sensitivity to risk are major issues that have been addressed in recent development of the Markov decision process (MDP). [sent-8, score-0.297]
7 The robust MDP [3, 4, 10, 11, 12, 20, 21] deals with uncertainties in parameters; that is, some of the parameters of the MDP are not known exactly. [sent-9, score-0.178]
8 The objective of a robust MDP is to minimize a function for the worst case when the values of its parameters vary within a predefined set called an uncertainty set. [sent-10, score-0.375]
9 The standard objective function is the expected cumulative cost [11]. [sent-11, score-0.284]
10 When the uncertainty set is trivial, the robust MDP is reduced to the standard MDP [17]. [sent-12, score-0.233]
11 The objective of a risk-sensitive MDP is to minimize the value of a risk measure, such as the expected exponential utility [5, 7, 8, 15, 18], of the cumulative cost. [sent-14, score-0.486]
12 When the risk measure is expectation, the risk-sensitive MDP is reduced to the standard MDP. [sent-15, score-0.218]
13 The robust MDP and the risk-sensitive MDP have been developed independently. [sent-16, score-0.133]
14 Such unveiled relations will provide insights into the two models of MDPs. [sent-18, score-0.087]
15 For example, it is not always clear what it means to minimize the value of a risk measure or to minimize the worst case expected cumulative cost under an uncertainty set. [sent-19, score-0.64]
16 In particular, the iterated risk measure studied in [13, 14, 19] is defined recursively, which prevents an intuitive understanding of its meaning. [sent-20, score-0.402]
17 The unveiled relation to a robust MDP can allow us to understand what it means to minimize the value of an iterated risk measure in terms of uncertainties. [sent-21, score-0.642]
18 In addition, the optimal policy for a robust MDP is often found to be too conservative [3, 4, 10, 21], or it becomes intractable to find the optimal policy, particularly when the transition probabilities have uncertainties [3, 4, 10]. [sent-22, score-0.499]
19 The unveiled relations to a risk-sensitive MDP, for which the optimal policy can be found efficiently, can allow us to find the optimal robust policy efficiently while avoiding an overly conservative policy. [sent-23, score-0.596]
20 More specifically, the iterated risk measure of the risk-sensitive MDP is defined recursively with a class of coherent risk measures [9], and it evaluates the riskiness of the sum of the values of a coherent risk measure of the immediate costs. [sent-27, score-1.12]
21 The uncertainty set of the robust MDP is characterized by the use of a representation of the coherent risk measure. [sent-28, score-0.519]
22 More specifically, the expected exponential utility evaluates the riskiness of the sum of the values of an entropic risk measure [6] of the immediate costs. [sent-31, score-0.428]
23 The penalty function measures the deviation of the values of the probability mass functions from their nominal values using the Kullback-Leibler divergence. [sent-32, score-0.246]
24 2 Robust representations of iterated coherent risk measures Throughout this paper, we consider Markov decision processes over a finite horizon, so that there are N decision epochs. [sent-34, score-0.651]
25 We assume that a nominal transition probability, p0(s′|s, a), is associated with the transition from each state s ∈ Sn to each state s′ ∈ Sn+1 given that the action a ∈ A(s) is taken at s, for n = 0, . [sent-40, score-0.326]
26 For a robust MDP, the corresponding true transition probability, p(s′|s, a), has the uncertainty that will be specified in the sequel. [sent-44, score-0.288]
27 We assume that C(s, a) has a nominal probability distribution, but the true probability distribution for a robust MDP has the uncertainty that will be specified in the sequel. [sent-46, score-0.4]
28 2.1 Special case of the iterated conditional tail expectation We start by studying a robust MDP where the uncertainty is specified by the factor, α, such that 0 < α < 1, which determines the possible deviation from the nominal value. [sent-51, score-0.606]
29 Specifically, for each pair of s ∈ Sn and a ∈ A(s), the true transition probabilities are in the following uncertainty set: 0 ≤ p(s′|s, a) ≤ p0(s′|s, a)/α, ∀s′ ∈ Sn+1, and Σ_{s′∈Sn+1} p(s′|s, a) = 1. (1) [sent-52, score-0.155]
30 Throughout Section 2.1, we assume that the cost C(s, a) is deterministic and has no uncertainty. [sent-54, score-0.104]
31 Our key finding is that there is a risk-sensitive MDP that is equivalent to the robust MDP having the objective (2). [sent-56, score-0.239]
32 Specifically, consider the risk-sensitive MDP, where the transition probability is given by p0 , and the cost C(s, a) is deterministic given s and a. [sent-57, score-0.18]
33 In (4), ICTE(N−i) denotes the ICTE of C̃(π) conditioned on the state at the i-th decision epoch. [sent-63, score-0.108]
34 CTEα is also known as conditional value at risk or average value at risk and is formally defined as follows for a random variable Y: CTEα[Y] ≡ ((1 − β)E[Y | Y > Vα] + (β − α)Vα) / (1 − α), (6) where Vα ≡ min{y | FY(y) ≥ α}, β ≡ FY(Vα), and FY is the cumulative distribution function of Y. [sent-66, score-0.475]
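As a concrete illustration of (6), the following sketch computes CTEα for a discrete random variable directly from its probability mass function. It is our own minimal implementation of the formula above, not code from the paper; the example values and probabilities are invented.

```python
import numpy as np

def cte(values, probs, alpha):
    """Conditional tail expectation per (6):
    CTE_alpha[Y] = ((1 - beta) E[Y | Y > V_alpha] + (beta - alpha) V_alpha) / (1 - alpha),
    with V_alpha = min{y : F_Y(y) >= alpha} and beta = F_Y(V_alpha)."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(values)              # sort outcomes so the CDF is easy to form
    v, p = values[order], probs[order]
    cdf = np.cumsum(p)
    idx = int(np.searchsorted(cdf, alpha))  # first index with F_Y(y) >= alpha
    v_alpha, beta = v[idx], cdf[idx]
    tail = v > v_alpha
    if tail.any():
        e_tail = float(np.dot(v[tail], p[tail]) / p[tail].sum())  # E[Y | Y > V_alpha]
    else:
        e_tail = float(v_alpha)             # no probability mass strictly above V_alpha
    return ((1 - beta) * e_tail + (beta - alpha) * v_alpha) / (1 - alpha)

# Example: a cost of 1, 2, or 10 with nominal probabilities 0.5, 0.4, 0.1.
print(cte([1.0, 2.0, 10.0], [0.5, 0.4, 0.1], alpha=0.8))  # 6.0: mean of the worst 20% of outcomes
```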
35 For a continuous Y, or more generally unless there is a probability mass at Vα, we have CTEα[Y] = E[Y | Y > Vα]. [sent-67, score-0.09]
36 Specifically, let C_i^π be the cost incurred at the i-th epoch with policy π, so that C̃(π) = C_0^π + · · · + C_{N−1}^π. [sent-69, score-0.203]
37 What (9) suggests is that the ICTE of the cumulative cost given Si can be represented by the cost already accumulated, C_0^π + · · · + C_{i−1}^π, plus the maximum possible expected value of the sum of the cost incurred at Si and the ICTE of the cumulative cost to be incurred from the (i + 1)st epoch. [sent-71, score-0.593]
38 When the immediate cost from a state is deterministic given that state and the action from that state, the risk-sensitive MDP with the objective (3) is equivalent to the robust MDP with the objective (2). [sent-74, score-0.537]
39 Throughout, we say that a risk-sensitive MDP is equivalent to a robust MDP if the two MDPs have a common state space, and, regardless of the values of the parameters of the MDPs, the optimal action for one MDP coincides with that for the other for every state. [sent-75, score-0.253]
40 2.2 Relation between cost uncertainty and risk-sensitivity In addition to the transition probabilities, we now assume that the probability distribution of the cost has uncertainty. [sent-77, score-0.314]
41 Because the uncertainty sets, (1) and (10), are both convex, the existing technique [11] can still be used to efficiently find the optimal policy with respect to min_π max_{p∈Up, f∈Uf} E_{p,f}[C̃(π)]. (11) (Footnote: CTEα[Y + b] = CTEα[Y] + b for a random Y and a deterministic b.) [sent-79, score-0.362]
42 Again, our key finding is that there is a risk-sensitive MDP that is equivalent to the robust MDP having the objective (11). [sent-83, score-0.239]
43 To define the objective of the equivalent risk-sensitive MDP, let D(s, a) ≡ CTEα[C(s, a)] and let D̃(π) be the cumulative value of D(s, a) along the sequence of (s, a) with a policy π. [sent-84, score-0.337]
44 Then the objective of the equivalent risk-sensitive MDP is given by min_π ICTE(N)[D̃(π)]. (12) [sent-85, score-0.144]
45 The risk-sensitive MDP with the objective (12) is equivalent to the robust MDP with the objective (11). [sent-88, score-0.306]
46 2.3 General case of coherent risk measures The robust MDPs considered in Sections 2.1 and 2.2 are special cases. [sent-90, score-0.43]
47 We now introduce a broader class of robust MDPs and equivalent risk-sensitive MDPs. [sent-93, score-0.172]
48 To define the broader class of robust MDPs, we study the uncertainty set of (1) and (10) in more detail. [sent-94, score-0.233]
49 Given a random variable that takes value vi with nominal probability pi for i = 1, . . . , m, [sent-95, score-0.166]
50 a step of finding the optimal robust policy calculates the maximum possible expected value: max_q q1 v1 + · · · + qm vm subject to 0 ≤ qi ≤ pi/α, ∀i = 1, . . . , m, and q1 + · · · + qm = 1. (13) [sent-98, score-0.421]
51 Relaxing the constraints in (13), we obtain the following optimization problem, whose optimal solution is still given by (14): max_q q1 v1 + · · · + qm vm subject to relaxed constraints, specified in (15) with a concave function g. [sent-115, score-0.131]
52 Otherwise (specifically, when 1 − pm < α ≤ 1), the uncertainty set can become too small, since qi ≤ pi/α for all i. [sent-127, score-0.149]
53 With an appropriate g, we can consider a sufficiently large uncertainty set for the pessimistic cases (e. [sent-130, score-0.1]
54 To formally define the uncertainty set for p(s′|s, a), s ∈ Sn and a ∈ A(s), with the concave g, let Qp/p0(·) denote the quantile function of a random variable that takes value p(s′|s, a)/p0(s′|s, a) with probability p0(s′|s, a) for s′ ∈ Sn+1. [sent-135, score-0.175]
55 Then p(s′|s, a) and f(x|s, a) are in the uncertainty set iff we have, for 0 < t < 1, that ∫_{1−t}^{1} Qp/p0(u) du ≤ g(t) and ∫_{1−t}^{1} Qf/f0(u) du ≤ g(t). (16) [sent-137, score-0.1]
56 Now (7) suggests that the expectation with respect to the q illustrated in Figure 1(a) is the CTE with parameter α with respect to the corresponding p. [sent-138, score-0.118]
57 It can be shown that the expectation with respect to the q illustrated in Figure 1(b) is a coherent risk measure, CRM, of the following form [9]: CRMH[Y] = ∫_0^1 CTEα[Y] dH(α), (17) for a nondecreasing function H such that H(0) = 0 and H(1) = 1, where Y denotes a generic random variable. [sent-139, score-0.357]
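As a hedged illustration of (17): when H is a step function placing nonnegative weights r1, . . . , rk (summing to one) on levels α1, . . . , αk, the coherent risk measure reduces to the finite mixture CRMH[Y] = Σj rj CTEαj[Y], matching the two-level combination mentioned later for Figure 1(c). The sketch reuses the cte function from the earlier example; the specific levels and weights are invented.

```python
def crm(values, probs, levels, weights):
    """CRM_H[Y] = sum_j weights[j] * CTE_{levels[j]}[Y], i.e., (17) with H a step function."""
    return sum(w * cte(values, probs, a) for a, w in zip(levels, weights))

# Example: a 50/50 mixture of (almost) plain expectation and the CTE at level 0.9.
print(crm([1.0, 2.0, 10.0], [0.5, 0.4, 0.1], levels=[1e-9, 0.9], weights=[0.5, 0.5]))
```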
58 Let K(s, a) ≡ CRMH[C(s, a)] and let K̃(π) be the cumulative value of K(s, a) along the sequence of (s, a) with a policy π. [sent-142, score-0.121]
59 We define an iterated coherent risk measure (ICRM) of K̃(π) as follows: ICRMH^(N−i+1)[K̃(π)] ≡ CRMH[ ICRMH^(N−i)[K̃(π) | Si] ], for i = 1, . [sent-143, score-0.491]
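To make the recursive ICRM definition concrete, the sketch below evaluates an iterated risk measure on a tiny two-stage example: CRMH is first applied to the cost distribution at each second-stage state, and then applied again at the root over the resulting state values. This is only a schematic illustration of the iteration, reusing crm from the previous sketch; the tree, probabilities, and costs are invented.

```python
# Second-stage states: (cost values, nominal probabilities) of the remaining cost.
stage2 = {
    "s_a": ([0.0, 4.0], [0.9, 0.1]),
    "s_b": ([1.0, 3.0], [0.5, 0.5]),
}
root_states = ["s_a", "s_b"]
root_probs = [0.7, 0.3]                # nominal transition probabilities from the root
levels, weights = [1e-9, 0.9], [0.5, 0.5]

# Inner application of CRM_H: one risk value per second-stage state.
inner = [crm(*stage2[s], levels, weights) for s in root_states]
# Outer application of CRM_H over the (random) inner values: the iterated risk measure at the root.
iterated = crm(inner, root_probs, levels, weights)
print(inner, iterated)
```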
60 This risk-sensitive MDP, whose objective is given in (19), is equivalent to the robust MDP with the objective (11) if dg(t)/dt = ∫_t^1 (1/s) dH(s) for 0 < t < 1. [sent-150, score-0.263]
61 The expectation with respect to the q illustrated in Figure 1(c) can be represented by r1 CRMα1 [·] + r2 CRMα2 [·] with respect to the corresponding p. [sent-152, score-0.118]
62 Notice that Bellman’s optimality equations are satisfied both for the robust MDP and for the risk-sensitive MDP under consideration. [sent-156, score-0.193]
63 For the robust MDP, Bellman’s optimality equations are established in [11]. [sent-157, score-0.193]
64 For our risk-sensitive MDP, note that the coherent risk measure satisfies strong monotonicity, translation-invariance, and positive homogeneity that are used to establish Bellman’s optimality equations in [13]. [sent-158, score-0.389]
65 A difference between the risk-sensitive MDP in [13] and our risk-sensitive MDP is that the former minimizes the value of an iterated risk measure for C̃, while the latter minimizes the value of an iterated risk measure (specifically, ICRMH^(0)) for K̃. [sent-159, score-0.804]
66 The equivalence between our risk-sensitive MDP and our robust MDP can thus be established by showing that the two sets of Bellman’s optimality equations are equivalent. [sent-161, score-0.218]
67 For s ∈ Sn, Bellman’s optimality equation for our robust MDP is v(s) = min_{a∈A(s)} max_{p∈Up, f∈Uf} [ Σ_{x∈X(s,a)} x f(x|s, a) + Σ_{s′∈Sn+1} v(s′) p(s′|s, a) ], (21) where v(s) denotes the value function representing the worst-case expected cumulative cost from s. [sent-162, score-0.456]
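A minimal sketch of one backup of (21) for a single state, with a deterministic immediate cost for brevity. The inner maximization over the transition probabilities is a box-constrained linear program of the form (13); it is solved greedily by pushing as much probability as allowed onto the successors with the largest continuation values (the standard solution of such a problem, standing in for (14), which we do not reproduce). The upper-bound rule p ≤ p0/α follows the reconstruction of (1) above, and the states, costs, and probabilities are invented.

```python
import numpy as np

def worst_case_expectation(v_next, p0, alpha):
    """max_q sum_i q_i v_i  subject to  0 <= q_i <= p0_i / alpha  and  sum_i q_i = 1."""
    cap = p0 / alpha
    q = np.zeros_like(p0)
    remaining = 1.0
    for i in np.argsort(-v_next):      # fill the upper bounds, worst successors first
        q[i] = min(cap[i], remaining)
        remaining -= q[i]
    return float(np.dot(q, v_next))

def robust_backup(actions, alpha):
    """One application of (21): v(s) = min_a [ c(s,a) + max_{p in U_p} sum_{s'} v(s') p(s'|s,a) ]."""
    return min(
        cost + worst_case_expectation(np.asarray(v_next), np.asarray(p0), alpha)
        for cost, p0, v_next in actions
    )

# Two actions from the current state: (deterministic cost, nominal transition probs, successor values).
actions = [
    (1.0, [0.8, 0.2], [2.0, 10.0]),
    (2.0, [0.5, 0.5], [3.0, 4.0]),
]
print(robust_backup(actions, alpha=0.5))   # 6.0
```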
68 By (30) and (31), we have v(s) = min_{a∈A(s)} ( CRMH[C0(s, a)] + CRMH[V(s, a)] ), (32) where the first CRMH is with respect to f0(·|s, a); the second is with respect to p0(·|s, a). [sent-186, score-0.092]
69 Because f0(·|s, a) is independent of the state at n + 1, the translation invariance of CRMH implies v(s) = min_{a∈A(s)} CRMH[ CRMH[C0(s, a)] + V(s, a) ], (33) where the inner CRMH is with respect to f0(·|s, a); the outer is with respect to p0(·|s, a). [sent-187, score-0.175]
70 3 Robust representations of expected exponential utilities In this section, we study risk-sensitive MDPs whose objectives are defined with expected exponential utilities. [sent-189, score-0.098]
71 We will see that there are robust MDPs that are equivalent to these risk-sensitive MDPs. [sent-190, score-0.172]
72 We start with the standard risk-sensitive MDP [5, 7, 8, 15, 18] whose objective is to minimize E[exp(γ C̃(π))] for γ > 0. [sent-191, score-0.097]
73 Because γ > 0, minimizing E[exp(γ C̃(π))] is equivalent to minimizing an entropic risk measure (ERM) [6, 13]: ERMγ[C̃(π)] ≡ (1/γ) ln E[exp(γ C̃(π))]. [sent-192, score-0.323]
74 It is now evident that the risk-sensitive MDP with the objective of minimizing E[exp(γ C̃(π))] is equivalent to a “robust” MDP with the objective of minimizing Eq[C̃(π)] − (1/γ) KL(q||q0) for the worst choice of q ∈ P(q0), where q0 denotes the probability mass function for C̃(π). [sent-199, score-0.39]
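The equivalence stated here is the standard duality between the entropic risk measure and a KL-penalized worst-case expectation: ERMγ[Y] = max_q { Eq[Y] − (1/γ) KL(q||q0) }, with the maximizing q proportional to q0(y)·exp(γy). The sketch below checks this numerically on a small invented distribution; it is an illustration of the duality, not code from the paper.

```python
import numpy as np

def erm(values, q0, gamma):
    """Entropic risk measure ERM_gamma[Y] = (1/gamma) * ln E_{q0}[exp(gamma * Y)]."""
    return float(np.log(np.dot(q0, np.exp(gamma * values))) / gamma)

values = np.array([1.0, 2.0, 10.0])
q0 = np.array([0.5, 0.4, 0.1])
gamma = 0.3

# Maximizer of the KL-penalized problem: the exponential tilt of the nominal distribution.
q_star = q0 * np.exp(gamma * values)
q_star /= q_star.sum()

kl = float(np.dot(q_star, np.log(q_star / q0)))
penalized = float(np.dot(q_star, values)) - kl / gamma

print(erm(values, q0, gamma), penalized)   # the two quantities coincide
```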
75 Here, the uncertainty is in the distribution of the cumulative cost, and it is nontrivial how this uncertainty is related to the uncertainty in the parameters, p and f , of the MDP. [sent-200, score-0.421]
76 Our goal is to explicitly relate the risk-sensitive MDP of minimizing E[exp(γ C̃(π))] to uncertainties in the parameters of the MDP. [sent-201, score-0.086]
77 For the moment, we assume that C(s, a) has no uncertainty and is deterministic given s and a, which will be relaxed later. [sent-202, score-0.135]
78 Let p0^π(si+1|si) be the nominal transition probability from si ∈ Si to si+1 ∈ Si+1 for i = 0, . [sent-204, score-0.343]
79 By the translation invariance and the recursiveness of ERM [13], we have ERMγ[C̃(π)] = C_0^π + ERMγ[ C_1^π + ERMγ[ Σ_{i=2}^{N−1} C_i^π | S1 ] ], (35) where the inner ERM is with respect to p^π(·|S1); the outer is with respect to p^π(·|s0). [sent-208, score-0.104]
80 Let f0 (·|s) be the nominal probability mass function for the immediate cost from a state s under a policy π. [sent-219, score-0.463]
81 Consider the risk-sensitive MDP with the following objective: min_π ERMγ[L̃(π)], (38) where L̃(π) is the cumulative value of L(s, a) ≡ ERMγ[C(s, a)] along the sequence of (s, a) with a policy π. [sent-220, score-0.269]
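Because ERM is translation-invariant and recursive, the objective (38) can be evaluated by an ordinary backward dynamic program in which each backup applies ERMγ to the immediate-cost distribution and to the distribution of the successor value, mirroring (35). The sketch below shows one such backup for a single state; it reuses the erm function from the previous sketch, and the actions, costs, and probabilities are invented, so it illustrates why dynamic programming applies rather than reproducing the paper's implementation.

```python
import numpy as np

def erm_backup(actions, gamma):
    """One risk-sensitive backup: v(s) = min_a ( ERM_gamma[C(s,a)] + ERM_gamma[v(S') | s,a] )."""
    best = float("inf")
    for cost_vals, cost_probs, p0, v_next in actions:
        l = erm(np.asarray(cost_vals), np.asarray(cost_probs), gamma)  # L(s,a) = ERM of immediate cost
        cont = erm(np.asarray(v_next), np.asarray(p0), gamma)          # ERM of the successor value
        best = min(best, l + cont)
    return best

# Two actions: (cost values, cost probabilities, nominal transition probs, successor values).
actions = [
    ([1.0, 3.0], [0.9, 0.1], [0.8, 0.2], [2.0, 10.0]),
    ([2.0, 2.0], [0.5, 0.5], [0.5, 0.5], [3.0, 4.0]),
]
print(erm_backup(actions, gamma=0.3))
```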
82 The risk-sensitive MDP with the objective (38) is equivalent to the robust MDP with the following objective, where f^π ∈ P(f0^π) is defined analogously to p^π ∈ P(p0^π) in Theorem 4: min_π max_{p^π∈P(p0^π), f^π∈P(f0^π)} { E_{p^π,f^π}[ Σ_{i=0}^{N−1} C_i^π ] − (1/γ) E_{p^π}[ Σ_{i=0}^{N−2} KL(p^π(·|Si)||p0^π(·|Si)) + Σ_{i=0}^{N−1} KL(f^π(·|Si)||f0^π(·|Si)) ] }. (39) [sent-223, score-0.315]
83 4 Conclusion We have shown relations between risk-sensitive MDPs and robust MDPs. [sent-224, score-0.17]
84 Because ERM is also an iterated risk measure [13], the objectives of the risk-sensitive MDPs studied in this paper are all with respect to some iterated risk measures. [sent-225, score-0.79]
85 The significance of iterated risk measures is discussed intensively in [13]; in short, an iterated risk measure can represent preferences that cannot be represented by the standard expected exponential utility, and yet it allows efficient optimization and consistent decision making. [sent-226, score-0.558]
86 While the prior work [13, 14, 19] minimizes the iterated risk measure of the cumulative cost (C̃(π) in Section 2), our study on the relation to a robust MDP suggests that one might want to minimize the iterated risk measure of the sum of the values of risk measures for immediate costs (e.g., [sent-227, score-1.428]
87 K̃(π) in Section 2.3 or L̃(π) in Section 3), because the latter is related to the robustness against uncertainty in cost. [sent-230, score-0.126]
88 This means that the optimal policy for the robust MDP studied in this paper can be found quite efficiently. [sent-232, score-0.266]
89 In particular, the robust MDP in Theorem 5 might not seem to allow efficient optimization without knowledge of its relation to the corresponding risk-sensitive MDP, whose optimal policy is readily available with dynamic programming. [sent-233, score-0.313]
90 Overall, the relation to a robust MDP can provide strong motivation for the corresponding risk-sensitive MDP and vice versa. [sent-234, score-0.16]
91 For simplicity, the uncertainty sets in Section 2 are characterized by a single parameter, α, or a single function, g, but it is trivial to extend our results to the cases where the uncertainty sets are defined differently depending on the particular states, actions, and other elements of the MDP. [sent-235, score-0.22]
92 In such cases, the objective of the corresponding risk-sensitive MDP is composed of various risk measures. [sent-236, score-0.244]
93 The uncertainty set in Section 3 depends only on the support of the nominal probability mass function. [sent-237, score-0.315]
94 The penalty for the deviation from the nominal value can be adjusted with a single parameter, γ, but it is also trivial to extend our results to the cases where this parameter varies depending on the particular elements of the MDP. [sent-238, score-0.125]
95 In such cases, the objective of the corresponding risk-sensitive MDP is an iterated risk measure composed of ERM having varying parameters. [sent-239, score-0.469]
96 It would also be an interesting direction to extend our results to convex risk measures, which allow robust representations. [sent-240, score-0.31]
97 Robust control of Markov decision processes with uncertain transition matrices. [sent-304, score-0.193]
98 Robustness in Markov decision problems with uncertain transition matrices. [sent-310, score-0.173]
99 Iterated risk measures for risk-sensitive Markov decision processes with discounted cost. [sent-318, score-0.303]
100 On terminating Markov decision processes with a risk-averse objective function. [sent-328, score-0.162]
wordName wordTfidf (topN-words)
[('mdp', 0.58), ('icte', 0.316), ('crmh', 0.3), ('cte', 0.283), ('erm', 0.216), ('iterated', 0.184), ('risk', 0.177), ('si', 0.142), ('uf', 0.133), ('robust', 0.133), ('nominal', 0.125), ('cumulative', 0.121), ('sn', 0.119), ('policy', 0.11), ('uncertainty', 0.1), ('mdps', 0.09), ('coherent', 0.089), ('ri', 0.085), ('icrmh', 0.083), ('dh', 0.081), ('decision', 0.075), ('cost', 0.069), ('mass', 0.069), ('objective', 0.067), ('bellman', 0.061), ('transition', 0.055), ('qm', 0.054), ('ci', 0.053), ('osogami', 0.05), ('unveiled', 0.05), ('worst', 0.045), ('uncertainties', 0.045), ('uncertain', 0.043), ('utility', 0.042), ('measure', 0.041), ('minimizing', 0.041), ('crm', 0.041), ('expectation', 0.041), ('equivalent', 0.039), ('ep', 0.039), ('optimality', 0.039), ('min', 0.038), ('kl', 0.038), ('analogously', 0.038), ('relations', 0.037), ('immediate', 0.036), ('deterministic', 0.035), ('markov', 0.034), ('state', 0.033), ('riskiness', 0.033), ('concave', 0.032), ('eq', 0.032), ('measures', 0.031), ('dg', 0.03), ('tokyo', 0.03), ('minimize', 0.03), ('hoboken', 0.029), ('fy', 0.029), ('max', 0.029), ('qi', 0.029), ('relation', 0.027), ('expected', 0.027), ('respect', 0.027), ('nilim', 0.027), ('equality', 0.027), ('robustness', 0.026), ('outer', 0.026), ('cally', 0.026), ('evaluates', 0.025), ('delage', 0.025), ('entropic', 0.025), ('equivalence', 0.025), ('action', 0.025), ('vm', 0.025), ('theorem', 0.024), ('exchanging', 0.024), ('incurred', 0.024), ('dt', 0.024), ('translation', 0.024), ('qf', 0.023), ('optimal', 0.023), ('tail', 0.023), ('illustrated', 0.023), ('editors', 0.022), ('induction', 0.022), ('quantile', 0.022), ('maximizer', 0.022), ('establish', 0.022), ('exponential', 0.022), ('percentile', 0.022), ('equations', 0.021), ('programming', 0.021), ('probability', 0.021), ('processes', 0.02), ('recursively', 0.02), ('dynamic', 0.02), ('characterized', 0.02), ('pi', 0.02), ('edition', 0.02), ('optimistic', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes
Author: Takayuki Osogami
Abstract: We uncover relations between robust MDPs and risk-sensitive MDPs. The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. We also show that a risk-sensitive MDP of minimizing an iterated risk measure that is composed of certain coherent risk measures is equivalent to a robust MDP of minimizing the worst-case expectation when the possible deviations of uncertain parameters from their nominal values are characterized with a concave function. 1
2 0.30997759 348 nips-2012-Tractable Objectives for Robust Policy Optimization
Author: Katherine Chen, Michael Bowling
Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1
3 0.16336142 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds
Author: Teodor M. Moldovan, Pieter Abbeel
Abstract: The expected return is a widely used objective in decision making under uncertainty. Many algorithms, such as value iteration, have been proposed to optimize it. In risk-aware settings, however, the expected return is often not an appropriate objective to optimize. We propose a new optimization objective for risk-aware planning and show that it has desirable theoretical properties. We also draw connections to previously proposed objectives for risk-aware planing: minmax, exponential utility, percentile and mean minus variance. Our method applies to an extended class of Markov decision processes: we allow costs to be stochastic as long as they are bounded. Additionally, we present an efficient algorithm for optimizing the proposed objective. Synthetic and real-world experiments illustrate the effectiveness of our method, at scale. 1
4 0.1389816 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
Author: Arthur Guez, David Silver, Peter Dayan
Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayesoptimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration. 1
Author: Amy Greenwald, Jiacui Li, Eric Sodomka
Abstract: In many large economic markets, goods are sold through sequential auctions. Examples include eBay, online ad auctions, wireless spectrum auctions, and the Dutch flower auctions. In this paper, we combine methods from game theory and decision theory to search for approximate equilibria in sequential auction domains, in which bidders do not know their opponents’ values for goods, bidders only partially observe the actions of their opponents’, and bidders demand multiple goods. We restrict attention to two-phased strategies: first predict (i.e., learn); second, optimize. We use best-reply dynamics [4] for prediction (i.e., to predict other bidders’ strategies), and then assuming fixed other-bidder strategies, we estimate and solve the ensuing Markov decision processes (MDP) [18] for optimization. We exploit auction properties to represent the MDP in a more compact state space, and we use Monte Carlo simulation to make estimating the MDP tractable. We show how equilibria found using our search procedure compare to known equilibria for simpler auction domains, and we approximate an equilibrium for a more complex auction domain where analytical solutions are unknown. 1
6 0.12596634 38 nips-2012-Algorithms for Learning Markov Field Policies
7 0.1242004 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
8 0.10410042 271 nips-2012-Pointwise Tracking the Optimal Regression Function
9 0.10114289 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
10 0.10014346 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
11 0.098806739 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization
12 0.097190753 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning
13 0.091284126 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress
14 0.085464075 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
15 0.081887521 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries
16 0.081508353 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task
17 0.081488729 292 nips-2012-Regularized Off-Policy TD-Learning
18 0.079854392 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
19 0.074637882 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method
20 0.074378513 16 nips-2012-A Polynomial-time Form of Robust Regression
topicId topicWeight
[(0, 0.145), (1, -0.252), (2, 0.028), (3, -0.057), (4, -0.014), (5, 0.018), (6, -0.004), (7, 0.042), (8, -0.0), (9, -0.005), (10, 0.028), (11, 0.004), (12, -0.045), (13, 0.005), (14, -0.011), (15, -0.09), (16, 0.016), (17, -0.017), (18, 0.036), (19, 0.015), (20, -0.036), (21, 0.007), (22, 0.052), (23, 0.064), (24, -0.01), (25, -0.014), (26, 0.074), (27, 0.082), (28, -0.098), (29, 0.174), (30, -0.045), (31, -0.071), (32, -0.2), (33, 0.069), (34, 0.019), (35, 0.077), (36, -0.101), (37, 0.139), (38, 0.089), (39, 0.054), (40, -0.058), (41, 0.082), (42, 0.052), (43, -0.076), (44, -0.044), (45, -0.045), (46, -0.096), (47, 0.013), (48, -0.11), (49, -0.002)]
simIndex simValue paperId paperTitle
same-paper 1 0.97111851 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes
Author: Takayuki Osogami
Abstract: We uncover relations between robust MDPs and risk-sensitive MDPs. The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. We also show that a risk-sensitive MDP of minimizing an iterated risk measure that is composed of certain coherent risk measures is equivalent to a robust MDP of minimizing the worst-case expectation when the possible deviations of uncertain parameters from their nominal values are characterized with a concave function. 1
2 0.83819854 348 nips-2012-Tractable Objectives for Robust Policy Optimization
Author: Katherine Chen, Michael Bowling
Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1
3 0.82913303 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds
Author: Teodor M. Moldovan, Pieter Abbeel
Abstract: The expected return is a widely used objective in decision making under uncertainty. Many algorithms, such as value iteration, have been proposed to optimize it. In risk-aware settings, however, the expected return is often not an appropriate objective to optimize. We propose a new optimization objective for risk-aware planning and show that it has desirable theoretical properties. We also draw connections to previously proposed objectives for risk-aware planing: minmax, exponential utility, percentile and mean minus variance. Our method applies to an extended class of Markov decision processes: we allow costs to be stochastic as long as they are bounded. Additionally, we present an efficient algorithm for optimizing the proposed objective. Synthetic and real-world experiments illustrate the effectiveness of our method, at scale. 1
4 0.73177516 45 nips-2012-Approximating Equilibria in Sequential Auctions with Incomplete Information and Multi-Unit Demand
Author: Amy Greenwald, Jiacui Li, Eric Sodomka
Abstract: In many large economic markets, goods are sold through sequential auctions. Examples include eBay, online ad auctions, wireless spectrum auctions, and the Dutch flower auctions. In this paper, we combine methods from game theory and decision theory to search for approximate equilibria in sequential auction domains, in which bidders do not know their opponents’ values for goods, bidders only partially observe the actions of their opponents’, and bidders demand multiple goods. We restrict attention to two-phased strategies: first predict (i.e., learn); second, optimize. We use best-reply dynamics [4] for prediction (i.e., to predict other bidders’ strategies), and then assuming fixed other-bidder strategies, we estimate and solve the ensuing Markov decision processes (MDP) [18] for optimization. We exploit auction properties to represent the MDP in a more compact state space, and we use Monte Carlo simulation to make estimating the MDP tractable. We show how equilibria found using our search procedure compare to known equilibria for simpler auction domains, and we approximate an equilibrium for a more complex auction domain where analytical solutions are unknown. 1
5 0.65807289 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization
Author: Doina Precup, Joelle Pineau, Andre S. Barreto
Abstract: Kernel-based stochastic factorization (KBSF) is an algorithm for solving reinforcement learning tasks with continuous state spaces which builds a Markov decision process (MDP) based on a set of sample transitions. What sets KBSF apart from other kernel-based approaches is the fact that the size of its MDP is independent of the number of transitions, which makes it possible to control the trade-off between the quality of the resulting approximation and the associated computational cost. However, KBSF’s memory usage grows linearly with the number of transitions, precluding its application in scenarios where a large amount of data must be processed. In this paper we show that it is possible to construct KBSF’s MDP in a fully incremental way, thus freeing the space complexity of this algorithm from its dependence on the number of sample transitions. The incremental version of KBSF is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve continuous MDPs in both off-line and on-line regimes. We present theoretical results showing that KBSF can approximate the value function that would be computed by conventional kernel-based learning with arbitrary precision. We empirically demonstrate the effectiveness of the proposed algorithm in the challenging threepole balancing task, in which the ability to process a large number of transitions is crucial for success. 1
7 0.58413428 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning
8 0.54765403 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress
9 0.54696107 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method
10 0.54672199 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
11 0.54179043 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning
12 0.48545188 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
13 0.46361384 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions
14 0.43955794 38 nips-2012-Algorithms for Learning Markov Field Policies
15 0.4387385 358 nips-2012-Value Pursuit Iteration
16 0.43723729 364 nips-2012-Weighted Likelihood Policy Search with Model Selection
17 0.38314936 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task
18 0.37601992 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
19 0.3726503 51 nips-2012-Bayesian Hierarchical Reinforcement Learning
20 0.36964712 161 nips-2012-Interpreting prediction markets: a stochastic approach
topicId topicWeight
[(0, 0.014), (21, 0.021), (35, 0.01), (38, 0.121), (39, 0.014), (42, 0.035), (54, 0.046), (55, 0.017), (71, 0.294), (74, 0.041), (76, 0.134), (80, 0.096), (92, 0.049)]
simIndex simValue paperId paperTitle
same-paper 1 0.77677613 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes
Author: Takayuki Osogami
Abstract: We uncover relations between robust MDPs and risk-sensitive MDPs. The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. We also show that a risk-sensitive MDP of minimizing an iterated risk measure that is composed of certain coherent risk measures is equivalent to a robust MDP of minimizing the worst-case expectation when the possible deviations of uncertain parameters from their nominal values are characterized with a concave function. 1
2 0.71137673 198 nips-2012-Learning with Target Prior
Author: Zuoguan Wang, Siwei Lyu, Gerwin Schalk, Qiang Ji
Abstract: In the conventional approaches for supervised parametric learning, relations between data and target variables are provided through training sets consisting of pairs of corresponded data and target variables. In this work, we describe a new learning scheme for parametric learning, in which the target variables y can be modeled with a prior model p(y) and the relations between data and target variables are estimated with p(y) and a set of uncorresponded data X in training. We term this method as learning with target priors (LTP). Specifically, LTP learning seeks parameter θ that maximizes the log likelihood of fθ (X) on a uncorresponded training set with regards to p(y). Compared to the conventional (semi)supervised learning approach, LTP can make efficient use of prior knowledge of the target variables in the form of probabilistic distributions, and thus removes/reduces the reliance on training data in learning. Compared to the Bayesian approach, the learned parametric regressor in LTP can be more efficiently implemented and deployed in tasks where running efficiency is critical. We demonstrate the effectiveness of the proposed approach on parametric regression tasks for BCI signal decoding and pose estimation from video. 1
3 0.67643815 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models
Author: Emanuele Coviello, Gert R. Lanckriet, Antoni B. Chan
Abstract: In this paper, we derive a novel algorithm to cluster hidden Markov models (HMMs) according to their probability distributions. We propose a variational hierarchical EM algorithm that i) clusters a given collection of HMMs into groups of HMMs that are similar, in terms of the distributions they represent, and ii) characterizes each group by a “cluster center”, i.e., a novel HMM that is representative for the group. We illustrate the benefits of the proposed algorithm on hierarchical clustering of motion capture sequences as well as on automatic music tagging. 1
4 0.6580137 179 nips-2012-Learning Manifolds with K-Means and K-Flats
Author: Guillermo Canas, Tomaso Poggio, Lorenzo Rosasco
Abstract: We study the problem of estimating a manifold from random samples. In particular, we consider piecewise constant and piecewise linear estimators induced by k-means and k-flats, and analyze their performance. We extend previous results for k-means in two separate directions. First, we provide new results for k-means reconstruction on manifolds and, secondly, we prove reconstruction bounds for higher-order approximation (k-flats), for which no known results were previously available. While the results for k-means are novel, some of the technical tools are well-established in the literature. In the case of k-flats, both the results and the mathematical tools are new. 1
5 0.62230015 348 nips-2012-Tractable Objectives for Robust Policy Optimization
Author: Katherine Chen, Michael Bowling
Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1
6 0.59703869 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
7 0.59449893 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics
8 0.59396547 292 nips-2012-Regularized Off-Policy TD-Learning
9 0.59393299 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
10 0.59355921 38 nips-2012-Algorithms for Learning Markov Field Policies
11 0.59305406 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning
12 0.59213424 364 nips-2012-Weighted Likelihood Policy Search with Model Selection
13 0.59183747 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
14 0.59041029 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
15 0.58973467 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data
16 0.58947796 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
17 0.5890224 197 nips-2012-Learning with Recursive Perceptual Representations
18 0.58848321 160 nips-2012-Imitation Learning by Coaching
19 0.58847767 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
20 0.58781189 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models