nips nips2012 nips2012-296 knowledge-graph by maker-knowledge-mining

296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds


Source: pdf

Author: Teodor M. Moldovan, Pieter Abbeel

Abstract: The expected return is a widely used objective in decision making under uncertainty. Many algorithms, such as value iteration, have been proposed to optimize it. In risk-aware settings, however, the expected return is often not an appropriate objective to optimize. We propose a new optimization objective for risk-aware planning and show that it has desirable theoretical properties. We also draw connections to previously proposed objectives for risk-aware planning: minmax, exponential utility, percentile and mean minus variance. Our method applies to an extended class of Markov decision processes: we allow costs to be stochastic as long as they are bounded. Additionally, we present an efficient algorithm for optimizing the proposed objective. Synthetic and real-world experiments illustrate the effectiveness of our method, at scale.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The expected return is a widely used objective in decision making under uncertainty. [sent-5, score-0.177]

2 We propose a new optimization objective for risk-aware planning and show that it has desirable theoretical properties. [sent-8, score-0.239]

3 We also draw connections to previously proposed objectives for risk-aware planning: minmax, exponential utility, percentile and mean minus variance. [sent-9, score-0.425]

4 Our method applies to an extended class of Markov decision processes: we allow costs to be stochastic as long as they are bounded. [sent-10, score-0.135]

5 1 Introduction The expected return is often the objective function of choice in planning problems where outcomes not only depend on the actor’s decisions but also on random events. [sent-13, score-0.216]

6 In this setting, we can no longer take advantage of the law of large numbers to ensure that the return is close to its expectation with high probability, so the expected return might not be the best objective to optimize. [sent-20, score-0.195]

7 This is called minmax optimization and is sometimes useful, but, most often, the resulting policies are overly cautious. [sent-22, score-0.564]

8 A more balanced and general approach would include minmax optimization and expectation optimization, corresponding respectively to absolute risk aversion and risk ignorance, but would also allow a spectrum of policies between these extremes. [sent-23, score-1.085]

9 With these risks in mind, would you rather take a route with an expected travel time of 12:21 and no further guarantees, or would you prefer a route that takes less than 16:19 with 99% probability? [sent-27, score-0.209]

10 In the economics literature, this percentile criterion is known as value at risk and has become a widely used measure of risk after the market crash of 1987 [2]. [sent-34, score-0.534]

11 At the same time, the classical method for managing risk in investment is Markowitz portfolio optimization, where the objective is to optimize expectation minus weighted variance. [sent-35, score-0.576]

12 These examples suggest that proper risk-aware planning should allow a trade-off between expectation and variance, and, at the same time, should provide some guarantees about the probability of failure. [sent-36, score-0.189]

13 Risk-aware planning for Markov decision processes (MDPs) is difficult for two main reasons. [sent-37, score-0.227]

14 Both mean minus variance optimization and percentile optimization for MDPs have been shown to be NP-hard in general [3, 4]. [sent-39, score-0.488]

15 Second, it seems to be difficult to find an optimization objective which correctly models our intuition of risk awareness. [sent-41, score-0.318]

16 Even though expectation, variance and percentile levels relate to risk awareness, optimizing them directly can lead to counterintuitive policies as illustrated recently in [3], for the case of mean minus variance optimization, and in the appendix of this paper, for percentile optimization. [sent-42, score-1.088]

17 The minmax objective has been proposed in [5, 6], which propose a dynamic programming algorithm for optimizing it efficiently. [sent-44, score-0.336]

18 A number of methods have been proposed for relaxations of mean minus variance optimization [3, 7]. [sent-46, score-0.267]

19 Percentile optimization has been shown to be tractable when dealing with ambiguity in MDP parameters [8, 9], and it has also been discussed in the context of risk [10, 11]. [sent-47, score-0.266]

20 Our approach is closest to the line of work on exponential utility optimization [12, 13]. [sent-48, score-0.281]

21 This problem can be solved efficiently and the resulting policies conform to our intuition of risk awareness. [sent-49, score-0.476]

22 For an overview of previously used objectives for risk-aware planning in MDPs, see [14, 15]. [sent-51, score-0.146]

23 We observe connections between exponential utility maximization, Chernoff bounds, and cumulant generating functions, which enables formulating a new optimization objective for risk-aware planning. [sent-53, score-0.457]

24 This new objective is essentially a re-parametrization of exponential utility, and inherits both the efficient optimization algorithms and the concordance to intuition about risk awareness. [sent-54, score-0.41]

25 We show that optimizing the proposed objective includes, as limiting cases, both minmax and expectation optimization and allows interpolation between them. [sent-55, score-0.452]

26 Additionally, we provide guarantees at a certain percentile level, and show connections to mean minus variance optimization. [sent-56, score-0.406]

27 Our experiments illustrate the ability of our approach to produce, out of the exponentially many policies available, a family of policies that agrees with the human intuition of varying risk. [sent-59, score-0.566]

28 2 Background and Notation An MDP consists of a state space S, an action space A, state transition dynamics, and a cost function G. [sent-60, score-0.165]

29 Once the player chooses an action at ∈ A, the system transitions stochastically to state st+1 ∈ S, with probability p(st+1 |st , at ), and the player incurs a stochastic cost of Gt (st , at , st+1 ). [sent-62, score-0.211]

30 The classical solution to this problem is to run value iteration, summarized below: q^{t+1}(s, a) := \sum_{s'} p(s'|s, a) \, E[G_t(s, a, s') + j^t(s')], \quad j^t(s) := \min_a q^t(s, a) = \min_\pi E_{s,\pi}[J^t]. We will refer to policies obtained by standard value iteration as expectimin policies. [sent-69, score-0.408]

31 The notation Es,π signifies taking the expectation starting from S0 = s, and following policy π. [sent-72, score-0.164]
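
To make the backup above concrete, here is a minimal value-iteration sketch; the tabular representation with per-action transition matrices `P[a]` and expected immediate costs `C[a]` (cost already averaged over the stochastic G and the next state) is an illustrative simplification of the paper's setting, and all names are hypothetical.

```python
import numpy as np

def expectimin_value_iteration(P, C, horizon):
    """Minimal sketch of the expectimin backup described above (assumes horizon >= 1).

    P[a] is an |S| x |S| transition matrix for action a, and C[a][s] is the
    expected immediate cost of taking a in s, with the stochastic cost and the
    next state already averaged out -- an illustrative simplification.
    """
    n_actions, n_states = len(P), P[0].shape[0]
    j = np.zeros(n_states)                      # j^t(s), zero at the horizon
    for _ in range(horizon):
        # q^{t+1}(s, a) = E[G(s, a, s')] + sum_{s'} p(s'|s, a) j^t(s')
        q = np.stack([C[a] + P[a] @ j for a in range(n_actions)], axis=1)
        j = q.min(axis=1)                       # j^{t+1}(s) = min_a q^{t+1}(s, a)
    policy = q.argmin(axis=1)                   # greedy (expectimin) policy at t = 0
    return j, policy
```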

32 We assume that costs are upper bounded, that is, there exists j_M such that J ≤ j_M almost surely for any start state and any policy, and that the expected costs are finite. [sent-73, score-0.159]

33 If necessary, discounting can be introduced in one of two ways: either by adding a transition from every state, for all actions, to an absorbing “end game” state, with probability γ, or by setting a time-dependent cost as G_t^{new} = γ^t G_t^{old}. [sent-75, score-0.167]

34 Note that these two ways of introducing discounting are equivalent when optimizing the expected cost, but they can differ in the risk-aware setting we are considering. [sent-76, score-0.127]

35 3 The Chernoff Functional as Risk-Aware Objective We propose optimizing the following functional of the cost, which we call the Chernoff functional since it often appears in proving Chernoff bounds: C^{\delta}_{s,\pi}[J] = \inf_{\theta > 0} \{ \theta \log E_{s,\pi}[e^{J/\theta}] - \theta \log(\delta) \}. (1) [sent-78, score-0.28]

36 First, note that the total cost appears in the expression of the Chernoff functional as an exponential utility (E_{s,\pi}[e^{J/\theta}]). [sent-79, score-0.377]
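
For intuition about Eq. (1), here is a hedged sketch of how the Chernoff functional could be estimated for one fixed start state and policy from Monte Carlo samples of the total cost J; the sample average in place of the expectation, the search over log θ, and the SciPy one-dimensional minimizer are illustrative choices, not part of the paper's algorithm.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_functional(cost_samples, delta):
    """Sketch: estimate C^delta[J] of Eq. (1) from samples of the total cost J
    under a fixed start state and policy. The expectation is replaced by a
    Monte Carlo average, so this is only an illustrative estimate."""
    J = np.asarray(cost_samples, dtype=float)

    def objective(log_theta):
        theta = np.exp(log_theta)               # enforce theta > 0
        # theta * log E[exp(J/theta)] - theta * log(delta), via log-sum-exp
        m = (J / theta).max()
        log_mgf = m + np.log(np.mean(np.exp(J / theta - m)))
        return theta * log_mgf - theta * np.log(delta)

    res = minimize_scalar(objective, bounds=(-10.0, 10.0), method="bounded")
    return res.fun, np.exp(res.x)               # value and minimizing theta
```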

37 This shows that there is a strong connection between our method and exponential utility optimization. [sent-80, score-0.249]

38 Specifically, all policies proposed by our algorithm, including the final solution, are optimal policies with respect to the exponential utility for some parameter. [sent-81, score-0.774]

39 These policies are known to show risk-awareness in practice [12, 13], and our method inherits this property. [sent-82, score-0.31]

40 In some sense, our proposed objective is a re-parametrization of exponential utility, which was obtained through its connections to Chernoff bounds and cumulant generating functions. [sent-83, score-0.241]

41 The theorem below, which is one of the main contributions of this paper, provides more reasons for optimizing the Chernoff functional in the risk-aware setting. [sent-84, score-0.16]

42 Then, the Chernoff functional of this random variable, C^δ[J], is well defined and has the following properties: (i) P(J ≥ C^δ[J]) ≤ δ; (ii) C^1[J] = lim_{θ→∞} θ log E[e^{J/θ}] = E[J]; (iii) C^0[J] := lim_{δ→0} C^δ[J] = lim_{θ→0} θ log E[e^{J/θ}] = sup{j : P(J ≥ j) > 0} < ∞. [sent-88, score-0.156]
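
A quick numerical sanity check of properties (ii) and (iii), with an arbitrary bounded cost distribution standing in for J (purely illustrative):

```python
import numpy as np

# theta * log E[exp(J/theta)] tends to E[J] as theta grows (property ii) and to
# the essential supremum of J as theta shrinks (property iii); here the
# expectation and supremum are approximated by a sample mean and maximum.
rng = np.random.default_rng(0)
J = rng.uniform(0.0, 10.0, size=100_000)        # bounded costs, chosen arbitrarily

def scaled_cgf(theta):
    m = (J / theta).max()                       # log-sum-exp for stability
    return theta * (m + np.log(np.mean(np.exp(J / theta - m))))

print(scaled_cgf(1e4), J.mean())                # close for large theta
print(scaled_cgf(1e-2), J.max())                # close for small theta
```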

43 Properties (ii), (iii), (v) and (vi) follow from the following properties of the cumulant generating function, log E[e^{zJ}] [18]: (a) log E[e^{zJ}] = \sum_{i=1}^{∞} z^i k_i / i! [sent-94, score-0.164]

44 Properties (ii) and (iii) show that we can use the δ parameter to interpolate between the nominal policy, which ignores risk, at δ = 1, and the minmax policy, which corresponds to extreme risk aversion, at δ = 0. [sent-101, score-0.428]

45 Property (i) shows that the value of the Chernoff functional is with probability at least 1 − δ an upper bound on the cost obtained by following the corresponding Chernoff policy. [sent-102, score-0.142]

46 These two observations suggest that by tuning δ from 0 to 1 we can find a family of risk-aware policies, in order of risk aversion. [sent-103, score-0.193]

47 Property (i) shows a connection between our approach and percentile optimization. [sent-105, score-0.189]

48 Properties (iv) and (v) show a connection between optimizing the Chernoff functional and mean minus variance optimization, which has been proposed before for risk-aware planning, but was found to be intractable in general [3]. [sent-107, score-0.395]

49 Via property (v), we can optimize mean minus variance with a low weight on variance if we set δ close to 1. [sent-108, score-0.277]

50 Property (iv) shows that we can optimize mean minus scaled standard deviation exactly if the total cost is Gaussian. [sent-110, score-0.27]
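
As a worked sketch of property (iv): if the total cost J is Gaussian with mean μ and variance σ², the scaled cumulant generating function has a closed form and the infimum in Eq. (1) can be carried out exactly; in reward terms (reward = −cost) this is mean minus a δ-dependent multiple of the standard deviation.

```latex
% Worked sketch of property (iv) for a Gaussian total cost J ~ N(mu, sigma^2).
\begin{align*}
\theta \log \mathbb{E}\!\left[e^{J/\theta}\right]
  &= \mu + \frac{\sigma^{2}}{2\theta}, \\
C^{\delta}[J]
  &= \inf_{\theta > 0}\Big\{\, \mu + \frac{\sigma^{2}}{2\theta} - \theta \log \delta \,\Big\}
   = \mu + \sigma\sqrt{2\log(1/\delta)},
  \qquad \theta^{*} = \frac{\sigma}{\sqrt{2\log(1/\delta)}}.
\end{align*}
```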

51 Our formulation ties into mean minus standard deviation optimization, which is of consistent dimensionality, unlike the classical mean minus variance objective. [sent-116, score-0.374]

52 The inner optimization problem, the optimization over policies π, is simply exponential utility optimization, a classical problem that can be solved efficiently. [sent-118, score-0.637]
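
Since the inner problem is classical exponential utility optimization, a minimal sketch of the corresponding risk-sensitive value iteration for a fixed θ > 0 is shown below; the tabular MDP with deterministic per-step costs is an illustrative simplification of the paper's stochastic bounded costs, and the function and argument names are assumptions.

```python
import numpy as np

def exponential_utility_vi(P, C, horizon, theta):
    """Sketch of exponential-utility value iteration for a fixed theta > 0,
    computing f(theta) = min_pi theta * log E[exp(J/theta)] per start state.
    P[a] is an |S| x |S| transition matrix and C[a][s] a deterministic cost,
    an illustrative simplification (assumes horizon >= 1)."""
    n_actions, n_states = len(P), P[0].shape[0]
    u = np.zeros(n_states)          # u^t(s) = theta * log E[exp(J^t / theta)]
    for _ in range(horizon):
        m = u.max() / theta         # shift for numerical stability
        q = np.stack(
            [C[a] + theta * (m + np.log(P[a] @ np.exp(u / theta - m)))
             for a in range(n_actions)],
            axis=1)
        u = q.min(axis=1)           # risk-sensitive Bellman backup
    return u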

53 Based on Theorems 1 and 2 (below), we propose a method for finding the policy that minimizes the Chernoff functional, to precision ε, with worst-case time complexity O(h|S|^2|A|/ε). [sent-123, score-0.154]

54 In practice we might want to run the algorithm for more than one setting of δ to find policies for the same planning task at different levels of risk aversion, say at n different levels. [sent-127, score-0.59]

55 f̂[0] ← f(0) (minimax cost of the MDP); f̂[∞] ← f(∞) (expectimin cost of the MDP); for θ ∈ {1, 10, 100, · · · }, until f̂[∞] − f̂[θ] < ε, do f̂[θ] ← f(θ) (exponential utility optimization, to find an upper bound); for θ ∈ {1, 0. [sent-129, score-0.576]

56 Properties (ii) and (iii) of Theorem 1 imply that f(0) can be computed by minimax optimization and f(∞) can be computed by value iteration (expectimin optimization), which both have the same time complexity as exponential utility optimization: O(h|S|^2|A|). [sent-133, score-0.335]

57 Now, for any given risk level, δ, we will find θ∗ that minimizes the objective, f (θ) − θ log(δ), among those θ where we have already evaluated f . [sent-140, score-0.193]

58 We stop iterating when the function value at the new point is less than ε away from the function value at θ∗ , and return the corresponding optimal exponential utility policy. [sent-142, score-0.258]
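
The outer search over θ can be sketched as follows; here `f` stands for any routine (such as the risk-sensitive value iteration above) that returns the optimal exponential utility value f(θ) for the start state, and the geometric grid, neighbor refinement, and stopping rule are simplified stand-ins for the schedule described in the text, not the paper's exact procedure.

```python
import numpy as np

def chernoff_theta_search(f, delta, eps, theta_grid=None):
    """Sketch of the outer minimization of f(theta) - theta*log(delta) over
    theta > 0, reusing cached evaluations of f (each one an exponential
    utility optimization). Illustrative grid and refinement only."""
    if theta_grid is None:
        theta_grid = np.logspace(-3, 4, 30)     # geometric grid over theta > 0
    cache = {theta: f(theta) for theta in theta_grid}

    def objective(theta):
        return cache[theta] - theta * np.log(delta)

    best = min(cache, key=objective)
    while True:
        # refine around the current minimizer by adding geometric neighbors
        for t in (best / 2.0, best * 2.0):
            if t not in cache:
                cache[t] = f(t)
        new_best = min(cache, key=objective)
        if abs(objective(new_best) - objective(best)) < eps:
            return new_best, objective(new_best)
        best = new_best
```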

59 We can immediately see that f_π(θ) := θ log E_{s,π}[e^{J/θ}] is a convex function since it is the perspective transformation of a convex function, namely, the cumulant generating function of the total cost J. [sent-151, score-0.213]
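
A short sketch of that convexity argument: the perspective (θ, z) ↦ θ g(z/θ) of a convex function g is jointly convex, so evaluating it at z = 1 with g the cumulant generating function of J gives a function convex in θ.

```latex
% Convexity of f_pi via the perspective transform of the CGF (sketch).
\begin{align*}
g(z) &:= \log \mathbb{E}_{s,\pi}\!\left[e^{zJ}\right] \quad \text{is convex in } z, \\
f_{\pi}(\theta) &= \theta \log \mathbb{E}_{s,\pi}\!\left[e^{J/\theta}\right]
                 = \theta\, g\!\left(\tfrac{1}{\theta}\right),
\end{align*}
% i.e. the perspective (theta, z) -> theta * g(z/theta) evaluated at z = 1,
% which is convex in theta > 0.
```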

60 Since we assumed that the costs, J, are upper bounded, there exists a maximum cost j_M such that J ≤ j_M almost surely for any starting state s and any policy π. [sent-154, score-0.218]

61 We have also shown that f_π(θ) ≥ E_{s,π}[J] ≥ j_m := min_π E_{s,π}[J], so we conclude that −(j_M − j_m)/θ ≤ f_π(θ) ≤ 0 for any policy π. [sent-155, score-0.414]

62 [Figure legend residue: path colors correspond to δ ∈ {10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}, 10^{-7}} and δ ∈ {10^{-10}, 10^{-8}}.] Figure 2: Chernoff policies for the Grid World MDP. [sent-163, score-0.283]

63 The colored arrows indicate the most likely paths under Chernoff policies for different values of δ. [sent-165, score-0.283]

64 The minimax policy (δ = 0) acts randomly since it assumes that any action will lead to a trap. [sent-166, score-0.175]

65 One can easily verify that the following n(ε) satisfies this requirement (the detailed derivation appears in the Appendix): n(ε) = (j_M − j_m)/ε log(θ_2/θ_1) + log(θ_2/θ_1). [sent-171, score-0.279]

66 The first experiment models a situation where it is immediately obvious what the family of risk-aware policies should be. [sent-173, score-0.283]

67 We show that optimizing the Chernoff functional with increasing values of δ produces the intuitively correct family of policies. [sent-174, score-0.16]

68 Additionally, we observed that policies at lower risk levels, δ, tend to have lower expectation but also lower variance, if the structure of the problem allows it. [sent-177, score-0.519]

69 [Figure residue: one depicted itinerary is SHV→DFW 12:35–13:30, DFW→MSY 14:25–15:50, MSY→JFK 17:50–21:46, JFK→BQN 23:40–04:20.] (a) Paths under Chernoff policies assuming all flights arrive on time, shown using International Air Transport Association (IATA) airport codes. [sent-218, score-0.499]

70 (b) Cumulative distribution functions of rewards (equals minus cost) under Chernoff policies at different risk levels; x-axis: total reward v in seconds, ×10^4. [sent-219, score-0.209]

71 Figure 3: Chernoff policies to travel from Shreveport Regional Airport (SHV) to Rafael Hernández Airport (BQN) at different risk levels. [sent-223, score-0.621]

72 With this setting, our algorithm performed exponential utility optimization for 97 different parameters when planning for 14 values of the risk level δ. [sent-226, score-0.588]

73 2 Air travel planning The aerial travel planning MDP (Figure 3) illustrates that our method applies to real-world problems at a large scale. [sent-230, score-0.45]

74 It models the problem of buying airplane tickets to travel between two cities, when you care only about reaching the destination in a reliable amount of time. [sent-231, score-0.301]

75 In our implementation, the state space consists of pairs of all airports and times when flights depart from those airports. [sent-234, score-0.138]

76 To keep the horizon low, we introduce enough wait transitions so that it takes no more than 10 transitions to wait a whole day in the busiest airport (about 1000 flights per day) and we set the horizon at 100. [sent-237, score-0.47]
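
To make the state construction concrete, here is a rough sketch of how (airport, departure time) states and capped wait transitions might be assembled; every name and the 144-minute wait cap (about one tenth of a day, matching the "at most 10 transitions to wait a whole day" remark) are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Flight:
    origin: str          # IATA code, e.g. "SHV"
    destination: str     # IATA code, e.g. "DFW"
    departure: int       # minutes since the start of the planning window
    arrival: int

def build_states(flights: List[Flight]) -> List[Tuple[str, int]]:
    """States are (airport, time) pairs at which some flight departs."""
    return sorted({(f.origin, f.departure) for f in flights})

def wait_successor(state: Tuple[str, int], states: List[Tuple[str, int]],
                   max_wait: int = 144) -> Tuple[str, int]:
    """A 'wait' action jumps to the next departure time at the same airport,
    capped at max_wait minutes so a whole day costs only a handful of steps."""
    airport, t = state
    later = [s for s in states if s[0] == airport and t < s[1] <= t + max_wait]
    return later[0] if later else state
```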

77 Consequently, we first clustered together all flights with the same origin and destination that were scheduled to depart within 15 minutes of each other, under the assumption they would have the same on-time statistics. [sent-246, score-0.245]

78 (a) Number of exponential utility optimization runs to compute the Chernoff policies. [sent-248, score-0.281]

79 (b) Number of distinct Chernoff policies found. [sent-249, score-0.283]

80 Figure 4: Histograms demonstrating the efficiency and relevance of our algorithm on 500 randomly chosen origin - destination airport pairs, at 15 risk levels. [sent-250, score-0.546]

81 To test our algorithm on this problem, we randomly chose 500 origin - destination airport pairs and computed the Chernoff policies for risk levels: δ ∈ {1. [sent-251, score-0.829]

82 Figure 3 shows the resulting policies and corresponding cost (travel time) histograms for one such randomly chosen route. [sent-261, score-0.341]

83 To address the question of computational efficiency, Figure 4a shows a histogram of the total number of different parameters for which our algorithm ran exponential utility optimization. [sent-262, score-0.235]

84 To address the question of relevance, Figure 4b shows the number of distinct Chernoff policies found among the risk levels. [sent-263, score-0.476]

85 For most origin-destination pairs we found a rich spectrum of distinct policies, but there are also cases where all the Chernoff policies are identical or only the expectimin and minimax policies differ. [sent-265, score-0.786]

86 Many air travel routes exhibit only two phases mainly because they connect small airports where only one or two flights of the type we consider land or take off per day. [sent-266, score-0.312]

87 Consequently there will be few policies to choose from in these cases. [sent-267, score-0.283]

88 In our experiment, we chose 200 origin and destination pairs at random and, of these, 72 routes show only two phases. [sent-268, score-0.27]

89 In 41 of these cases, either the origin or the destination airport serves only one or two flights per day total. [sent-269, score-0.404]

90 Only 9 of the two-phase routes connect airports which both serve more than 10 flights per day total, and, of course, not all of these flights will help reach the destination. [sent-270, score-0.217]

91 Thus, typically the reason we see only two phases is that the choice of policies is very limited. [sent-271, score-0.283]

92 That is, they tend to set up routes such that, even in a worse than average scenario, the original route will tend to succeed. [sent-273, score-0.153]

93 6 Conclusion We proposed a new optimization objective for risk-aware planning called the Chernoff functional. [sent-274, score-0.239]

94 Our objective has a free parameter δ that can be used to interpolate between the nominal policy, which ignores risk, at δ = 1, and the minmax policy, which corresponds to extreme risk aversion, at δ = 0. [sent-275, score-0.48]

95 The value of the Chernoff functional is with probability at least 1 − δ an upper bound on the cost incurred by following the corresponding Chernoff policy. [sent-276, score-0.142]

96 We established a close connection between optimizing the Chernoff functional and mean minus variance optimization, which has been proposed before for risk-aware planning, but was found to be intractable in general. [sent-277, score-0.395]

97 We also established a close connection with optimization of mean minus scaled standard deviation. [sent-278, score-0.262]

98 We proposed an efficient algorithm that optimizes the Chernoff functional to any desired accuracy ε requiring O(1/ε) runs of exponential utility optimization. [sent-279, score-0.292]

99 Our experiments illustrate the capability of our approach to recover a spread of policies in the spectrum from risk neutral to minmax, requiring a running time that was on average about ten times the running time of value iteration. [sent-280, score-0.684]

100 Percentile optimization in uncertain Markov decision processes with application to efficient exploration. [sent-301, score-0.213]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('chernoff', 0.499), ('policies', 0.283), ('ights', 0.208), ('minmax', 0.208), ('jm', 0.207), ('risk', 0.193), ('airport', 0.187), ('minus', 0.148), ('percentile', 0.148), ('ight', 0.145), ('utility', 0.143), ('bqn', 0.125), ('dfw', 0.125), ('expectimin', 0.125), ('shv', 0.125), ('policy', 0.121), ('planning', 0.114), ('travel', 0.111), ('destination', 0.111), ('routes', 0.104), ('cumulant', 0.092), ('aversion', 0.092), ('mdp', 0.09), ('functional', 0.084), ('mco', 0.083), ('optimizing', 0.076), ('decision', 0.075), ('optimization', 0.073), ('cumulants', 0.068), ('st', 0.065), ('exponential', 0.065), ('airports', 0.062), ('costs', 0.06), ('cost', 0.058), ('origin', 0.055), ('minimax', 0.054), ('objective', 0.052), ('discounting', 0.051), ('day', 0.051), ('return', 0.05), ('route', 0.049), ('shie', 0.048), ('markov', 0.046), ('variance', 0.046), ('transitions', 0.046), ('gt', 0.046), ('hern', 0.043), ('expectation', 0.043), ('airlines', 0.042), ('eezj', 0.042), ('ewr', 0.042), ('jfk', 0.042), ('msy', 0.042), ('scheduled', 0.042), ('shreveport', 0.042), ('tickets', 0.042), ('mdps', 0.041), ('connection', 0.041), ('state', 0.039), ('iv', 0.038), ('processes', 0.038), ('horizon', 0.038), ('optimize', 0.037), ('depart', 0.037), ('moldovan', 0.037), ('airline', 0.037), ('keys', 0.037), ('buying', 0.037), ('atl', 0.037), ('erick', 0.037), ('log', 0.036), ('air', 0.035), ('player', 0.034), ('ndez', 0.034), ('historical', 0.034), ('precision', 0.033), ('var', 0.032), ('objectives', 0.032), ('connections', 0.032), ('berkeley', 0.032), ('guarantees', 0.032), ('ties', 0.032), ('grid', 0.032), ('delage', 0.032), ('rafael', 0.032), ('departure', 0.032), ('wait', 0.032), ('managing', 0.03), ('argmin', 0.029), ('actor', 0.029), ('absorbing', 0.029), ('transition', 0.029), ('arrive', 0.029), ('additionally', 0.028), ('actions', 0.028), ('total', 0.027), ('uncertain', 0.027), ('interpolate', 0.027), ('miss', 0.027), ('inherits', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

Author: Teodor M. Moldovan, Pieter Abbeel

Abstract: The expected return is a widely used objective in decision making under uncertainty. Many algorithms, such as value iteration, have been proposed to optimize it. In risk-aware settings, however, the expected return is often not an appropriate objective to optimize. We propose a new optimization objective for risk-aware planning and show that it has desirable theoretical properties. We also draw connections to previously proposed objectives for risk-aware planning: minmax, exponential utility, percentile and mean minus variance. Our method applies to an extended class of Markov decision processes: we allow costs to be stochastic as long as they are bounded. Additionally, we present an efficient algorithm for optimizing the proposed objective. Synthetic and real-world experiments illustrate the effectiveness of our method, at scale.

2 0.3045271 348 nips-2012-Tractable Objectives for Robust Policy Optimization

Author: Katherine Chen, Michael Bowling

Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1

3 0.18516102 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

4 0.16648792 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Author: Bruno Scherrer, Boris Lesner

Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one 2γ can compute stationary policies that are (1−γ)2 -optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for com2γ puting non-stationary policies that can be up to 1−γ -optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1

5 0.16336142 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes

Author: Takayuki Osogami

Abstract: We uncover relations between robust MDPs and risk-sensitive MDPs. The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. We also show that a risk-sensitive MDP of minimizing an iterated risk measure that is composed of certain coherent risk measures is equivalent to a robust MDP of minimizing the worst-case expectation when the possible deviations of uncertain parameters from their nominal values are characterized with a concave function. 1

6 0.12058304 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

7 0.098645031 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task

8 0.097967222 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

9 0.096517518 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

10 0.095598139 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

11 0.095451586 344 nips-2012-Timely Object Recognition

12 0.092492126 295 nips-2012-Risk-Aversion in Multi-armed Bandits

13 0.087047108 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

14 0.082611583 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

15 0.080585882 160 nips-2012-Imitation Learning by Coaching

16 0.077634461 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

17 0.075629175 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

18 0.070002913 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

19 0.068874091 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

20 0.067901462 126 nips-2012-FastEx: Hash Clustering with Exponential Families


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.169), (1, -0.261), (2, 0.022), (3, -0.058), (4, -0.021), (5, 0.022), (6, -0.0), (7, 0.033), (8, 0.018), (9, -0.015), (10, 0.024), (11, -0.008), (12, -0.028), (13, -0.005), (14, -0.003), (15, -0.077), (16, 0.016), (17, -0.04), (18, -0.011), (19, 0.003), (20, -0.027), (21, -0.014), (22, 0.017), (23, 0.04), (24, 0.064), (25, -0.036), (26, 0.035), (27, 0.051), (28, -0.07), (29, 0.089), (30, -0.047), (31, -0.072), (32, -0.203), (33, 0.091), (34, 0.017), (35, 0.093), (36, -0.025), (37, 0.123), (38, 0.067), (39, 0.089), (40, -0.043), (41, 0.065), (42, -0.022), (43, -0.02), (44, -0.08), (45, -0.018), (46, -0.02), (47, 0.046), (48, -0.009), (49, -0.07)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93596941 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

Author: Teodor M. Moldovan, Pieter Abbeel

Abstract: The expected return is a widely used objective in decision making under uncertainty. Many algorithms, such as value iteration, have been proposed to optimize it. In risk-aware settings, however, the expected return is often not an appropriate objective to optimize. We propose a new optimization objective for risk-aware planning and show that it has desirable theoretical properties. We also draw connections to previously proposed objectives for risk-aware planning: minmax, exponential utility, percentile and mean minus variance. Our method applies to an extended class of Markov decision processes: we allow costs to be stochastic as long as they are bounded. Additionally, we present an efficient algorithm for optimizing the proposed objective. Synthetic and real-world experiments illustrate the effectiveness of our method, at scale.

2 0.91618544 348 nips-2012-Tractable Objectives for Robust Policy Optimization

Author: Katherine Chen, Michael Bowling

Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1

3 0.89788884 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes

Author: Takayuki Osogami

Abstract: We uncover relations between robust MDPs and risk-sensitive MDPs. The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. We also show that a risk-sensitive MDP of minimizing an iterated risk measure that is composed of certain coherent risk measures is equivalent to a robust MDP of minimizing the worst-case expectation when the possible deviations of uncertain parameters from their nominal values are characterized with a concave function. 1

4 0.68839693 45 nips-2012-Approximating Equilibria in Sequential Auctions with Incomplete Information and Multi-Unit Demand

Author: Amy Greenwald, Jiacui Li, Eric Sodomka

Abstract: In many large economic markets, goods are sold through sequential auctions. Examples include eBay, online ad auctions, wireless spectrum auctions, and the Dutch flower auctions. In this paper, we combine methods from game theory and decision theory to search for approximate equilibria in sequential auction domains, in which bidders do not know their opponents’ values for goods, bidders only partially observe the actions of their opponents’, and bidders demand multiple goods. We restrict attention to two-phased strategies: first predict (i.e., learn); second, optimize. We use best-reply dynamics [4] for prediction (i.e., to predict other bidders’ strategies), and then assuming fixed other-bidder strategies, we estimate and solve the ensuing Markov decision processes (MDP) [18] for optimization. We exploit auction properties to represent the MDP in a more compact state space, and we use Monte Carlo simulation to make estimating the MDP tractable. We show how equilibria found using our search procedure compare to known equilibria for simpler auction domains, and we approximate an equilibrium for a more complex auction domain where analytical solutions are unknown. 1

5 0.65883398 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Author: Bruno Scherrer, Boris Lesner

Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one 2γ can compute stationary policies that are (1−γ)2 -optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for com2γ puting non-stationary policies that can be up to 1−γ -optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1

6 0.632631 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

7 0.6303646 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

8 0.627262 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

9 0.6254738 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

10 0.56606734 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

11 0.5638504 38 nips-2012-Algorithms for Learning Markov Field Policies

12 0.56350166 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions

13 0.55695313 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

14 0.51938373 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

15 0.51284337 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

16 0.50822467 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

17 0.48580584 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task

18 0.47699785 160 nips-2012-Imitation Learning by Coaching

19 0.4575569 358 nips-2012-Value Pursuit Iteration

20 0.4107599 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.027), (21, 0.04), (36, 0.011), (38, 0.109), (42, 0.043), (54, 0.065), (55, 0.038), (70, 0.249), (71, 0.013), (74, 0.057), (76, 0.133), (80, 0.069), (92, 0.04)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76924932 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

Author: Teodor M. Moldovan, Pieter Abbeel

Abstract: The expected return is a widely used objective in decision making under uncertainty. Many algorithms, such as value iteration, have been proposed to optimize it. In risk-aware settings, however, the expected return is often not an appropriate objective to optimize. We propose a new optimization objective for risk-aware planning and show that it has desirable theoretical properties. We also draw connections to previously proposed objectives for risk-aware planning: minmax, exponential utility, percentile and mean minus variance. Our method applies to an extended class of Markov decision processes: we allow costs to be stochastic as long as they are bounded. Additionally, we present an efficient algorithm for optimizing the proposed objective. Synthetic and real-world experiments illustrate the effectiveness of our method, at scale.

2 0.71048528 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

Author: Siddharth Gopal, Yiming Yang, Bing Bai, Alexandru Niculescu-mizil

Abstract: A challenging problem in hierarchical classification is to leverage the hierarchical relations among classes for improving classification performance. An even greater challenge is to do so in a manner that is computationally feasible for large scale problems. This paper proposes a set of Bayesian methods to model hierarchical dependencies among class labels using multivariate logistic regression. Specifically, the parent-child relationships are modeled by placing a hierarchical prior over the children nodes centered around the parameters of their parents; thereby encouraging classes nearby in the hierarchy to share similar model parameters. We present variational algorithms for tractable posterior inference in these models, and provide a parallel implementation that can comfortably handle largescale problems with hundreds of thousands of dimensions and tens of thousands of classes. We run a comparative evaluation on multiple large-scale benchmark datasets that highlights the scalability of our approach and shows improved performance over the other state-of-the-art hierarchical methods. 1

3 0.64040232 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

Author: James Lloyd, Peter Orbanz, Zoubin Ghahramani, Daniel M. Roy

Abstract: A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases. 1

4 0.63836831 344 nips-2012-Timely Object Recognition

Author: Sergey Karayev, Tobias Baumgartner, Mario Fritz, Trevor Darrell

Abstract: In a large visual multi-class detection framework, the timeliness of results can be crucial. Our method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time. Toward this goal, we formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which detector to deploy next. In contrast to previous work, our method significantly diverges from the predominant greedy strategies, and is able to learn to take actions with deferred values. We evaluate our method with a novel timeliness measure, computed as the area under an Average Precision vs. Time curve. Experiments are conducted on the PASCAL VOC object detection dataset. If execution is stopped when only half the detectors have been run, our method obtains 66% better AP than a random ordering, and 14% better performance than an intelligent baseline. On the timeliness measure, our method obtains at least 11% better performance. Our method is easily extensible, as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning. 1

5 0.63699132 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

6 0.63371873 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

7 0.63182443 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

8 0.63133186 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

9 0.6303044 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

10 0.62982374 348 nips-2012-Tractable Objectives for Robust Policy Optimization

11 0.62912482 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

12 0.62880141 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

13 0.62554926 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model

14 0.62404948 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

15 0.62349689 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

16 0.62342805 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

17 0.62233913 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

18 0.62197232 188 nips-2012-Learning from Distributions via Support Measure Machines

19 0.62196261 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

20 0.62163866 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model