nips nips2012 nips2012-88 knowledge-graph by maker-knowledge-mining

88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning


Source: pdf

Author: Dongho Kim, Kee-eung Kim, Pascal Poupart

Abstract: In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected longterm total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected longterm total reward. [sent-8, score-0.624]

2 In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. [sent-9, score-0.395]

3 We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. [sent-10, score-0.217]

4 We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems. [sent-11, score-0.152]

5 1 Introduction In reinforcement learning (RL), the agent interacts with a (partially) unknown environment, classically assumed to be a Markov decision process (MDP), with the goal of maximizing its expected long-term total reward. [sent-12, score-0.331]

6 The agent faces the exploration-exploitation dilemma: the agent must select actions that exploit its current knowledge about the environment to maximize reward, but it also needs to select actions that explore for more information so that it can act better. [sent-13, score-0.476]

7 For example, a robot exploring in an unfamiliar terrain may reach a dangerous location and sustain heavy damage, or wander off from the recharging station to the point where a costly rescue mission is required. [sent-16, score-0.329]

8 The CMDP assumes that executing actions incurs costs as well as rewards, and that the two should be optimized separately. [sent-20, score-0.206]

9 Assuming the expected total reward and cost criterion, the goal is to find an optimal policy that maximizes the expected total reward while bounding the expected total cost. [sent-21, score-0.765]

10 Since we can naturally encode undesirable behaviors into the cost function, we formulate the cost-sensitive exploration problem as RL in the environment modeled as a CMDP. [sent-22, score-0.384]

11 Note that we can employ other criteria for the cost constraint in CMDPs. [sent-23, score-0.184]

12 We can require the actual total cost to stay below the cost bound with probability one using sample-path cost constraints [6, 7], or with probability 1 − δ using percentile cost constraints [8]. [sent-24, score-0.784]

13 In this paper, we restrict ourselves to the expected total cost constraint mainly due to the computational efficiency in solving the constrained optimization problem. [sent-25, score-0.35]

14 Extending our work to other cost criteria is left as future work. [sent-26, score-0.163]

15 The main argument we make is that the CMDP provides a natural framework for representing various approaches to constrained exploration, such as safe exploration [9, 10]. [sent-27, score-0.234]

16 In order to perform cost-sensitive exploration in the Bayesian RL (BRL) setting, we cast the problem as a constrained partially observable MDP (CPOMDP) [11, 12] planning problem. [sent-28, score-0.238]

17 Specifically, we take a model-based BRL approach and extend BEETLE [4] to solve the CPOMDP which models BRL with cost constraints. [sent-29, score-0.163]

18 We briefly review the CMDP and CPOMDP before summarizing BEETLE, a model-based BRL for environments without cost constraints. [sent-32, score-0.163]

19 b0 is optional, since an optimal policy π∗ : S → A that maps states to actions can be shown to be independent of b0. [sent-35, score-0.254]

20 The constrained MDP (CMDP) is defined by the tuple ⟨S, A, T, R, C, ĉ, γ, b0⟩ with the following additional components: C(s, a) ∈ ℜ is the cost function which denotes the immediate cost incurred by executing action a in state s; ĉ is the bound on the expected total discounted cost. [sent-36, score-0.956]

21 An optimal policy of a CMDP maximizes the expected total discounted reward over the infinite horizon, while not incurring more than ĉ total discounted cost in expectation. [sent-37, score-0.712]

22 where V^π = E_{π,b0}[Σ_{t=0}^∞ γ^t R(s_t, a_t)] is the expected total discounted reward, and C^π = E_{π,b0}[Σ_{t=0}^∞ γ^t C(s_t, a_t)] is the expected total discounted cost. [sent-41, score-0.318]

23 We will also use C^π(s) to denote the expected total cost starting from state s. [sent-42, score-0.31]

24 It has been shown that an optimal policy for CMDP is generally a randomized stationary policy [5]. [sent-43, score-0.399]

25 Hence, we define a policy π as a mapping of states to probability distributions over actions, where π(s, a) denotes the probability that an agent will execute action a in state s. [sent-44, score-0.554]

26 We can find an optimal policy by solving the following linear program (LP): max_x Σ_{s,a} R(s, a) x(s, a) (1) subject to Σ_a x(s′, a) − γ Σ_{s,a} x(s, a) T(s, a, s′) = b0(s′) for all s′, Σ_{s,a} C(s, a) x(s, a) ≤ ĉ, and x(s, a) ≥ 0 for all (s, a). [sent-45, score-0.152]

27 The variables x are related to the occupancy measure of an optimal policy, where x(s, a) is the expected discounted number of times action a is executed in state s. [sent-47, score-0.279]

28 If the above LP yields a feasible solution, an optimal policy can be obtained by π(s, a) = x(s, a) / Σ_{a′} x(s, a′). [sent-48, score-0.152]
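
As an illustration only (not the authors' implementation), LP (1) can be handed to an off-the-shelf solver once the model is known; the sketch below assumes NumPy arrays T (S×A×S), R and C (S×A), an initial distribution b0, a discount γ, and a bound ĉ, and recovers the randomized policy π(s, a) = x(s, a) / Σ_{a′} x(s, a′).

```python
import numpy as np
from scipy.optimize import linprog

def solve_cmdp_lp(T, R, C, b0, gamma, c_hat):
    """Solve LP (1): maximize sum_{s,a} R[s,a] x[s,a] over occupancy measures x.

    Returns a randomized stationary policy pi[s, a], or None if the LP is infeasible
    (i.e. no policy satisfies the cost constraint).
    """
    S, A = R.shape
    # Flatten x(s, a) into a vector of length S*A; linprog minimizes, so negate R.
    c_obj = -R.reshape(S * A)

    # Flow constraints: sum_a x(s',a) - gamma * sum_{s,a} x(s,a) T(s,a,s') = b0(s') for all s'.
    A_eq = np.zeros((S, S * A))
    for sp in range(S):
        for a in range(A):
            A_eq[sp, sp * A + a] += 1.0
        A_eq[sp, :] -= gamma * T[:, :, sp].reshape(S * A)

    # Cost constraint: sum_{s,a} C(s,a) x(s,a) <= c_hat.
    A_ub = C.reshape(1, S * A)

    res = linprog(c_obj, A_ub=A_ub, b_ub=[c_hat], A_eq=A_eq, b_eq=b0,
                  bounds=(0, None), method="highs")
    if not res.success:
        return None
    x = res.x.reshape(S, A)
    # pi(s, a) = x(s, a) / sum_a' x(s, a'); fall back to uniform where x(s, .) is ~0.
    totals = x.sum(axis=1, keepdims=True)
    return np.where(totals > 1e-12, x / np.maximum(totals, 1e-12), 1.0 / A)
```

Note that this solves a CMDP whose parameters are known; in the BRL setting considered here the same constrained optimization has to be carried out over hyperstate beliefs instead.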

29 Note that due to the introduction of cost constraints, the resulting optimal policy is contingent on the initial state distribution b0, in contrast to the standard MDP, for which an optimal policy can be independent of the initial state distribution. [sent-49, score-0.605]

30 Note also that the above LP may be infeasible if there is no policy that can satisfy the cost constraint. [sent-50, score-0.315]

31 The standard POMDP is defined by the tuple ⟨S, A, Z, T, O, R, γ, b0⟩ with the following additional components: the set Z of observations z, and the observation probability O(s′, a, z) representing the probability Pr(z|s′, a) of observing z when executing action a and changing to state s′. [sent-52, score-0.348]

32 The Bellman backup operator for CPOMDP generates pairs of α-vectors (αR , αC ), each vector corresponding to the expected total reward and cost, respectively. [sent-56, score-0.462]

33 In order to facilitate defining the Bellman backup operator at a belief state, we augment the belief state with a scalar quantity called admissible cost [13], which represents the expected total cost that can be additionally incurred for the future time steps without violating the cost constraint. [sent-57, score-1.22]

34 Suppose that, at time step t, the agent has so far incurred a total cost of W_t = Σ_{τ=0}^t γ^τ C(s_τ, a_τ). [sent-58, score-0.391]

35 The admissible cost at time step t + 1 is then defined as d_{t+1} = (ĉ − W_t) / γ^{t+1}. [sent-61, score-0.25]
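
Equivalently, the admissible cost can be maintained online with the recursion d_{t+1} = (d_t − C(s_t, a_t)) / γ starting from d_0 = ĉ, which follows from the definition above; a minimal sketch (names are illustrative):

```python
def update_admissible_cost(d, cost, gamma):
    """One-step admissible-cost update: d_{t+1} = (d_t - C(s_t, a_t)) / gamma.

    Starting from d_0 = c_hat this equals (c_hat - W_t) / gamma^{t+1}.
    A zero-cost action (e.g. staying at the recharging station) grows d by a factor 1/gamma.
    """
    return (d - cost) / gamma

# Example with c_hat = 5.0 and gamma = 0.95:
d = 5.0
d = update_admissible_cost(d, cost=1.0, gamma=0.95)   # a costly move shrinks the budget
d = update_admissible_cost(d, cost=0.0, gamma=0.95)   # "stay" at the station grows it by 1/gamma
```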

36 The point-based backup for CPOMDP leveraging the above LP formulation is shown in Algorithm 1. [sent-64, score-0.232]

37 Note that this algorithm is an improvement over the heuristic in [12] that distributes the admissible cost to each observation in proportion to Pr(z|b, a). [sent-65, score-0.25]

38 Instead, we optimize the cost distribution by solving an LP. [sent-66, score-0.163]

39 In summary, the hyperstate POMDP augments the original state space with the set of unknown parameters {θ_a^{s,s′}}, since the agent has to take actions without exact information on the unknown parameters. [sent-70, score-0.608]

40 The belief state b in the hyperstate POMDP yields the posterior of θ. [sent-71, score-0.434]
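
In the common case where the prior over each unknown transition distribution θ_a^{s,·} is a Dirichlet, this posterior is summarized by hyperparameter counts and the Bayesian update is a simple count increment. A small sketch under that assumption (the class and method names are illustrative, not taken from the paper's code):

```python
import numpy as np

class DirichletBelief:
    """Belief over unknown transition parameters theta_a^{s, .}, one Dirichlet per (s, a)."""

    def __init__(self, n_states, n_actions, prior_count=1.0):
        # counts[s, a, s'] are Dirichlet hyperparameters; prior_count = 1 is a uniform prior.
        self.counts = np.full((n_states, n_actions, n_states), prior_count)

    def update(self, s, a, s_next):
        """Posterior update after observing transition (s, a, s'): increment the matching count."""
        self.counts[s, a, s_next] += 1.0

    def expected_transition(self, s, a):
        """Posterior mean of theta_a^{s, .}, i.e. the expected transition distribution."""
        return self.counts[s, a] / self.counts[s, a].sum()
```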

41 Solving the hyperstate POMDP is performed by dynamic programming with the Bellman backup operator [2]. [sent-76, score-0.482]

42 Specifically, the value function is represented as a set Γ of α-functions for each state s, so that the value of an optimal policy is obtained by V_s^*(b) = max_{α∈Γ} α_s(b), where α_s(b) = ∫ b(θ) α_s(θ) dθ. [sent-77, score-0.221]
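
When the belief is a product of Dirichlets and the α-functions are polynomials in θ, as in BEETLE, the integral above reduces to closed-form expectations of monomials under a Dirichlet; a hedged sketch of that building block, computed in log space to avoid overflow (the function name is illustrative):

```python
from math import lgamma, exp

def dirichlet_monomial_expectation(counts, powers):
    """E[ prod_i theta_i^{k_i} ] for theta ~ Dirichlet(counts).

    Closed form: Gamma(sum n) / Gamma(sum n + sum k) * prod_i Gamma(n_i + k_i) / Gamma(n_i).
    Polynomial alpha-functions evaluate at a Dirichlet belief as sums of such terms.
    """
    n0, k0 = sum(counts), sum(powers)
    log_val = lgamma(n0) - lgamma(n0 + k0)
    for n, k in zip(counts, powers):
        log_val += lgamma(n + k) - lgamma(n)
    return exp(log_val)

# e.g. E[theta_1^2 * theta_2] under Dirichlet(1, 1, 1) = 1/30:
# dirichlet_monomial_expectation([1.0, 1.0, 1.0], [2, 1])
```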

43 There are two computational challenges with the hyperstate POMDP approach. [sent-79, score-0.25]

44 First, since the model is a POMDP, the Bellman backup has to be performed over all possible belief states in the probability simplex. [sent-80, score-0.382]

45 BEETLE adopts Perseus [14], performing randomized point-based backups confined to a set of (s, b) pairs sampled by simulating a default or random policy, and reducing the total number of value backups by improving the value of many belief points through a single backup. [sent-81, score-0.367]

46 Specifically, we formulate cost-sensitive BRL as a hyperstate CPOMDP ⟨S_P, A_P, Z_P, T_P, O_P, R_P, C_P, ĉ, γ, b0⟩ where S_P = S × {θ_a^{s,s′}}, A_P = A, Z_P = S, T_P(s, θ, a, s′, θ′) = θ_a^{s,s′} δ_θ(θ′), O_P(s′, θ′, a, z) = δ_{s′}(z), R_P(s, θ, a) = R(s, a), and C_P(s, θ, a) = C(s, a). [sent-86, score-0.273]

47 Note that using the cost function C and cost bound ĉ to encode the constraints on the exploration behaviour allows us to enjoy the same flexibility as using the reward function to define the task objective in the standard MDP and POMDP. [sent-87, score-0.616]

48 In addition, we can even completely eliminate the possibility of executing action a in state s by setting the discount factor to 1 for the cost constraint and imposing a sufficiently low cost bound ĉ < C(s, a). [sent-89, score-0.738]

49 As in BEETLE, α-vectors for the expected total reward and cost are represented as α-functions in terms of unknown parameters. [sent-91, score-0.388]

50 The point-based backup operator in Algorithm 1 naturally extends to α-functions without a significant increase in computational complexity: the size of the LP does not increase even though the belief states represent probability distributions over unknown parameters. [sent-92, score-0.421]

51 Algorithm 2 shows the point-based backup of α-functions in the hyperstate CPOMDP. [sent-93, score-0.482]

52 We also implemented the randomized point-based backup to further improve the performance. [sent-96, score-0.327]

53 The key step in the randomized value update is to check whether a newly generated set of α-function pairs Γ = {(α_{i,R}, α_{i,C})} from a point-based backup yields an improved value at some other sampled belief state (s, n, d). [sent-97, score-0.555]

54 We can obtain the value of Γ at the belief state by solving the following LP: max_w Σ_i w_i α_{i,R}(b) subject to Σ_i w_i α_{i,C}(b) ≤ d, Σ_i w_i = 1, and w_i ≥ 0 for all i (2). If we can find an improved value, we skip the point-based backup at (s, n, d) in the current iteration. [sent-98, score-0.724]
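
A minimal sketch of LP (2) with scipy.optimize.linprog, assuming the reward and cost α-values at the belief point, α_{i,R}(b) and α_{i,C}(b), have already been computed as plain vectors (variable names are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def value_at_belief(alpha_R, alpha_C, d):
    """Solve LP (2): max_w sum_i w_i alpha_R[i]
    s.t. sum_i w_i alpha_C[i] <= d, sum_i w_i = 1, w_i >= 0.

    Returns the mixed value at the belief point, or None if no mixture of the
    alpha-function pairs satisfies the admissible cost d.
    """
    n = len(alpha_R)
    res = linprog(-np.asarray(alpha_R),                       # maximize => minimize the negative
                  A_ub=np.asarray(alpha_C).reshape(1, n), b_ub=[d],
                  A_eq=np.ones((1, n)), b_eq=[1.0],
                  bounds=(0, None), method="highs")
    return -res.fun if res.success else None
```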

55 In summary, the point-based value iteration algorithm for CPOMDP and BEETLE readily provide all the essential computational tools to implement the hyperstate CPOMDP planning for the costsensitive BRL. [sent-100, score-0.32]

56 Figure 1: (a) 5-state chain: each edge is labeled with action, reward, and cost associated with the transition. [sent-105, score-0.163]

57 (b) 6 × 7 maze: a 6 × 7 grid including the start location with recharging station (S), goal location (G), and 3 flags to capture. [sent-106, score-0.32]

58 The first one is the 5-state chain [15, 16, 4], and the second one is the 6 × 7 maze [16]. [sent-108, score-0.171]

59 1 Description of Problems The 5-state chain problem is shown in Figure 1a, where the agent has two actions 1 and 2. [sent-110, score-0.237]

60 The agent receives a large reward of 10 by executing action 1 in state 5, or a small reward of 2 by executing action 2 in any state. [sent-111, score-0.937]

61 With probability 0.2, the agent slips and makes the transition corresponding to the other action. [sent-113, score-0.189]

62 We defined the constrained version of the problem by assigning a cost of 1 for action 1 in every state, thus making the consecutive execution of action 1 potentially violate the cost constraint. [sent-114, score-0.755]
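
For concreteness, the sketch below assembles model arrays for this constrained chain so they could be fed to the CMDP LP sketch given earlier; the slip probability of 0.2 and γ = 0.95 follow the standard chain benchmark and are assumptions here, and in the BRL setting the slip probability is of course unknown to the agent rather than given.

```python
import numpy as np

def build_constrained_chain(slip=0.2, gamma=0.95):
    """Assemble (T, R, C, b0) for the constrained 5-state chain sketched above.

    Action 0 (the paper's action "1") moves one state forward (state 5 self-loops);
    action 1 (the paper's action "2") resets the agent to state 1.  With probability
    `slip` the agent performs the other action instead.  Rewards follow sentence 60:
    10 for action "1" in state 5 and 2 for action "2" in any state (how a slip affects
    the reward is glossed over here).  The cost of 1 for action "1" in every state is
    the constrained variant described above.
    """
    S, A = 5, 2
    T = np.zeros((S, A, S))
    for s in range(S):
        fwd = min(s + 1, S - 1)
        T[s, 0, fwd] += 1.0 - slip   # action "1": forward, or slip and reset
        T[s, 0, 0]   += slip
        T[s, 1, 0]   += 1.0 - slip   # action "2": reset, or slip and go forward
        T[s, 1, fwd] += slip
    R = np.zeros((S, A))
    R[S - 1, 0] = 10.0               # large reward for action "1" in state 5
    R[:, 1] = 2.0                    # small reward for action "2" everywhere
    C = np.zeros((S, A))
    C[:, 0] = 1.0                    # cost 1 whenever action "1" is chosen
    b0 = np.zeros(S)
    b0[0] = 1.0                      # start in state 1
    return T, R, C, b0, gamma

# e.g. feed the known-parameter model to the CMDP LP sketched earlier:
# T, R, C, b0, gamma = build_constrained_chain()
# pi = solve_cmdp_lp(T, R, C, b0, gamma, c_hat=10.0)
```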

63 The 6 × 7 maze problem is shown in Figure 1b, where the white cells are navigable locations and gray cells are walls that block navigation. [sent-115, score-0.145]

64 Every “move” action (except for the stay action) can fail with probability 0. [sent-117, score-0.189]

65 If the agent bumps into a wall, the action will have no effect. [sent-119, score-0.298]

66 Upon reaching the goal, the agent obtains a reward equal to the number of flags captured, and the agent gets warped back to the start location. [sent-121, score-0.396]

67 Since there are 33 reachable locations in the maze and 8 possible combinations for the status of captured flags, there are a total of 264 states. [sent-122, score-0.237]

68 We defined the constrained version of the problem by assuming that the agent is equipped with a battery and that every action consumes energy except the stay action at the recharging station. [sent-123, score-0.817]

69 We modeled the power consumption by assigning a cost of 0 for executing the stay action at the recharging station, and a cost of 1 otherwise. [sent-124, score-0.794]

70 Thus, battery recharging is done by executing the stay action at the recharging station, since the admissible cost then increases by a factor of 1/γ. [sent-125, score-0.96]

71 2 Results Table 1 summarizes the experimental results for the constrained chain and maze problems. [sent-127, score-0.259]

72 Both chain-tied and chain-semi assume that the transition dynamics are known to the agent except for the slip probabilities. [sent-129, score-0.278]

73 In chain-tied, the slip probability is assumed to be independent of state and action, thus there is only one unknown parameter in transition dynamics. [sent-130, score-0.242]

74 In chain-semi, the slip probability is assumed to be action dependent, thus there are two unknown parameters since there are two actions. [sent-131, score-0.282]

75 We excluded experimenting with the “full” prior model (completely unknown transition dynamics) since even BEETLE was not able to learn a near-optimal policy as reported in [4]. [sent-133, score-0.236]

76 We report the average discounted total reward and cost as well as their 95% confidence intervals for the first 1000 time steps using 200 simulated trials. [sent-134, score-0.401]

77 We performed 60 Bellman iterations on 500 belief states, and used the first 50 belief states for choosing the set of basis functions. [sent-135, score-0.291]

78 When ĉ=100, which is the maximum expected total cost that can be incurred by any policy, CBEETLE found policies that are as good as the policy found by BEETLE since the cost constraint has no effect. [sent-138, score-0.641]

79 As we impose tighter cost constraints by ĉ=75, 50, and 25, the policies start to trade off the rewards in order to meet the cost constraint. [sent-139, score-0.424]

80 Note also that, although we use approximations in the various stages of the algorithm, ĉ is within the confidence intervals of the average total cost, meaning that the cost constraint is either met or violated by statistically insignificant amounts. [sent-140, score-0.233]

81 Since chain-semi has more unknown parameters than chain-tied, it is natural that the performance of the CBEETLE policy is slightly degraded in chain-semi. [sent-141, score-0.191]

82 Note also that as we impose tighter cost constraints, the running times generally increase. [sent-142, score-0.163]

83 This is because the cost constraint in the LP tends to become active at more belief states, generating two α-function pairs instead of a single α-function pair when the cost constraint in the LP is not active. [sent-143, score-0.506]

84 Due to the computational requirement for solving the large hyperstate CPOMDP, we only experimented with the “tied” prior model which assumes that the slip probability is shared by every state and action. [sent-146, score-0.408]

85 Note that ĉ = 1/(1 − 0.95) = 20 is equivalent to running BEETLE without cost constraints, as verified in the table. [sent-148, score-0.163]

86 We further analyzed the cost-sensitive exploration behaviour in the maze problem. [sent-149, score-0.297]

87 Figure 2 compares the policy behaviors of BEETLE and CBEETLE (ĉ=18) in the maze problem. [sent-150, score-0.297]

88 The BEETLE policy generally captures the top flag first (Figure 2a), then navigates straight to the goal (Figure 2b) or captures the right flag and navigates to the goal (Figure 2c). [sent-151, score-0.324]

89 If it captures the right flag first, it then navigates to the goal (Figure 2d) or captures the top flag and navigates to the goal (Figure 2e). [sent-152, score-0.172]

90 The CBEETLE policy shows a similar capture behaviour, but it stays at the recharging station for a number of time steps between the first and second flag captures, which can be confirmed by the high state visitation frequency for the cell S in Figures 2g and 2i. [sent-154, score-0.536]

91 This is because the policy cannot navigate to the other flag position and move to the goal without recharging the battery in between. [sent-155, score-0.394]

92 The agent also frequently visits the recharging station before the first flag capture (Figure 2f) because it actively explores for the first flag with a high uncertainty in the dynamics. [sent-156, score-0.418]

93 We can set γ = 1 and make the cost function assign, e.g., a cost of -1 for recharging and 1 for consuming energy. [sent-158, score-0.163]

94 However, our implementation currently assumes the same discount factor for the rewards and costs. [sent-160, score-0.449]

95 Table 1: Experimental results for the chain and maze problems. [sent-162, score-0.171]

96 Figure 2: State visitation frequencies of each location in the maze problem over 100 runs. [sent-233, score-0.209]

97 (a-e) Behavior of BEETLE (a) before the first flag capture, (b) after the top flag captured first, (c) after the top flag captured first and the right flag second, (d) after the right flag captured first, and (e) after the right flag captured first and the top flag second. [sent-235, score-0.172]

98 5 Conclusion In this paper, we proposed CBEETLE, a model-based BRL algorithm for cost-sensitive exploration, extending BEETLE to solve the hyperstate CPOMDP which models BRL using cost constraints. [sent-238, score-0.413]

99 While our experiments show that the policies generally satisfy the cost constraints, they can still potentially violate the constraints since we approximate the alpha functions using a finite number of basis functions. [sent-241, score-0.307]

100 As for future work, we plan to focus on making CBEETLE more robust to approximation errors by performing a constrained optimization when approximating the alpha functions, so as to guarantee that the cost constraints are never violated. [sent-242, score-0.31]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('beetle', 0.375), ('brl', 0.286), ('cpomdp', 0.286), ('hyperstate', 0.25), ('backup', 0.232), ('cbeetle', 0.232), ('recharging', 0.179), ('cost', 0.163), ('ag', 0.158), ('cmdp', 0.158), ('pomdp', 0.156), ('action', 0.154), ('policy', 0.152), ('maze', 0.145), ('agent', 0.144), ('wiz', 0.125), ('wa', 0.116), ('belief', 0.115), ('exploration', 0.112), ('lp', 0.112), ('wis', 0.11), ('foreach', 0.109), ('reward', 0.108), ('executing', 0.1), ('randomized', 0.095), ('station', 0.095), ('slip', 0.089), ('constrained', 0.088), ('admissible', 0.087), ('discounted', 0.081), ('mdp', 0.07), ('state', 0.069), ('discount', 0.068), ('actions', 0.067), ('bellman', 0.065), ('navigates', 0.063), ('battery', 0.063), ('wi', 0.057), ('environment', 0.054), ('zp', 0.052), ('ags', 0.052), ('total', 0.049), ('transition', 0.045), ('pairs', 0.044), ('captured', 0.043), ('visitation', 0.041), ('behaviour', 0.04), ('rewards', 0.039), ('sp', 0.039), ('tp', 0.039), ('unknown', 0.039), ('reinforcement', 0.038), ('planning', 0.038), ('korea', 0.037), ('op', 0.036), ('cmdps', 0.036), ('dirichlets', 0.036), ('regress', 0.036), ('states', 0.035), ('stay', 0.035), ('ap', 0.035), ('incurred', 0.035), ('safe', 0.034), ('violate', 0.033), ('decision', 0.032), ('wt', 0.032), ('backups', 0.032), ('costsensitive', 0.032), ('mission', 0.032), ('rl', 0.031), ('ns', 0.031), ('constraints', 0.03), ('ross', 0.03), ('policies', 0.029), ('kim', 0.029), ('expected', 0.029), ('perseus', 0.029), ('hyperparameter', 0.029), ('rp', 0.029), ('poupart', 0.027), ('alpha', 0.026), ('basis', 0.026), ('chain', 0.026), ('tuple', 0.025), ('avg', 0.024), ('bayesian', 0.024), ('markov', 0.024), ('captures', 0.023), ('location', 0.023), ('dir', 0.023), ('percentile', 0.023), ('subject', 0.023), ('formulate', 0.023), ('st', 0.022), ('pseudocode', 0.022), ('cp', 0.022), ('constraint', 0.021), ('dz', 0.021), ('defense', 0.021), ('operations', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

Author: Dongho Kim, Kee-eung Kim, Pascal Poupart

Abstract: In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected longterm total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems. 1

2 0.19704035 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

Author: Manuel Lopes, Tobias Lang, Marc Toussaint, Pierre-yves Oudeyer

Abstract: Formal exploration approaches in model-based reinforcement learning estimate the accuracy of the currently learned model without consideration of the empirical prediction error. For example, PAC-MDP approaches such as R- MAX base their model certainty on the amount of collected data, while Bayesian approaches assume a prior over the transition dynamics. We propose extensions to such approaches which drive exploration solely based on empirical estimates of the learner’s accuracy and learning progress. We provide a “sanity check” theoretical analysis, discussing the behavior of our extensions in the standard stationary finite state-action case. We then provide experimental studies demonstrating the robustness of these exploration measures in cases of non-stationary environments or where original approaches are misled by wrong domain assumptions. 1

3 0.18672952 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

Author: Arthur Guez, David Silver, Peter Dayan

Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayesoptimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration. 1

4 0.16452117 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

5 0.16435888 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume

Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward functfion. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1

6 0.15777157 344 nips-2012-Timely Object Recognition

7 0.14467086 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

8 0.14356001 348 nips-2012-Tractable Objectives for Robust Policy Optimization

9 0.13716508 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

10 0.13669738 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

11 0.13429007 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

12 0.13420416 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

13 0.12879819 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

14 0.12748727 160 nips-2012-Imitation Learning by Coaching

15 0.1185086 331 nips-2012-Symbolic Dynamic Programming for Continuous State and Observation POMDPs

16 0.10442943 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

17 0.097190753 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes

18 0.090011194 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

19 0.089125939 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

20 0.087714024 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.151), (1, -0.341), (2, -0.041), (3, -0.04), (4, -0.05), (5, 0.01), (6, -0.002), (7, -0.049), (8, -0.009), (9, -0.001), (10, -0.018), (11, -0.015), (12, 0.027), (13, 0.047), (14, 0.028), (15, -0.025), (16, -0.004), (17, 0.019), (18, 0.011), (19, 0.014), (20, 0.025), (21, 0.02), (22, -0.029), (23, 0.01), (24, -0.005), (25, 0.0), (26, 0.067), (27, 0.024), (28, 0.052), (29, 0.077), (30, 0.049), (31, 0.02), (32, 0.006), (33, 0.021), (34, 0.002), (35, 0.065), (36, -0.001), (37, 0.039), (38, 0.049), (39, -0.013), (40, -0.023), (41, 0.072), (42, 0.074), (43, -0.035), (44, -0.036), (45, -0.026), (46, 0.131), (47, 0.007), (48, -0.072), (49, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96426803 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

Author: Dongho Kim, Kee-eung Kim, Pascal Poupart

Abstract: In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected longterm total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems. 1

2 0.81417274 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

Author: Manuel Lopes, Tobias Lang, Marc Toussaint, Pierre-yves Oudeyer

Abstract: Formal exploration approaches in model-based reinforcement learning estimate the accuracy of the currently learned model without consideration of the empirical prediction error. For example, PAC-MDP approaches such as R- MAX base their model certainty on the amount of collected data, while Bayesian approaches assume a prior over the transition dynamics. We propose extensions to such approaches which drive exploration solely based on empirical estimates of the learner’s accuracy and learning progress. We provide a “sanity check” theoretical analysis, discussing the behavior of our extensions in the standard stationary finite state-action case. We then provide experimental studies demonstrating the robustness of these exploration measures in cases of non-stationary environments or where original approaches are misled by wrong domain assumptions. 1

3 0.79306507 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

Author: Trung Nguyen, Tomi Silander, Tze Y. Leong

Abstract: We study how to automatically select and adapt multiple abstractions or representations of the world to support model-based reinforcement learning. We address the challenges of transfer learning in heterogeneous environments with varying tasks. We present an efficient, online framework that, through a sequence of tasks, learns a set of relevant representations to be used in future tasks. Without predefined mapping strategies, we introduce a general approach to support transfer learning across different state spaces. We demonstrate the potential impact of our system through improved jumpstart and faster convergence to near optimum policy in two benchmark domains. 1

4 0.78799891 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

Author: Arthur Guez, David Silver, Peter Dayan

Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayesoptimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration. 1

5 0.73473001 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

Author: Felipe Trevizan, Manuela Veloso

Abstract: Probabilistic planning captures the uncertainty of plan execution by probabilistically modeling the effects of actions in the environment, and therefore the probability of reaching different states from a given state and action. In order to compute a solution for a probabilistic planning problem, planners need to manage the uncertainty associated with the different paths from the initial state to a goal state. Several approaches to manage uncertainty were proposed, e.g., consider all paths at once, perform determinization of actions, and sampling. In this paper, we introduce trajectory-based short-sighted Stochastic Shortest Path Problems (SSPs), a novel approach to manage uncertainty for probabilistic planning problems in which states reachable with low probability are substituted by artificial goals that heuristically estimate their cost to reach a goal state. We also extend the theoretical results of Short-Sighted Probabilistic Planner (SSiPP) [1] by proving that SSiPP always finishes and is asymptotically optimal under sufficient conditions on the structure of short-sighted SSPs. We empirically compare SSiPP using trajectorybased short-sighted SSPs with the winners of the previous probabilistic planning competitions and other state-of-the-art planners in the triangle tireworld problems. Trajectory-based SSiPP outperforms all the competitors and is the only planner able to scale up to problem number 60, a problem in which the optimal solution contains approximately 1070 states. 1

6 0.73454398 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

7 0.71114123 348 nips-2012-Tractable Objectives for Robust Policy Optimization

8 0.69354814 38 nips-2012-Algorithms for Learning Markov Field Policies

9 0.69271004 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

10 0.66116095 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

11 0.62810618 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

12 0.61694056 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

13 0.61323231 45 nips-2012-Approximating Equilibria in Sequential Auctions with Incomplete Information and Multi-Unit Demand

14 0.60587722 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes

15 0.60454857 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

16 0.60434717 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

17 0.60191929 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

18 0.59102273 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

19 0.57753623 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

20 0.57666212 162 nips-2012-Inverse Reinforcement Learning through Structured Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.04), (21, 0.013), (36, 0.012), (38, 0.088), (42, 0.018), (46, 0.292), (53, 0.02), (54, 0.13), (55, 0.016), (74, 0.05), (76, 0.105), (80, 0.094), (92, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.72391075 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

Author: Dongho Kim, Kee-eung Kim, Pascal Poupart

Abstract: In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected longterm total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems. 1

2 0.71588707 256 nips-2012-On the connections between saliency and tracking

Author: Vijay Mahadevan, Nuno Vasconcelos

Abstract: A model connecting visual tracking and saliency has recently been proposed. This model is based on the saliency hypothesis for tracking which postulates that tracking is achieved by the top-down tuning, based on target features, of discriminant center-surround saliency mechanisms over time. In this work, we identify three main predictions that must hold if the hypothesis were true: 1) tracking reliability should be larger for salient than for non-salient targets, 2) tracking reliability should have a dependence on the defining variables of saliency, namely feature contrast and distractor heterogeneity, and must replicate the dependence of saliency on these variables, and 3) saliency and tracking can be implemented with common low level neural mechanisms. We confirm that the first two predictions hold by reporting results from a set of human behavior studies on the connection between saliency and tracking. We also show that the third prediction holds by constructing a common neurophysiologically plausible architecture that can computationally solve both saliency and tracking. This architecture is fully compliant with the standard physiological models of V1 and MT, and with what is known about attentional control in area LIP, while explaining the results of the human behavior experiments.

3 0.67151743 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

Author: Erik Talvitie

Abstract: This paper introduces timeline trees, which are partial models of partially observable environments. Timeline trees are given some specific predictions to make and learn a decision tree over history. The main idea of timeline trees is to use temporally abstract features to identify and split on features of key events, spread arbitrarily far apart in the past (whereas previous decision-tree-based methods have been limited to a finite suffix of history). Experiments demonstrate that timeline trees can learn to make high quality predictions in complex, partially observable environments with high-dimensional observations (e.g. an arcade game). 1

4 0.66119343 244 nips-2012-Nonconvex Penalization Using Laplace Exponents and Concave Conjugates

Author: Zhihua Zhang, Bojun Tu

Abstract: In this paper we study sparsity-inducing nonconvex penalty functions using Lévy processes. We define such a penalty as the Laplace exponent of a subordinator. Accordingly, we propose a novel approach for the construction of sparsity-inducing nonconvex penalties. Particularly, we show that the nonconvex logarithmic (LOG) and exponential (EXP) penalty functions are the Laplace exponents of Gamma and compound Poisson subordinators, respectively. Additionally, we explore the concave conjugate of nonconvex penalties. We find that the LOG and EXP penalties are the concave conjugates of negative Kullback-Leibler (KL) distance functions. Furthermore, the relationship between these two penalties is due to asymmetricity of the KL distance. 1

5 0.58962047 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

Author: Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, Anatole V. Lilienfeld, Klaus-Robert Müller

Abstract: The accurate prediction of molecular energetics in chemical compound space is a crucial ingredient for rational compound design. The inherently graph-like, non-vectorial nature of molecular data gives rise to a unique and difficult machine learning problem. In this paper, we adopt a learning-from-scratch approach where quantum-mechanical molecular energies are predicted directly from the raw molecular geometry. The study suggests a benefit from setting flexible priors and enforcing invariance stochastically rather than structurally. Our results improve the state-of-the-art by a factor of almost three, bringing statistical methods one step closer to chemical accuracy. 1

6 0.58248049 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

7 0.57915902 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

8 0.57886982 331 nips-2012-Symbolic Dynamic Programming for Continuous State and Observation POMDPs

9 0.57439512 344 nips-2012-Timely Object Recognition

10 0.56855905 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

11 0.55171019 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

12 0.53924274 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

13 0.52883279 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

14 0.5270462 160 nips-2012-Imitation Learning by Coaching

15 0.52540588 38 nips-2012-Algorithms for Learning Markov Field Policies

16 0.52492887 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

17 0.52374107 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

18 0.52372611 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

19 0.51879621 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

20 0.51846761 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization