nips nips2012 nips2012-122 knowledge-graph by maker-knowledge-mining

122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress


Source: pdf

Author: Manuel Lopes, Tobias Lang, Marc Toussaint, Pierre-Yves Oudeyer

Abstract: Formal exploration approaches in model-based reinforcement learning estimate the accuracy of the currently learned model without consideration of the empirical prediction error. For example, PAC-MDP approaches such as R-MAX base their model certainty on the amount of collected data, while Bayesian approaches assume a prior over the transition dynamics. We propose extensions to such approaches which drive exploration solely based on empirical estimates of the learner’s accuracy and learning progress. We provide a “sanity check” theoretical analysis, discussing the behavior of our extensions in the standard stationary finite state-action case. We then provide experimental studies demonstrating the robustness of these exploration measures in cases of non-stationary environments or where original approaches are misled by wrong domain assumptions.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 For example, PAC-MDP approaches such as R-MAX base their model certainty on the amount of collected data, while Bayesian approaches assume a prior over the transition dynamics. [sent-2, score-0.304]

2 We propose extensions to such approaches which drive exploration solely based on empirical estimates of the learner’s accuracy and learning progress. [sent-3, score-0.711]

3 We then provide experimental studies demonstrating the robustness of these exploration measures in cases of non-stationary environments or where original approaches are misled by wrong domain assumptions. [sent-5, score-0.491]

4 With an efficient model learner, the estimated transition model can be guaranteed to be approximately correct after a sufficient, and efficient, number of visitations in a stationary domain. [sent-10, score-0.37]

5 An alternative approach to exploration is Bayesian RL [11]. [sent-11, score-0.448]

6 Bayesian RL exploits prior knowledge about the transition dynamics to reason explicitly about the uncertainty of the estimated model. [sent-12, score-0.289]

7 Interestingly, these existing approaches estimate the accuracy of the currently learned model based only on visitation counts. [sent-13, score-0.259]

8 This is the case if the transition dynamics change over time (so there is no correct stationary prior), or if we want to be able to ignore non-learnable, currently “too difficult” parts of the state space, either due to the level of noise or due to limitations of a learning algorithm. [sent-19, score-0.459]

9 Previous work in this direction emphasizes the concept of intrinsic motivation [10, 13, 12] and has shown empirical success in developmental robotics [1]. [sent-22, score-0.293]

10 An interesting aspect about this work is its reliance on empirical measures of learning progress to drive exploration in reinforcement learning [17, 6]. [sent-23, score-0.881]

11 In this paper, we aim to draw these relations and make the following contributions: (i) We propose to drive exploration in model-based RL by the estimated progress in model learning. [sent-25, score-0.753]

12 We estimate this progress in terms of the loss over the training data used for model learning. [sent-26, score-0.201]

13 (ii) We introduce two algorithms based on modifications of R-MAX and the recent Bayesian exploration bonus (BEB) approach [9]. [sent-27, score-0.564]

14 In contrast to the existing approaches, our algorithms do not have to assume correct prior knowledge or that the visitation counts translate directly to model certainty. [sent-28, score-0.337]

15 In the next section, we review background work on exploration in Markov decision processes. [sent-31, score-0.448]

16 Then we present our approaches for exploration based on an empirical estimate of the future model learning progress. [sent-32, score-0.538]

17 Background on Exploration: We model the interaction of an agent with its environment as a Markov decision process (MDP). [sent-35, score-0.223]

18 An MDP is a discrete-time stochastic control process where at each time-step the process is in one of a fixed set S of states and the agent can choose an action from a set A. [sent-36, score-0.359]

19 The transition model T specifies the conditional transition distribution T(s′ | s, a) over successor states s′ when executing an action a in a given state s. [sent-37, score-0.595]

20 In unstructured finite state and action spaces, T can be defined by separate multinomial distributions T(· | s, a) over successor states for each state-action pair (s, a). [sent-38, score-0.448]

21 The agent receives rewards in states according to a function R : S → R. [sent-39, score-0.343]

22 A policy π : S → A specifies for each state the action to take. [sent-40, score-0.343]

23 The goal of planning in an MDP is to find the optimal policy π ∗ which maximizes the expected future rewards E[R]. [sent-41, score-0.275]

24 The value of state s when acting according to policy π is defined as the expected future rewards when starting from s, V^π(s) = E[R | s_1 = s, π]. [sent-43, score-0.327]

25 The optimal policy π* can be found by classical algorithms such as value iteration or policy iteration. [sent-44, score-0.397]
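
To make the planning step above concrete, here is a minimal value-iteration sketch in Python; the tabular MDP representation (a transition array T, a state-reward vector R and a discount factor gamma) is an illustrative assumption and not taken from the paper.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration for an MDP.

    T: array of shape (S, A, S), T[s, a, s2] = P(s2 | s, a)
    R: array of shape (S,), the state rewards R(s)
    Returns the optimal value function V and a greedy policy.
    """
    S, A, _ = T.shape
    V = np.zeros(S)
    while True:
        # Q[s, a] = R(s) + gamma * sum_s2 T[s, a, s2] * V[s2]
        Q = R[:, None] + gamma * T.reshape(S * A, S).dot(V).reshape(S, A)
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```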

26 In reinforcement learning, the agent does not know the transition model T . [sent-45, score-0.474]

27 In a model-based approach, the agent estimates T̂ from its interaction trace with the environment Δ = s_1, a_1, r_1, s_2, … [sent-46, score-0.223]

28 A simple approach to the exploitation-exploration tradeoff is ε-greedy: the agent performs a random action for exploration with probability ε and exploits otherwise by executing a greedy policy with respect to T̂. [sent-51, score-0.88]

29 If ε decreases over time towards 0, ε-greedy exploration converges to π*. [sent-52, score-0.448]
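
As a small illustration of the ε-greedy scheme, the following sketch selects actions with a decaying ε; the decay schedule and the Q-value table derived from the learned model are assumptions made only for the example.

```python
import numpy as np

def epsilon_greedy_action(Q_hat, s, t, eps0=1.0, decay=1e-3, rng=None):
    """Random action with probability eps(t), greedy action otherwise.

    Q_hat: array (S, A) of Q-values computed from the learned model T_hat
    t:     current time step; eps(t) = eps0 / (1 + decay * t) decays to 0
    """
    rng = rng or np.random.default_rng()
    eps = eps0 / (1.0 + decay * t)
    if rng.random() < eps:
        return int(rng.integers(Q_hat.shape[1]))  # explore: random action
    return int(np.argmax(Q_hat[s]))               # exploit: greedy action
```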

30 In the PAC-MDP (probably approximately correct) framework, the efficiency of an exploration algorithm A is measured by its sample complexity [7]. [sent-54, score-0.448]

31 The model-based RL algorithms E³ [8] and R-MAX [4] are PAC-MDP efficient exploration methods for unstructured finite state and action spaces: their sample complexity scales polynomially in |S| and |A|. [sent-57, score-0.752]

32 E³ and R-MAX share the central concept of known states and actions which have been observed sufficiently often. [sent-58, score-0.198]

33 If the visitation count c(s, a) of a state-action pair (s, a) is larger than some threshold m, the estimate T̂(s, a) is guaranteed with high probability to be ε-close to the true model. [sent-60, score-0.22]

34 Following a policy in these known states achieves approximately the same rewards in the learned model T̂ and the true model T. [sent-61, score-0.386]

35 PAC-MDP approaches like R-MAX ignore the current empirical progress in learning T: the threshold m is fixed a priori and remains the same for all (s, a), independently of the agent’s experiences or its estimated relevance of (s, a) for large rewards. [sent-69, score-0.465]
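
To illustrate the known-state mechanism described above, here is a sketch of the R-MAX-style optimistic reward built from visitation counts; the array shapes and the R_max constant are assumptions for the example, not the authors' implementation.

```python
import numpy as np

def rmax_reward(R, counts, m, R_max):
    """R-MAX-style reward: a state-action pair with fewer than m visits is
    treated as unknown and receives the optimistic reward R_max, so the
    greedy policy for this reward is driven towards unexplored pairs.

    R:      array (S, A) of estimated rewards
    counts: array (S, A) of visitation counts c(s, a)
    m:      fixed a-priori known-ness threshold
    """
    R_opt = R.copy()
    R_opt[counts < m] = R_max
    return R_opt
```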

36 Here, the agent takes its uncertainty about the learned model T̂ explicitly into account. [sent-71, score-0.199]

37 More formally, the agent maintains a posterior belief b over all possible transition models T given its previous experiences ∆ and a prior. [sent-73, score-0.429]

38 The value function for a deterministic policy π(b, s) is defined in terms of the state s and the belief state b and fulfills V^π(b, s) = R(s, π(b, s)) + Σ_{b′,s′} T(b′, s′ | b, s, π(b, s)) V^π(b′, s′). (1) [sent-74, score-0.408]

39 The optimal Bayesian policy π* = argmax_π V^π(b, s) solves the exploitation-exploration tradeoff implicitly: π* considers how actions affect not only the state of the world, but also the agent’s internal belief about the world. [sent-75, score-0.404]

40 In a Bayesian RL context for a finite horizon H, the sample complexity of an algorithm A can be defined as the number of time-steps when following A where its policy π_t^A at time t is not ε-Bayesian-optimal, that is, where V_H^{π_t^A}(b_t, s_t) < V_H^*(b_t, s_t) − ε. [sent-76, score-0.347]

41 A recent approximate solution to Bayesian RL for unstructured finite representations is the Bayesian exploration bonus (BEB) [9], which closely resembles MBIE-EB [15]. [sent-78, score-0.679]

42 BEB avoids reasoning in the belief space: it solves an MDP built from the mean estimate T̂_b, using an additional exploration bonus β/(1 + c(s, a)) to reward state-action pairs inversely according to their visitation counts c(s, a). [sent-80, score-0.908]
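
For comparison with the R-MAX sketch above, the BEB-style bonus can be written as follows; the reward and count arrays are the same illustrative assumptions.

```python
import numpy as np

def beb_reward(R, counts, beta):
    """BEB-style reward: add the exploration bonus beta / (1 + c(s, a)),
    which is largest for rarely visited state-action pairs and fades as
    the visitation counts grow."""
    return R + beta / (1.0 + counts)
```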

43 However, their bounds do not apply to changing transition dynamics and it remains unanswered how to incorporate them in efficient exploration algorithms. [sent-84, score-0.65]

44 In a wider context of RL and developmental robotics, many strategies for efficient exploration have been subsumed by the concept of intrinsic motivation [10, 13] which is also termed fun or curiosity [12]. [sent-85, score-0.661]

45 Many of these approaches take empirical learning progress into account. [sent-86, score-0.261]

46 This includes methods that estimate from the agent’s experience the amount of potential learning progress in different regions of the state space. [sent-87, score-0.285]

47 Thereby, exploration focuses on those areas where learning progress can indeed be made: areas which are neither already well-understood nor currently too difficult to learn. [sent-88, score-0.648]

48 So far, however, guarantees about the sample complexity of intrinsic motivation based exploration approaches have been missing. [sent-92, score-0.607]

49 In this paper, we take up the ideas of intrinsic motivation to extend the theoretically founded exploration approaches described above. [sent-93, score-0.607]

50 In those problems, we know that after a fixed number of visits to a state its estimated transition model becomes approximately correct and we can perform exact belief updates to guarantee Bayesian optimality. [sent-95, score-0.48]

51 In the following, we present extensions which rely instead on previous exploration and learning experience to estimate in which parts of the state and action space further exploration is promising and where not. [sent-96, score-1.141]

52 in domains where a polynomial KWIK learner is not available), or when the transition dynamics change over time. [sent-99, score-0.304]

53 Exploration Driven by Learning Progress: Let ζ : S × A → ℝ denote a measure of the expected learning and exploration progress when visiting a state-action pair (s, a). [sent-102, score-0.619]

54 Hence, an exploration strategy based on ζ needs to re-estimate ζ with each new experience. [sent-105, score-0.448]

55 ζ-R-MAX acts greedily with respect to the optimal policy for the reward function R_{ζ-R-MAX}(s, a) = R(s, a) if ζ(s, a) < m, and R_max otherwise. (2) [sent-108, score-0.285]

56 Instead of rewarding arbitrary states with low visitation counts (considered unknown) directly as in R-MAX, ζ-R-MAX gets large reward for exploring such state-action pairs where the expected learning progress is large. [sent-109, score-0.564]

57 ζ-EB acts greedily with respect to the optimal policy for the reward function R_{ζ-EB}(s, a) = R(s, a) + β / (1 + 1/√ζ(s, a)) (3) for some constant β. [sent-111, score-0.285]

58 Instead of setting the exploration bonus based directly on visitation counts as in BEB, ζ-EB gets a bonus for exploring state-action pairs where the expected learning progress is large. [sent-112, score-0.564]
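
A sketch of the two modified reward functions in Eqs. (2) and (3), assuming ζ(s, a) has already been estimated for every state-action pair; the exact algebraic form of Eq. (3) is reconstructed from garbled PDF text, so it should be read as an assumption rather than the authors' definitive formula.

```python
import numpy as np

def zeta_rmax_reward(R, zeta, m, R_max):
    """Eq. (2): keep R(s, a) where zeta(s, a) < m (little left to learn),
    and use the optimistic reward R_max where the expected learning
    progress is still at least m."""
    R_opt = R.copy()
    R_opt[zeta >= m] = R_max
    return R_opt

def zeta_eb_reward(R, zeta, beta, eps=1e-12):
    """Eq. (3) as reconstructed here: bonus beta / (1 + 1/sqrt(zeta)),
    which approaches beta where zeta is large and vanishes where it is
    small."""
    return R + beta / (1.0 + 1.0 / np.sqrt(zeta + eps))
```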

59 The idea of using expected learning progress to drive exploration is that we can estimate ζ empirically from the interaction data Δ = s_1, a_1, r_1, s_2, … [sent-113, score-0.746]

60 In general, we assume that learning T̂(s, a) implies minimizing the loss L(T̂(s, a); D_{s,a}), where D_{s,a} = {s′_i}_{i=1}^{n_{s,a}} are the successor states in the transitions (s, a, s′_i) from (s, a) in Δ. [sent-124, score-0.251]

61 An empirical estimator of the prediction error (PE) based on the available data D_{s,a} is the leave-one-out cross-validation estimator CV(D_{s,a}, s, a) = (1/|D_{s,a}|) Σ_{s′ ∈ D_{s,a}} ℓ^{loo}_{s′}, with ℓ^{loo}_{s′} := L(T̂^{−s′}; {s′}), (6) where T̂^{−s′} is the model learned from the data D^{−s′}_{s,a} = D_{s,a} \ {s′}. [sent-128, score-0.566]

62 Putting an absolute threshold directly on the loss to decide whether a state is known or unknown is hard. [sent-129, score-0.204]

63 Therefore, we propose to drive exploration based on the learning progress instead of the current learner accuracy. [sent-131, score-0.776]

64 The ζ̂(s, a) we use in the concrete exploration algorithms ζ-EB and ζ-R-MAX includes an additional variance margin, ζ̂(s, a) := ζ(s, a) + α ν(s, a), (8) where ν(s, a) is an estimate of the CV variance (discussed in more detail below), ν(s, a) = (1/|D_{s,a}|) Σ_{s′ ∈ D_{s,a}} [ℓ^{loo}_{s′} − CV(D_{s,a}, s, a)]². [sent-140, score-0.698]
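
The following sketch estimates the leave-one-out cross-validation loss of Eq. (6) and the variance margin of Eq. (8) for a single state-action pair, assuming a Dirichlet-multinomial model of T(s, a) and a negative-log-likelihood loss; both modelling choices are assumptions, since the extracted sentences leave the loss L generic, and the paper's actual learning-progress measure ζ is built on top of these quantities rather than being equal to them.

```python
import numpy as np
from collections import Counter

def loo_cv_statistics(successors, n_states, alpha=1.0):
    """Leave-one-out CV loss (cf. Eq. (6)) and its empirical variance
    (the nu(s, a) used in Eq. (8)) for one state-action pair.

    successors: list of observed successor states for (s, a)
    n_states:   size of the state space S
    alpha:      Dirichlet smoothing parameter of the multinomial model
    """
    n = len(successors)
    counts = Counter(successors)
    losses = []
    for s2 in successors:
        # model T_hat^{-s2} learned from D_{s,a} without this sample
        p = (counts[s2] - 1 + alpha) / (n - 1 + alpha * n_states)
        losses.append(-np.log(p))          # negative log-likelihood loss
    losses = np.asarray(losses)
    cv = losses.mean()                     # CV(D_{s,a}, s, a)
    nu = ((losses - cv) ** 2).mean()       # empirical CV variance
    return cv, nu

def zeta_hat(zeta, nu, alpha_margin):
    """Eq. (8): progress estimate plus a variance margin."""
    return zeta + alpha_margin * nu
```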

65 Guarantees on the Exploration Efficiency: As discussed in the introduction, we propose our extensions of R-MAX and BEB to target scenarios which go beyond the standard setting of stationary, unstructured, finite state and action spaces. [sent-143, score-0.328]

66 In this subsection, however, we go back and consider the behavior of our extensions exactly under these classical assumptions—this is meant as a sanity check to ensure that our extensions inherit the standard exploration efficiency properties under standard assumptions. [sent-144, score-0.667]

67 We will directly relate a threshold on the empirical ζ(s, a) to a threshold on model accuracy. [sent-145, score-0.233]

68 Unfortunately, the agent only has access to an empirical estimate. [sent-157, score-0.208]

69 Its variance can be described by first considering the covariance matrix C of the vector (ℓ^{loo}_{s′})_{s′ ∈ D_{s,a}} under random D_{s,a}. [sent-160, score-0.202]

70 The diagonal entries are the variances Var_D{ℓ^{loo}_{s′}} of each single ℓ^{loo}_{s′} under random data, which are independent of s′ (assuming i.i.d. data). [sent-161, score-0.308]

71 The off-diagonal entries of C capture the correlations between different ℓ^{loo}_{s′} and ℓ^{loo}_{s″} and are constant (due to i.i.d. data). [sent-166, score-0.308]

72 Now that we have a confidence measure on the estimator, we can show that a threshold on the empirical estimator ζ̂(s, a) = ζ(s, a) + α ν(s, a) implies a threshold on the mean ζ̄(s, a) over random D_{s,a} (Lemma 3). [sent-174, score-0.381]

73 Finally, we show that our exploration method using the empirical ζ(s, a) to drive exploration is PAC-MDP efficient. [sent-179, score-1.032]

74 There is a threshold m such that ζ-R-MAX using a Dirichlet learner in the standard setting of stationary, unstructured, finite state and action spaces is PAC-MDP efficient. [sent-181, score-0.382]

75 From Lemma 3 we know that a threshold on our empirical measure implies a threshold on the mean measure. [sent-183, score-0.28]

76 Evaluation: We compare the empirical performance of our exploration algorithms ζ-R-MAX and ζ-EB with the performance of R-MAX, BEB and simple model-based ε-greedy exploration with optimistic initialization in unstructured finite state and action spaces. [sent-187, score-1.217]

77 R-MAX is assumed to know correct thresholds m on the number of visits to states that ensure accurate transition models. [sent-191, score-0.448]

78 We simulate satisfied or violated assumptions of R-MAX by setting individual thresholds for states: in general, states with higher noise require more samples to achieve the same level of model accuracy as states with low noise. [sent-192, score-0.485]

79 (c) And are our approaches more robust to unexpected changes in the transition dynamics? [sent-196, score-0.208]

80 There is a single goal state with a reward of 1 (marked with “G”) and several states with negative rewards −0. [sent-199, score-0.326]

81 The lighter states are noisy states: their actions have less predictable effects. [sent-201, score-0.211]

82 The transition probabilities of the noisy states are sampled from a Dirichlet distribution with parameters α = 0. [sent-202, score-0.242]

83 To find this optimal path and avoid local minima, an exploration algorithm needs to explore sufficiently and estimate the state values accurately. [sent-207, score-0.558]

84 We evaluate the performance of the algorithms in terms of the reward collected in the true model T using the optimal policy π*_{T̂} for their learned model T̂. [sent-208, score-0.314]

85 In our results, we report the policy value error, defined as V_T(s_I; π*_T) − V_T(s_I; π*_{T̂}), i.e. the error in the value of the start state s_I with respect to the optimal policy π*. [sent-209, score-0.478]
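
A sketch of how the reported policy value error could be computed, reusing the value_iteration sketch given earlier: plan in the learned model, evaluate the resulting policy in the true model, and compare with the true optimal value at the start state. The evaluation protocol details here are assumptions for illustration.

```python
import numpy as np

def policy_evaluation(T, R, policy, gamma=0.95, tol=1e-6):
    """Value of a deterministic policy in the model T."""
    S = T.shape[0]
    V = np.zeros(S)
    while True:
        V_new = R + gamma * np.array([T[s, policy[s]].dot(V) for s in range(S)])
        if np.abs(V_new - V).max() < tol:
            return V_new
        V = V_new

def policy_value_error(T_true, T_hat, R, s_I, gamma=0.95):
    """V_T(s_I; pi*_T) - V_T(s_I; pi*_T_hat): the value lost at the start
    state by acting optimally for the learned model instead of the true one."""
    _, pi_star = value_iteration(T_true, R, gamma)  # optimal for true model
    _, pi_hat = value_iteration(T_hat, R, gamma)    # optimal for learned model
    V_star = policy_evaluation(T_true, R, pi_star, gamma)
    V_hat = policy_evaluation(T_true, R, pi_hat, gamma)
    return V_star[s_I] - V_hat[s_I]
```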

86 Experiment 1: Correct Assumptions. In our first experiment, the assumptions of BEB and R-MAX are fulfilled: BEB is given the correct prior; R-MAX uses appropriate thresholds for states (depending on the state stochasticity). [sent-213, score-0.419]

87 Fig. 1(b) shows that our exploration methods ζ-R-MAX and ζ-EB achieve similar performance to the original methods, even without having a correct prior or state-dependent thresholds. [sent-216, score-0.575]

88 In contrast, ε-greedy does not find the optimal policy in reasonable time. [sent-218, score-0.213]

89 Fig. 1(c) shows that, as expected, R-MAX and BEB no longer converge to the optimal policy: they explore too long in states whose dynamics are already well estimated, while neglecting states which require more samples for an accurate model estimate. [sent-225, score-0.355]

90 Experiment 3: Change in Dynamics In our final experiment, the transition dynamics for a randomly chosen state along the optimal path get permuted after 900 steps. [sent-227, score-0.342]

91 As Fig. 1(d) shows, R-MAX and BEB with correct assumptions for the original problem (before time-step 900) cannot compensate for this, as they base their estimate of the model certainty only on the visitation counts but do not look at the data itself. [sent-229, score-0.354]

92 In contrast, ζ-R-MAX and ζ-EB detect such an unexpected event and can refocus their exploration efforts. [sent-230, score-0.529]

93 Conclusions and Future Work: We have proposed to drive exploration in model-based reinforcement learning using the estimated future progress in model learning. [sent-231, score-0.879]

94 (b) Like R-MAX and BEB with correct assumptions, our algorithms ζ-R-MAX and ζ-EB, based on an empirical estimate of the learning progress, converge to the optimal policy without relying on these assumptions, but take a small extra amount of time. [sent-235, score-0.549]

95 (c) When their assumptions are violated, R-MAX and BEB fail to converge, while ζ-R-MAX and ζ-EB do not rely on these assumptions and again find the optimal policy. [sent-236, score-0.223]

96 (d) In contrast to existing methods, ζ-R-MAX and ζ-EB can cope with the change in transition dynamics after 900 steps and refocus their exploration. [sent-237, score-0.291]

97 algorithms can be defined which do not rely on correct prior knowledge and can cope with changing transition dynamics. [sent-238, score-0.3]

98 It is also worth investigating in more depth the relation of our approach to the general concept of intrinsic motivation as proposed in developmental robotics. [sent-241, score-0.213]

99 In our view, a combination of methods that trades off strong prior assumptions against empirical estimates of the learning progress seems the most promising direction for future work on exploration in the real world. [sent-242, score-0.805]

100 Active learning of inverse models with intrinsically motivated goal exploration in robots. [sent-249, score-0.535]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('beb', 0.481), ('exploration', 0.448), ('policy', 0.184), ('progress', 0.171), ('agent', 0.17), ('loo', 0.154), ('rl', 0.144), ('transition', 0.131), ('visitation', 0.127), ('reinforcement', 0.126), ('bonus', 0.116), ('unstructured', 0.115), ('max', 0.115), ('states', 0.111), ('cv', 0.102), ('assumptions', 0.097), ('threshold', 0.093), ('violated', 0.09), ('drive', 0.089), ('correct', 0.085), ('counts', 0.083), ('state', 0.081), ('action', 0.078), ('sanity', 0.077), ('rmax', 0.077), ('estimator', 0.074), ('reward', 0.072), ('dynamics', 0.071), ('learner', 0.068), ('bayesian', 0.067), ('intrinsic', 0.067), ('experiences', 0.066), ('successor', 0.063), ('belief', 0.062), ('stationary', 0.062), ('rewards', 0.062), ('st', 0.06), ('developmental', 0.058), ('intrinsically', 0.058), ('scenarios', 0.054), ('environment', 0.053), ('extensions', 0.053), ('lighter', 0.052), ('mdp', 0.052), ('ful', 0.051), ('motivation', 0.049), ('actions', 0.048), ('variance', 0.048), ('curriculum', 0.047), ('refocus', 0.047), ('visitations', 0.047), ('si', 0.047), ('know', 0.047), ('empirical', 0.047), ('thresholds', 0.045), ('estimated', 0.045), ('certainty', 0.045), ('horizon', 0.043), ('approaches', 0.043), ('cope', 0.042), ('prior', 0.042), ('concept', 0.039), ('bordeaux', 0.039), ('vh', 0.039), ('curious', 0.039), ('fard', 0.039), ('oudeyer', 0.039), ('empirically', 0.038), ('dirichlet', 0.038), ('autonomous', 0.037), ('check', 0.036), ('rgen', 0.036), ('lang', 0.036), ('satinder', 0.036), ('tobias', 0.034), ('unexpected', 0.034), ('pac', 0.034), ('darker', 0.034), ('crossvalidation', 0.034), ('tb', 0.034), ('polynomial', 0.034), ('converge', 0.033), ('robotics', 0.033), ('experience', 0.033), ('toussaint', 0.033), ('strehl', 0.033), ('stochasticity', 0.033), ('yoshua', 0.033), ('fu', 0.032), ('accuracy', 0.031), ('unbiased', 0.03), ('loss', 0.03), ('permuted', 0.03), ('polynomially', 0.03), ('currently', 0.029), ('motivated', 0.029), ('optimal', 0.029), ('visits', 0.029), ('learned', 0.029), ('experiment', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

Author: Manuel Lopes, Tobias Lang, Marc Toussaint, Pierre-Yves Oudeyer

Abstract: Formal exploration approaches in model-based reinforcement learning estimate the accuracy of the currently learned model without consideration of the empirical prediction error. For example, PAC-MDP approaches such as R-MAX base their model certainty on the amount of collected data, while Bayesian approaches assume a prior over the transition dynamics. We propose extensions to such approaches which drive exploration solely based on empirical estimates of the learner’s accuracy and learning progress. We provide a “sanity check” theoretical analysis, discussing the behavior of our extensions in the standard stationary finite state-action case. We then provide experimental studies demonstrating the robustness of these exploration measures in cases of non-stationary environments or where original approaches are misled by wrong domain assumptions.

2 0.28899255 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

Author: Arthur Guez, David Silver, Peter Dayan

Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayesoptimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration. 1

3 0.19704035 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

Author: Dongho Kim, Kee-eung Kim, Pascal Poupart

Abstract: In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected longterm total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems. 1

4 0.19235475 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume

Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function.

5 0.18226781 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

Author: Trung Nguyen, Tomi Silander, Tze Y. Leong

Abstract: We study how to automatically select and adapt multiple abstractions or representations of the world to support model-based reinforcement learning. We address the challenges of transfer learning in heterogeneous environments with varying tasks. We present an efficient, online framework that, through a sequence of tasks, learns a set of relevant representations to be used in future tasks. Without predefined mapping strategies, we introduce a general approach to support transfer learning across different state spaces. We demonstrate the potential impact of our system through improved jumpstart and faster convergence to near optimum policy in two benchmark domains. 1

6 0.16611272 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

7 0.16537978 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

8 0.16131794 38 nips-2012-Algorithms for Learning Markov Field Policies

9 0.15804318 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

10 0.15668905 348 nips-2012-Tractable Objectives for Robust Policy Optimization

11 0.14741832 344 nips-2012-Timely Object Recognition

12 0.14473766 160 nips-2012-Imitation Learning by Coaching

13 0.1421452 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

14 0.12819546 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

15 0.12616037 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

16 0.12583706 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

17 0.11419262 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

18 0.10989828 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

19 0.099445477 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

20 0.091284126 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.209), (1, -0.362), (2, -0.024), (3, -0.023), (4, -0.067), (5, 0.025), (6, -0.001), (7, 0.013), (8, -0.018), (9, -0.004), (10, -0.005), (11, -0.019), (12, 0.001), (13, -0.028), (14, -0.014), (15, -0.039), (16, 0.045), (17, 0.049), (18, -0.032), (19, 0.0), (20, 0.003), (21, 0.009), (22, -0.007), (23, 0.025), (24, -0.007), (25, 0.025), (26, 0.07), (27, -0.025), (28, 0.051), (29, 0.009), (30, 0.066), (31, 0.015), (32, 0.01), (33, -0.001), (34, -0.003), (35, 0.029), (36, 0.012), (37, 0.062), (38, 0.066), (39, -0.054), (40, -0.018), (41, 0.001), (42, 0.059), (43, -0.033), (44, 0.033), (45, -0.063), (46, 0.066), (47, 0.063), (48, -0.12), (49, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95170438 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

Author: Manuel Lopes, Tobias Lang, Marc Toussaint, Pierre-Yves Oudeyer

Abstract: Formal exploration approaches in model-based reinforcement learning estimate the accuracy of the currently learned model without consideration of the empirical prediction error. For example, PAC-MDP approaches such as R-MAX base their model certainty on the amount of collected data, while Bayesian approaches assume a prior over the transition dynamics. We propose extensions to such approaches which drive exploration solely based on empirical estimates of the learner’s accuracy and learning progress. We provide a “sanity check” theoretical analysis, discussing the behavior of our extensions in the standard stationary finite state-action case. We then provide experimental studies demonstrating the robustness of these exploration measures in cases of non-stationary environments or where original approaches are misled by wrong domain assumptions.

2 0.8952449 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

Author: Dongho Kim, Kee-eung Kim, Pascal Poupart

Abstract: In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected longterm total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems. 1

3 0.84968054 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

Author: Arthur Guez, David Silver, Peter Dayan

Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayesoptimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration. 1

4 0.82047004 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

Author: Trung Nguyen, Tomi Silander, Tze Y. Leong

Abstract: We study how to automatically select and adapt multiple abstractions or representations of the world to support model-based reinforcement learning. We address the challenges of transfer learning in heterogeneous environments with varying tasks. We present an efficient, online framework that, through a sequence of tasks, learns a set of relevant representations to be used in future tasks. Without predefined mapping strategies, we introduce a general approach to support transfer learning across different state spaces. We demonstrate the potential impact of our system through improved jumpstart and faster convergence to near optimum policy in two benchmark domains. 1

5 0.79987961 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

Author: Feng Cao, Soumya Ray

Abstract: We describe an approach to incorporating Bayesian priors in the MAXQ framework for hierarchical reinforcement learning (HRL). We define priors on the primitive environment model and on task pseudo-rewards. Since models for composite tasks can be complex, we use a mixed model-based/model-free learning approach to find an optimal hierarchical policy. We show empirically that (i) our approach results in improved convergence over non-Bayesian baselines, (ii) using both task hierarchies and Bayesian priors is better than either alone, (iii) taking advantage of the task hierarchy reduces the computational cost of Bayesian reinforcement learning and (iv) in this framework, task pseudo-rewards can be learned instead of being manually specified, leading to hierarchically optimal rather than recursively optimal policies. 1

6 0.76232475 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

7 0.74555397 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

8 0.72685194 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

9 0.71292508 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

10 0.70662534 38 nips-2012-Algorithms for Learning Markov Field Policies

11 0.7050119 348 nips-2012-Tractable Objectives for Robust Policy Optimization

12 0.68128288 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

13 0.67057186 160 nips-2012-Imitation Learning by Coaching

14 0.66990662 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

15 0.66685307 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

16 0.64839363 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

17 0.62742913 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

18 0.6170916 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

19 0.60618049 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

20 0.59959632 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.035), (21, 0.053), (38, 0.131), (42, 0.024), (53, 0.013), (54, 0.1), (55, 0.011), (74, 0.048), (76, 0.138), (80, 0.108), (91, 0.205), (92, 0.04)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.885234 46 nips-2012-Assessing Blinding in Clinical Trials

Author: Ognjen Arandjelovic

Abstract: The interaction between the patient’s expected outcome of an intervention and the inherent effects of that intervention can have extraordinary effects. Thus in clinical trials an effort is made to conceal the nature of the administered intervention from the participants in the trial, i.e. to blind it. Yet, in practice perfect blinding is impossible to ensure or even verify. The current standard is to follow up the trial with an auxiliary questionnaire, which allows trial participants to express their belief concerning the assigned intervention and which is used to compute a measure of the extent of blinding in the trial. If the estimated extent of blinding exceeds a threshold the trial is deemed sufficiently blinded; otherwise, the trial is deemed to have failed. In this paper we make several important contributions. Firstly, we identify a series of fundamental problems of the aforesaid practice and discuss them in the context of the most commonly used blinding measures. Secondly, motivated by the highlighted problems, we formulate a novel method for handling imperfectly blinded trials. We too adopt a post-trial feedback questionnaire but interpret the collected data using an original approach, fundamentally different from those previously proposed. Unlike previous approaches, ours is void of any ad hoc free parameters, is robust to small changes in auxiliary data and is not predicated on any strong assumptions used to interpret participants’ feedback.

2 0.84370214 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification

Author: Wei Bi, James T. Kwok

Abstract: In hierarchical classification, the prediction paths may be required to always end at leaf nodes. This is called mandatory leaf node prediction (MLNP) and is particularly useful when the leaf nodes have much stronger semantic meaning than the internal nodes. However, while there have been a lot of MLNP methods in hierarchical multiclass classification, performing MLNP in hierarchical multilabel classification is much more difficult. In this paper, we propose a novel MLNP algorithm that (i) considers the global hierarchy structure; and (ii) can be used on hierarchies of both trees and DAGs. We show that one can efficiently maximize the joint posterior probability of all the node labels by a simple greedy algorithm. Moreover, this can be further extended to the minimization of the expected symmetric loss. Experiments are performed on a number of real-world data sets with tree- and DAG-structured label hierarchies. The proposed method consistently outperforms other hierarchical and flat multilabel classification methods. 1

same-paper 3 0.82549638 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

Author: Manuel Lopes, Tobias Lang, Marc Toussaint, Pierre-Yves Oudeyer

Abstract: Formal exploration approaches in model-based reinforcement learning estimate the accuracy of the currently learned model without consideration of the empirical prediction error. For example, PAC-MDP approaches such as R-MAX base their model certainty on the amount of collected data, while Bayesian approaches assume a prior over the transition dynamics. We propose extensions to such approaches which drive exploration solely based on empirical estimates of the learner’s accuracy and learning progress. We provide a “sanity check” theoretical analysis, discussing the behavior of our extensions in the standard stationary finite state-action case. We then provide experimental studies demonstrating the robustness of these exploration measures in cases of non-stationary environments or where original approaches are misled by wrong domain assumptions.

4 0.76641446 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

Author: James Lloyd, Peter Orbanz, Zoubin Ghahramani, Daniel M. Roy

Abstract: A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases. 1

5 0.76122361 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

Author: Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, Anatole V. Lilienfeld, Klaus-Robert Müller

Abstract: The accurate prediction of molecular energetics in chemical compound space is a crucial ingredient for rational compound design. The inherently graph-like, non-vectorial nature of molecular data gives rise to a unique and difficult machine learning problem. In this paper, we adopt a learning-from-scratch approach where quantum-mechanical molecular energies are predicted directly from the raw molecular geometry. The study suggests a benefit from setting flexible priors and enforcing invariance stochastically rather than structurally. Our results improve the state-of-the-art by a factor of almost three, bringing statistical methods one step closer to chemical accuracy. 1

6 0.76100928 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

7 0.75564796 344 nips-2012-Timely Object Recognition

8 0.74827427 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

9 0.74804872 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

10 0.74228907 38 nips-2012-Algorithms for Learning Markov Field Policies

11 0.74174154 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

12 0.74116653 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

13 0.73962456 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

14 0.73863989 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

15 0.73852503 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

16 0.73715508 160 nips-2012-Imitation Learning by Coaching

17 0.73546797 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

18 0.73102432 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

19 0.73015821 292 nips-2012-Regularized Off-Policy TD-Learning

20 0.72881532 348 nips-2012-Tractable Objectives for Robust Policy Optimization