nips nips2012 nips2012-108 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arthur Guez, David Silver, Peter Dayan
Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayes-optimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. [sent-11, score-0.207]
2 In this paper we introduce a tractable, sample-based method for approximate Bayes-optimal planning which exploits Monte-Carlo tree search. [sent-13, score-0.332]
3 Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. [sent-14, score-0.397]
4 1 Introduction A key objective in the theory of Markov Decision Processes (MDPs) is to maximize the expected sum of discounted rewards when the dynamics of the MDP are (perhaps partially) unknown. [sent-16, score-0.191]
5 The discount factor pressures the agent to favor short-term rewards, but potentially costly exploration may identify better rewards in the long-term. [sent-17, score-0.311]
6 The agent starts in the belief state corresponding to its prior and, by executing the greedy policy in the BAMDP whilst updating its posterior, acts optimally (with respect to its beliefs) in the original MDP. [sent-21, score-0.378]
7 In this framework, rich prior knowledge about statistics of the environment can be naturally incorporated into the planning process, potentially leading to more efficient exploration and exploitation of the uncertain world. [sent-22, score-0.382]
8 Here, we present a tractable approach that exploits and extends recent advances in Monte-Carlo tree search (MCTS) [16, 20], while avoiding the problems associated with applying MCTS directly to the BAMDP. [sent-25, score-0.236]
9 This MDP is used to simulate a single episode whose outcome is used to update the value of each node of the search tree traversed during the simulation. [sent-27, score-0.372]
10 1 Our algorithm is more efficient than previous sparse sampling methods for Bayes-adaptive planning [25, 6, 2], partly because it does not update the posterior belief state during the course of each simulation. [sent-33, score-0.416]
11 Consequently, our algorithm is particularly well suited to support planning in domains with richly structured prior knowledge — a critical requirement for applications of Bayesian reinforcement learning to large problems. [sent-35, score-0.335]
12 An MDP is described as a 5-tuple M = ⟨S, A, P, R, γ⟩, where S is the set of states, A is the set of actions, P : S × A × S → R is the state transition probability kernel, R : S × A → R is a bounded reward function, and γ is the discount factor [23]. [sent-38, score-0.233]
13 When all the components of the MDP tuple are known, standard MDP planning algorithms can be used to estimate the optimal value function and policy off-line. [sent-39, score-0.327]
14 After observing a history of actions and states ht = s1 a1 s2 a2 . . . at−1 st from the MDP, the posterior belief on P is updated using Bayes’ rule P(P | ht) ∝ P(ht | P) P(P). [sent-44, score-0.182]
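As a concrete illustration of this belief update, the sketch below maintains an independent Dirichlet posterior over P(· | s, a) for each state-action pair of a discrete MDP and updates it from observed transitions; the Dirichlet-Multinomial model and the class interface are illustrative assumptions rather than the paper's specific prior.

```python
import numpy as np

class DirichletBelief:
    """Posterior over the transition probabilities P(. | s, a) of a discrete MDP,
    assuming an independent Dirichlet prior for every state-action pair."""

    def __init__(self, n_states, n_actions, prior_count=1.0):
        # counts[s, a, s'] = prior pseudo-count + number of observed (s, a, s') transitions
        self.counts = np.full((n_states, n_actions, n_states), prior_count)

    def update(self, s, a, s_next):
        # Bayes' rule for the Dirichlet-Multinomial model reduces to a count increment.
        self.counts[s, a, s_next] += 1.0

    def sample_model(self, rng):
        # Draw a complete transition model P ~ P(P | h_t) from the posterior.
        n_states, n_actions, _ = self.counts.shape
        return np.array([[rng.dirichlet(self.counts[s, a]) for a in range(n_actions)]
                         for s in range(n_states)])
```

Here `rng` is a `numpy.random.Generator`; the same belief object is reused in the sketches below.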
16 The uncertainty about the dynamics of the model can be transformed into uncertainty about the current state inside an augmented state space S+ = S × H, where S is the state space in the original problem and H is the set of possible histories. [sent-45, score-0.346]
17 The dynamics associated with this augmented state space are described by P+(⟨s, h⟩, a, ⟨s′, h′⟩) = 1[h′ = has′] ∫_P P(s, a, s′) P(P | h) dP and R+(⟨s, h⟩, a) = R(s, a) (1). Together, the 5-tuple M+ = ⟨S+, A, P+, R+, γ⟩ forms the Bayes-Adaptive MDP (BAMDP) for the MDP problem M. [sent-46, score-0.15]
18 Since the dynamics of the BAMDP are known, it can in principle be solved to obtain the optimal value function associated with each action: Q*(⟨st, ht⟩, a) = max_π E_π[ Σ_{t′=t}^∞ γ^(t′−t) r_{t′} | at = a ] (2), from which the optimal action for each state can be readily derived. [sent-47, score-0.411]
19 1 Optimal actions in the BAMDP are executed greedily in the real MDP M and constitute the best course of action for a Bayesian agent with respect to its prior belief over P. [sent-48, score-0.36]
20 It is obvious that the expected performance of the BAMDP policy in the MDP M is bounded above by that of the optimal policy obtained with a fully-observable model, with equality occurring, for example, in the degenerate case in which the prior only has support on the true model. [sent-49, score-0.326]
21 1 The BAMCP algorithm: Algorithm Description. The goal of a BAMDP planning method is to find, for each decision point ⟨s, h⟩ encountered, the action a that maximizes Equation 2. [sent-51, score-0.323]
22 We avoid this cost by only sampling a single transition model P^i from the posterior at the root of the search tree. (Footnote 1: the redundancy in the state-history tuple notation — st is the suffix of ht — is only present to ensure clarity of exposition.) [sent-59, score-0.642]
23 Sample-based tree search then acts as a filter, ensuring that the correct distribution of state successors is obtained at each of the tree nodes, as if it were sampled from P+. [sent-61, score-0.474]
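A minimal sketch of this root-sampling loop, assuming a belief object that can sample a full model (as above) and a `simulate_fn` that performs one tree traversal plus rollout under that model and returns the root action taken together with the simulated return; the interface names are illustrative, not the paper's code.

```python
def bamcp_search(root, belief, actions, n_simulations, simulate_fn, rng):
    """Outer loop of root sampling: each simulation draws one model P^i from the
    current posterior and reuses it for every transition of that simulation,
    so no belief updates are performed inside the tree."""
    stats = {a: [0, 0.0] for a in actions}        # action -> [visit count, mean return]
    for _ in range(n_simulations):
        model_i = belief.sample_model(rng)        # P^i ~ P(P | h_t), sampled once at the root
        a, ret = simulate_fn(root, model_i, rng)  # one traversal of the tree plus a rollout
        n, mean = stats[a]
        stats[a] = [n + 1, mean + (ret - mean) / (n + 1)]
    # Act greedily at the root with respect to the Monte-Carlo value estimates.
    return max(stats, key=lambda a: stats[a][1])
```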
24 2 BA-UCT with Root Sampling The root node of the search tree at a decision point represents the current state of the BAMDP. [sent-64, score-0.44]
25 The tree is composed of state nodes representing belief states ⟨s, h⟩ and action nodes representing the effect of particular actions from their parent state node. [sent-65, score-0.553]
26 The visit counts, N(⟨s, h⟩) for state nodes and N(⟨s, h⟩, a) for action nodes, are initialized to 0 and updated throughout search. [sent-66, score-0.182]
27 Each simulation traverses the tree without backtracking by following the UCT policy at state nodes, defined by argmax_a Q(⟨s, h⟩, a) + c √(log(N(⟨s, h⟩)) / N(⟨s, h⟩, a)), where c is an exploration constant that needs to be set appropriately. [sent-68, score-0.59]
28 That is, at action node (⟨s, h⟩, a), s′ is sampled from P^i(s, a, ·), and the new state node is set to ⟨s′, has′⟩. [sent-70, score-0.328]
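The UCT selection rule above can be written as a small helper; a sketch assuming per-node tables of Q-values and visit counts, and using the square-root form of the UCB exploration term.

```python
import math

def uct_action(actions, q, n_node, n_action, c):
    """Select argmax_a Q(<s,h>, a) + c * sqrt(log N(<s,h>) / N(<s,h>, a)).
    Untried actions (N(<s,h>, a) = 0) get infinite value so they are expanded first."""
    def ucb(a):
        if n_action[a] == 0:
            return float("inf")
        return q[a] + c * math.sqrt(math.log(n_node) / n_action[a])
    return max(actions, key=ucb)
```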
29 When a simulation reaches a leaf, the tree is expanded by attaching a new state node with its connected action nodes, and a rollout policy πro is used to control the MDP defined by the current P i to some fixed depth (determined using the discount factor). [sent-71, score-0.882]
30 The rollout provides an estimate of the value Q(⟨s, h⟩, a) from the leaf action node. [sent-72, score-0.32]
31 (i.e., the mean of the sampled returns obtained from that action node over the simulations). [sent-75, score-0.204]
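One way to choose the fixed rollout depth mentioned above is to truncate once the discounted tail of the return becomes negligible; the cut-off rule and the rollout interface below are illustrative assumptions, since the text only states that the depth is determined from the discount factor.

```python
import math

def rollout_depth(gamma, precision, r_max):
    """Smallest depth d with gamma^d * r_max / (1 - gamma) <= precision, i.e. the
    discounted reward obtainable beyond depth d is negligible."""
    return math.ceil(math.log(precision * (1.0 - gamma) / r_max) / math.log(gamma))

def rollout(state, model, reward_fn, policy, gamma, depth, rng):
    """Monte-Carlo rollout under the sampled model P^i; the discounted return is
    used as the initial value estimate for the newly expanded leaf node."""
    ret, discount = 0.0, 1.0
    for _ in range(depth):
        a = policy(state)
        ret += discount * reward_fn(state, a)
        state = rng.choice(len(model[state][a]), p=model[state][a])
        discount *= gamma
    return ret
```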
32 The tree policy treats the forward search as a meta-exploration problem, preferring to exploit regions of the tree that currently appear better than others while continuing to explore unknown or less known parts of the tree. [sent-78, score-0.521]
33 Nevertheless, all parts of the tree are eventually visited infinitely often, and therefore the algorithm will eventually converge on the Bayes-optimal policy (see Section 3). [sent-80, score-0.285]
34 3 Lazy Sampling In previous work on sample-based tree search, indeed including POMCP [20], a complete sample state is drawn from the posterior at the root of the search tree. [sent-88, score-0.393]
35 Consider P(s, a, ·) to be parametrized by a latent variable θs,a for each state and action pair. [sent-91, score-0.161]
36 For each simulation i, we sample φ ∼ P(φ | ht) at the root and then lazily sample the θ_{st,at} parameters as required, conditioned on φ and all Θ_{t−1} parameters sampled for the current simulation. [sent-99, score-0.198]
37 For example, if the transition parameters for different states and actions are independent, we can completely forgo sampling a complete P, and instead draw any necessary parameters individually for each state-action pair. [sent-101, score-0.191]
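A minimal sketch of lazy sampling under that independence assumption: a parameter θ_{s,a} is drawn from the posterior only when a simulation first visits the pair (s, a), and is then cached for the rest of that simulation; the Dirichlet-style counts are an illustrative assumption carried over from the earlier sketch.

```python
class LazyModel:
    """Lazily materialised transition model for one simulation: a row theta_{s,a}
    is sampled from the posterior only when the pair (s, a) is first visited."""

    def __init__(self, belief, rng):
        self.belief = belief   # e.g. the DirichletBelief sketched earlier (illustrative)
        self.rng = rng
        self.rows = {}         # (s, a) -> sampled categorical distribution over successors

    def next_state(self, s, a):
        if (s, a) not in self.rows:
            # Independence across state-action pairs means theta_{s,a} can be drawn
            # on its own, conditioned only on the counts for (s, a).
            self.rows[(s, a)] = self.rng.dirichlet(self.belief.counts[s, a])
        probs = self.rows[(s, a)]
        return self.rng.choice(len(probs), p=probs)
```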
38 4 Rollout Policy Learning The choice of rollout policy πro is important if simulations are few, especially if the domain does not display substantial locality or if rewards require a carefully selected sequence of actions to be obtained. [sent-105, score-0.609]
39 A rollout policy in BAMCP following Qro could therefore over-exploit. [sent-111, score-0.357]
40 Instead, similar to [13], we select an ε-greedy policy with respect to Qro as our rollout policy πro. [sent-112, score-0.497]
41 This method provides valuable direction for the rollout policy at negligible computational cost. [sent-114, score-0.357]
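A minimal sketch of such an ε-greedy rollout policy with respect to a learned Qro table; how Qro itself is estimated during search is not shown, and the table-based interface is an illustrative assumption.

```python
def make_epsilon_greedy_rollout(q_ro, n_actions, epsilon, rng):
    """Return a rollout policy pi_ro(s) that is epsilon-greedy with respect to Q_ro:
    a uniformly random action with probability epsilon, otherwise the greedy action
    under the learned estimate (falling back to random for unseen states)."""
    def policy(state):
        if state not in q_ro or rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(q_ro[state].argmax())
    return policy
```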
42 More complex rollout policies can be considered, for example rollout policies that depend on the sampled model P i . [sent-115, score-0.469]
43 Given a suitably chosen exploration constant c (e.g., c > Rmax/(1−γ)), from state ⟨st, ht⟩, BAMCP constructs a value function at the root node that converges in probability to an ε-optimal value function, V(⟨st, ht⟩) →p V*_ε(⟨st, ht⟩), where ε = ε0/(1−γ). [sent-122, score-0.717]
44 Moreover, for large enough N(⟨st, ht⟩), the bias of V(⟨st, ht⟩) decreases as O(log(N(⟨st, ht⟩))/N(⟨st, ht⟩)). [sent-123, score-0.728]
45 Given limited space, we do not provide a comprehensive list of planning algorithms for MDP exploration, but rather concentrate on related sample-based algorithms for Bayesian RL. [sent-127, score-0.187]
46 Sparse sampling [15] is a sample-based tree search algorithm. [sent-137, score-0.302]
47 The key idea is to sample successor nodes from each state, and apply a Bellman backup to update the value of the parent node from the values of the child nodes. [sent-138, score-0.168]
48 The tree is expanded non-uniformly according to the sampled trajectories. [sent-141, score-0.217]
49 At each decision node, a promising action is selected using Thompson sampling. [sent-142, score-0.202]
50 At each chance node, a successor belief-state is sampled from the transition dynamics of the belief-state MDP. [sent-145, score-0.191]
51 Although they described their algorithm as Monte-Carlo tree search, it in fact uses a Bellman backup rather than Monte-Carlo evaluation. [sent-147, score-0.183]
52 The tree is expanded non-uniformly according to the sampled trajectories, albeit using a different method for action selection. [sent-150, score-0.32]
53 At each decision node, a promising action is selected by maximising the upper bound on value. [sent-151, score-0.163]
54 Bayesian Exploration Bonus (BEB) solves the posterior mean MDP, but with an additional reward bonus that depends on visitation counts [17]. [sent-153, score-0.225]
55 Sorg et al. propose an algorithm with a different form of exploration bonus [21]. [sent-155, score-0.201]
56 However, behavior in the early steps of exploration is very sensitive to the precise exploration bonuses; and it turns out to be hard to translate sophisticated prior knowledge into the form of a bonus. [sent-157, score-0.312]
57 For each algorithm, we report the mean sum of rewards and confidence interval for the best performing parameter within a reasonable planning time limit (0. [sent-159, score-0.273]
58 For BAMCP, this simply corresponds to the number of simulations that achieve a planning time just under the imposed limit. [sent-162, score-0.245]
59 The algorithm was run for different numbers of simulations (10 to 10000) to span different planning times. [sent-186, score-0.245]
60 The depth of search was set to 15 in all domains except for the larger grid and the maze domain, where it was set to 50. [sent-197, score-0.477]
61 The Double-loop domain is a 9-state deterministic MDP with 2 actions [8]; 1000 steps are executed in this domain. [sent-202, score-0.164]
62 Grid5 is a 5 × 5 grid with no reward anywhere except for a reward state opposite to the reset state. [sent-203, score-0.284]
63 Dearden’s Maze is a 264-state maze with 3 flags to collect [8]. [sent-207, score-0.213]
64 A special reward state gives the number of flags collected since the last visit as reward; 20000 steps are executed in this domain. [sent-208, score-0.228]
65 In fact, we are optimising a different criterion – the discounted reward from the start state – and so we might expect this evaluation to be unfavourable to our algorithm. [sent-211, score-0.188]
66 For the grids and the maze domain, the algorithms were run with a sparse Dirichlet-Multinomial model, as described in [11]. [sent-214, score-0.213]
67 For both of these models, efficient collapsed sampling schemes are available; they are employed for the BA-UCT and BFS3 algorithms in our experiments to compress the posterior parameter sampling and the transition sampling into a single transition sampling step. [sent-215, score-0.434]
68 This considerably reduces the cost of belief updates inside the search tree when using these simple probabilistic models. [sent-216, score-0.313]
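For a Dirichlet-Multinomial model, such a collapsed scheme amounts to sampling the successor state directly from the posterior predictive, i.e. the normalised pseudo-counts, without first drawing an explicit transition matrix; a small sketch under that assumption, reusing the count layout from the earlier sketch.

```python
def collapsed_next_state(counts, s, a, rng):
    """Sample s' ~ P(s' | s, a, h) with the transition parameters integrated out:
    for a Dirichlet-Multinomial model the posterior predictive is the pseudo-count
    vector for (s, a) normalised to a probability distribution."""
    row = counts[s, a]
    return rng.choice(len(row), p=row / row.sum())
```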
69 Figure 1 reports the planning time/performance trade-off for the different algorithms on the Grid5 and Maze domains. [sent-220, score-0.187]
70 This is particularly evident for BEB, which required a different value of exploration bonus to achieve maximum performance in each domain. [sent-223, score-0.201]
71 BAMCP’s performance scales well as a function of planning time, as is evident in Figure 1. [sent-225, score-0.187]
72 BEB cannot take advantage of prolonged planning time at all. [sent-228, score-0.187]
73 BFS3 generally scales up with more planning time with an appropriate choice of parameters, but it is not obvious how to trade-off the branching factor, depth, and number of simulations in each domain. [sent-229, score-0.275]
74 BAMCP greatly benefited from our lazy sampling scheme in the experiments, providing a 35× speed improvement over the naive approach in the maze domain, for example; this is illustrated in Figure 1(c). [sent-230, score-0.293]
75 Footnote 2: the result reported for Dearden’s maze with the Bayesian DP algorithm in [22] is for a different version of the task in which the maze layout is given to the agent. [sent-231, score-0.213]
76 Figure 1 caption (fragment): performance on the Grid5 domain (a) and Maze domain (b-d) as a function of planning time. [sent-233, score-0.25]
77 On the Maze domain, performance of vanilla BA-UCT with and without rollout policy learning (RL). [sent-244, score-0.378]
78 On the Maze domain, performance of BAMCP with and without the lazy sampling (LS) and rollout policy learning (RL) presented in Sections 3.3 and 3.4. [sent-246, score-0.503]
80 Dearden’s maze aptly illustrates a major drawback of forward search sparse sampling algorithms such as BFS3. [sent-251, score-0.37]
81 As in many maze problems, all rewards are zero for at least k steps, where k is the solution length. [sent-252, score-0.299]
82 Without prior knowledge of the optimal solution length, all upper bounds will be higher than the true optimal value until the tree has been fully expanded up to depth k – even if a simulation happens to solve the maze. [sent-253, score-0.321]
83 In contrast, once BAMCP discovers a successful simulation, its Monte-Carlo evaluation will immediately bias the search tree towards the successful trajectory. [sent-254, score-0.236]
84 The probability of grid cell ij having a reward of 1 is p_i q_j; otherwise the reward is 0. [sent-258, score-0.226]
85 (i.e., observing a state transition provides information about other state transitions). [sent-263, score-0.175]
86 The performance during the first 200 steps in the environment is averaged over 50 sampled environments (5 runs for each sample) and is reported both in terms of undiscounted (left) and discounted (right) sum of rewards. [sent-268, score-0.18]
87 Other approaches (e.g., BA-UCT, BFS3) can deal with the state space, but not the large belief space: at every node of the search tree they must solve an approximate inference problem to estimate the posterior beliefs. [sent-278, score-0.465]
88 In contrast, BAMCP limits the posterior inference to the root of the search tree and is not directly affected by the size of the state space or belief space, which allows the algorithm to perform well even with a limited planning time. [sent-279, score-0.633]
89 Note that lazy sampling is required in this setup since a full sample of the dynamics involves infinitely many parameters. [sent-280, score-0.214]
90 Figure 2 (and Figure S5) demonstrates the planning performance of BAMCP in this complex domain. [sent-281, score-0.187]
91 Performance improves with additional planning time, and the quality of the prior clearly affects the agent’s performance. [sent-282, score-0.233]
92 It is possible to construct malicious environments, for example in which the optimal policy is hidden in a generally low reward region of the tree, where UCT can be misled for long periods [7]. [sent-286, score-0.233]
93 Second, the UCT algorithm treats every action node as a multi-armed bandit problem. [sent-287, score-0.212]
94 However, there is no actual benefit to accruing reward during planning, and so it is in theory more appropriate to use pure exploration bandits [4]. [sent-288, score-0.259]
95 If this prior knowledge matches the class of environments that the agent will encounter, then exploration could be significantly accelerated. [sent-293, score-0.279]
96 The main idea is to employ Monte-Carlo tree search to explore the augmented Bayes-adaptive search space efficiently. [sent-297, score-0.351]
97 The naive implementation of that idea is the proposed BA-UCT algorithm, which cannot scale for most priors due to expensive belief updates inside the search tree. [sent-298, score-0.191]
98 A Bayesian sampling approach to exploration in reinforcement learning. [sent-307, score-0.245]
99 The grand challenge of computer Go: Monte Carlo tree search and extensions. [sent-370, score-0.236]
100 A sparse sampling algorithm for near-optimal planning in large Markov decision processes. [sent-390, score-0.286]
wordName wordTfidf (topN-words)
[('bamcp', 0.639), ('rollout', 0.217), ('maze', 0.213), ('uct', 0.213), ('planning', 0.187), ('bamdp', 0.18), ('mdp', 0.164), ('sboss', 0.148), ('tree', 0.145), ('policy', 0.14), ('beb', 0.13), ('exploration', 0.121), ('ht', 0.105), ('action', 0.103), ('rl', 0.1), ('reward', 0.093), ('search', 0.091), ('rewards', 0.086), ('agent', 0.081), ('bonus', 0.08), ('lazy', 0.08), ('st', 0.077), ('ro', 0.075), ('boss', 0.072), ('bayesian', 0.071), ('dynamics', 0.068), ('simulation', 0.067), ('node', 0.066), ('sampling', 0.066), ('qro', 0.066), ('domain', 0.063), ('dearden', 0.063), ('ba', 0.06), ('transition', 0.059), ('reinforcement', 0.058), ('state', 0.058), ('simulations', 0.058), ('undiscounted', 0.053), ('belief', 0.053), ('rs', 0.052), ('posterior', 0.052), ('asmuth', 0.049), ('lazily', 0.049), ('root', 0.047), ('prior', 0.046), ('actions', 0.045), ('domains', 0.044), ('bandit', 0.043), ('grid', 0.04), ('traversed', 0.04), ('ls', 0.038), ('silver', 0.038), ('backup', 0.038), ('discounted', 0.037), ('expanded', 0.037), ('dp', 0.036), ('sampled', 0.035), ('nodes', 0.035), ('bellman', 0.034), ('decision', 0.033), ('castro', 0.033), ('gelly', 0.033), ('gittins', 0.033), ('mcts', 0.033), ('pomcp', 0.033), ('sorg', 0.033), ('mdps', 0.032), ('rmax', 0.032), ('executed', 0.032), ('environments', 0.031), ('branching', 0.03), ('simulate', 0.03), ('successor', 0.029), ('uncertainty', 0.028), ('return', 0.028), ('exploitation', 0.028), ('beliefs', 0.027), ('maximising', 0.027), ('rollouts', 0.027), ('depth', 0.026), ('kocsis', 0.025), ('cardinal', 0.025), ('augmented', 0.024), ('pure', 0.024), ('ags', 0.024), ('traverses', 0.024), ('szepesv', 0.024), ('steps', 0.024), ('inside', 0.024), ('discount', 0.023), ('priors', 0.023), ('beta', 0.022), ('effort', 0.022), ('bandits', 0.021), ('vanilla', 0.021), ('varied', 0.021), ('states', 0.021), ('arti', 0.021), ('visit', 0.021), ('dayan', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
Author: Arthur Guez, David Silver, Peter Dayan
Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayesoptimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration. 1
2 0.28899255 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress
Author: Manuel Lopes, Tobias Lang, Marc Toussaint, Pierre-yves Oudeyer
Abstract: Formal exploration approaches in model-based reinforcement learning estimate the accuracy of the currently learned model without consideration of the empirical prediction error. For example, PAC-MDP approaches such as R- MAX base their model certainty on the amount of collected data, while Bayesian approaches assume a prior over the transition dynamics. We propose extensions to such approaches which drive exploration solely based on empirical estimates of the learner’s accuracy and learning progress. We provide a “sanity check” theoretical analysis, discussing the behavior of our extensions in the standard stationary finite state-action case. We then provide experimental studies demonstrating the robustness of these exploration measures in cases of non-stationary environments or where original approaches are misled by wrong domain assumptions. 1
3 0.18672952 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning
Author: Dongho Kim, Kee-eung Kim, Pascal Poupart
Abstract: In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected longterm total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems. 1
4 0.16268642 348 nips-2012-Tractable Objectives for Robust Policy Optimization
Author: Katherine Chen, Michael Bowling
Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1
5 0.15630551 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume
Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward functfion. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1
6 0.14742452 38 nips-2012-Algorithms for Learning Markov Field Policies
7 0.1389816 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes
8 0.13773534 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
9 0.1340386 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
10 0.13381261 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning
11 0.12883498 344 nips-2012-Timely Object Recognition
12 0.12868324 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries
13 0.11586288 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
14 0.11571768 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions
15 0.10705162 51 nips-2012-Bayesian Hierarchical Reinforcement Learning
16 0.10621543 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning
17 0.10551166 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees
18 0.099776633 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model
19 0.099627368 160 nips-2012-Imitation Learning by Coaching
20 0.099195883 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
topicId topicWeight
[(0, 0.185), (1, -0.326), (2, -0.039), (3, -0.015), (4, -0.106), (5, -0.02), (6, -0.008), (7, -0.059), (8, -0.059), (9, 0.022), (10, -0.02), (11, -0.02), (12, 0.034), (13, 0.002), (14, -0.064), (15, -0.015), (16, -0.004), (17, 0.06), (18, 0.004), (19, 0.017), (20, -0.034), (21, 0.077), (22, 0.031), (23, 0.026), (24, 0.024), (25, 0.056), (26, 0.026), (27, -0.041), (28, 0.046), (29, 0.06), (30, -0.005), (31, -0.014), (32, -0.055), (33, -0.04), (34, -0.005), (35, 0.078), (36, 0.004), (37, 0.047), (38, 0.026), (39, -0.04), (40, 0.003), (41, 0.006), (42, 0.088), (43, -0.026), (44, -0.04), (45, -0.055), (46, 0.059), (47, 0.065), (48, -0.124), (49, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.94666296 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
Author: Arthur Guez, David Silver, Peter Dayan
Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayesoptimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration. 1
2 0.84936523 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning
Author: Dongho Kim, Kee-eung Kim, Pascal Poupart
Abstract: In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected longterm total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems. 1
3 0.82328945 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress
Author: Manuel Lopes, Tobias Lang, Marc Toussaint, Pierre-yves Oudeyer
Abstract: Formal exploration approaches in model-based reinforcement learning estimate the accuracy of the currently learned model without consideration of the empirical prediction error. For example, PAC-MDP approaches such as R- MAX base their model certainty on the amount of collected data, while Bayesian approaches assume a prior over the transition dynamics. We propose extensions to such approaches which drive exploration solely based on empirical estimates of the learner’s accuracy and learning progress. We provide a “sanity check” theoretical analysis, discussing the behavior of our extensions in the standard stationary finite state-action case. We then provide experimental studies demonstrating the robustness of these exploration measures in cases of non-stationary environments or where original approaches are misled by wrong domain assumptions. 1
4 0.80168122 51 nips-2012-Bayesian Hierarchical Reinforcement Learning
Author: Feng Cao, Soumya Ray
Abstract: We describe an approach to incorporating Bayesian priors in the MAXQ framework for hierarchical reinforcement learning (HRL). We define priors on the primitive environment model and on task pseudo-rewards. Since models for composite tasks can be complex, we use a mixed model-based/model-free learning approach to find an optimal hierarchical policy. We show empirically that (i) our approach results in improved convergence over non-Bayesian baselines, (ii) using both task hierarchies and Bayesian priors is better than either alone, (iii) taking advantage of the task hierarchy reduces the computational cost of Bayesian reinforcement learning and (iv) in this framework, task pseudo-rewards can be learned instead of being manually specified, leading to hierarchically optimal rather than recursively optimal policies. 1
5 0.77185565 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning
Author: Felipe Trevizan, Manuela Veloso
Abstract: Probabilistic planning captures the uncertainty of plan execution by probabilistically modeling the effects of actions in the environment, and therefore the probability of reaching different states from a given state and action. In order to compute a solution for a probabilistic planning problem, planners need to manage the uncertainty associated with the different paths from the initial state to a goal state. Several approaches to manage uncertainty were proposed, e.g., consider all paths at once, perform determinization of actions, and sampling. In this paper, we introduce trajectory-based short-sighted Stochastic Shortest Path Problems (SSPs), a novel approach to manage uncertainty for probabilistic planning problems in which states reachable with low probability are substituted by artificial goals that heuristically estimate their cost to reach a goal state. We also extend the theoretical results of Short-Sighted Probabilistic Planner (SSiPP) [1] by proving that SSiPP always finishes and is asymptotically optimal under sufficient conditions on the structure of short-sighted SSPs. We empirically compare SSiPP using trajectorybased short-sighted SSPs with the winners of the previous probabilistic planning competitions and other state-of-the-art planners in the triangle tireworld problems. Trajectory-based SSiPP outperforms all the competitors and is the only planner able to scale up to problem number 60, a problem in which the optimal solution contains approximately 1070 states. 1
6 0.74916381 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning
7 0.71367264 348 nips-2012-Tractable Objectives for Robust Policy Optimization
8 0.66454297 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
9 0.65546566 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization
10 0.61387765 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
11 0.60622829 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model
12 0.60617262 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries
13 0.60567898 38 nips-2012-Algorithms for Learning Markov Field Policies
14 0.6046645 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
15 0.60264206 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes
16 0.59700131 45 nips-2012-Approximating Equilibria in Sequential Auctions with Incomplete Information and Multi-Unit Demand
17 0.58663619 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees
18 0.58419329 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds
19 0.57035941 364 nips-2012-Weighted Likelihood Policy Search with Model Selection
20 0.54485971 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method
topicId topicWeight
[(0, 0.032), (21, 0.036), (36, 0.013), (38, 0.103), (42, 0.028), (53, 0.017), (54, 0.1), (55, 0.01), (64, 0.226), (74, 0.066), (76, 0.103), (80, 0.096), (91, 0.025), (92, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.77213019 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
Author: Arthur Guez, David Silver, Peter Dayan
Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayesoptimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration. 1
2 0.74146795 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging
Author: Chris Hinrichs, Vikas Singh, Jiming Peng, Sterling Johnson
Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. Existing methods neither regularize nor exploit potentially useful information pertaining to how kernels in the input set ‘interact’; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function Q 0, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. Here, our new model outperforms the state of the art (p-values 10−3 ). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity). 1
3 0.68963599 147 nips-2012-Graphical Models via Generalized Linear Models
Author: Eunho Yang, Genevera Allen, Zhandong Liu, Pradeep K. Ravikumar
Abstract: Undirected graphical models, also known as Markov networks, enjoy popularity in a variety of applications. The popular instances of these models such as Gaussian Markov Random Fields (GMRFs), Ising models, and multinomial discrete models, however do not capture the characteristics of data in many settings. We introduce a new class of graphical models based on generalized linear models (GLMs) by assuming that node-wise conditional distributions arise from exponential families. Our models allow one to estimate multivariate Markov networks given any univariate exponential distribution, such as Poisson, negative binomial, and exponential, by fitting penalized GLMs to select the neighborhood for each node. A major contribution of this paper is the rigorous statistical analysis showing that with high probability, the neighborhood of our graphical models can be recovered exactly. We also provide examples of non-Gaussian high-throughput genomic networks learned via our GLM graphical models. 1
4 0.67764038 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction
Author: Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, Anatole V. Lilienfeld, Klaus-Robert Müller
Abstract: The accurate prediction of molecular energetics in chemical compound space is a crucial ingredient for rational compound design. The inherently graph-like, non-vectorial nature of molecular data gives rise to a unique and difficult machine learning problem. In this paper, we adopt a learning-from-scratch approach where quantum-mechanical molecular energies are predicted directly from the raw molecular geometry. The study suggests a benefit from setting flexible priors and enforcing invariance stochastically rather than structurally. Our results improve the state-of-the-art by a factor of almost three, bringing statistical methods one step closer to chemical accuracy. 1
5 0.67033339 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data
Author: James Lloyd, Peter Orbanz, Zoubin Ghahramani, Daniel M. Roy
Abstract: A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases. 1
6 0.66956645 344 nips-2012-Timely Object Recognition
7 0.6668058 27 nips-2012-A quasi-Newton proximal splitting method
8 0.66364777 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions
9 0.65345496 221 nips-2012-Multi-Stage Multi-Task Feature Learning
10 0.65234822 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
11 0.64568359 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk
12 0.64309049 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
13 0.63977146 38 nips-2012-Algorithms for Learning Markov Field Policies
14 0.63930404 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
15 0.63813782 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress
16 0.6349358 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
17 0.63363826 331 nips-2012-Symbolic Dynamic Programming for Continuous State and Observation POMDPs
18 0.63259059 160 nips-2012-Imitation Learning by Coaching
19 0.63118589 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning
20 0.63023996 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning