nips nips2012 nips2012-350 knowledge-graph by maker-knowledge-mining

350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning


Source: pdf

Author: Felipe Trevizan, Manuela Veloso

Abstract: Probabilistic planning captures the uncertainty of plan execution by probabilistically modeling the effects of actions in the environment, and therefore the probability of reaching different states from a given state and action. In order to compute a solution for a probabilistic planning problem, planners need to manage the uncertainty associated with the different paths from the initial state to a goal state. Several approaches to manage uncertainty were proposed, e.g., consider all paths at once, perform determinization of actions, and sampling. In this paper, we introduce trajectory-based short-sighted Stochastic Shortest Path Problems (SSPs), a novel approach to manage uncertainty for probabilistic planning problems in which states reachable with low probability are substituted by artificial goals that heuristically estimate their cost to reach a goal state. We also extend the theoretical results of Short-Sighted Probabilistic Planner (SSiPP) [1] by proving that SSiPP always finishes and is asymptotically optimal under sufficient conditions on the structure of short-sighted SSPs. We empirically compare SSiPP using trajectory-based short-sighted SSPs with the winners of the previous probabilistic planning competitions and other state-of-the-art planners in the triangle tireworld problems. Trajectory-based SSiPP outperforms all the competitors and is the only planner able to scale up to problem number 60, a problem in which the optimal solution contains approximately 10^70 states. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Probabilistic planning captures the uncertainty of plan execution by probabilistically modeling the effects of actions in the environment, and therefore the probability of reaching different states from a given state and action. [sent-5, score-0.458]

2 In order to compute a solution for a probabilistic planning problem, planners need to manage the uncertainty associated with the different paths from the initial state to a goal state. [sent-6, score-0.584]

3 We empirically compare SSiPP using trajectory-based short-sighted SSPs with the winners of the previous probabilistic planning competitions and other state-of-the-art planners in the triangle tireworld problems. [sent-12, score-0.66]

4 Trajectory-based SSiPP outperforms all the competitors and is the only planner able to scale up to problem number 60, a problem in which the optimal solution contains approximately 10^70 states. [sent-13, score-0.194]

5 1 Introduction The uncertainty of plan execution can be modeled by using probabilistic effects in actions, and therefore the probability of reaching different states from a given state and action. [sent-14, score-0.295]

6 This search space, defined by the probabilistic paths from the initial state to a goal state, challenges the scalability of planners. [sent-15, score-0.187]

7 Planners manage the uncertainty by choosing a search strategy to explore the space. [sent-16, score-0.178]

8 In this work, we present a novel approach to manage uncertainty for probabilistic planning problems that improves its scalability while still being optimal. [sent-17, score-0.347]

9 One approach to manage uncertainty while searching for the solution of probabilistic planning problems is to consider the complete search space at once. [sent-18, score-0.379]

10 Examples of such algorithms are value iteration and policy iteration [2]. [sent-19, score-0.119]

11 That is, a universal mapping function from every state to the optimal action that leads to a goal state. [sent-22, score-0.16]

12 Assuming the model correctly captures the cost and uncertainty of the actions in the environment, closed policies are extremely powerful as their execution never “fails,” and the planner never needs to be re-invoked. [sent-23, score-0.388]

13 Value iteration based probabilistic planners can be improved by combining asynchronous updates and heuristic search [3–7]. [sent-25, score-0.291]

14 Although these techniques allow planners to compute compact policies, in the worst case, these policies are still linear in the size of the state space, which itself can be exponential in the size of the state or goals. [sent-26, score-0.293]

15 Another approach to manage uncertainty is basically to ignore uncertainty during planning, i.e., through determinization of actions. [sent-27, score-0.196]

16 Besides the action space simplification, uncertainty management can be performed by simplifying the problem horizon. [sent-34, score-0.099]

17 The state space can also be simplified to manage uncertainty in probabilistic planning. [sent-38, score-0.25]

18 EP computes an initial partial policy π and then prunes all the states not considered by π. [sent-40, score-0.201]

19 In this paper, we introduced trajectory-based short-sighted Stochastic Shortest Path Problems (SSPs), a novel model to manage uncertainty in probabilistic planning problems. [sent-44, score-0.329]

20 Trajectory-based short-sighted SSPs manage uncertainty by pruning the state space based on the most likely trajectory between states and defining artificial goal states that guide the solution towards the original goal. [sent-45, score-0.438]

21 A policy π defined only for the states reachable from s0 when following π is a closed policy w.r.t. s0. [sent-57, score-0.371]

22 For instance, in the SSP depicted in Figure 1(a), the policy π0 = {(s0, a0), (s′1, a0), (s′2, a0), (s′3, a0)} is a closed policy w.r.t. s0. [sent-61, score-0.26]

23 Given a policy π, we define a trajectory as a sequence Tπ = s(0), . . . [sent-65, score-0.155]

24 The probability of a trajectory Tπ is defined as P(Tπ) = ∏_{i=0}^{|Tπ|} P(s(i+1) | s(i), π(s(i))), and the maximum probability of a trajectory between two states, Pmax(s, s′), is defined as maxπ P(Tπ = s, . . . , s′). [sent-69, score-0.1]
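
As a concrete illustration of the Pmax definition above, the following is a minimal sketch (not code from the paper) of how the maximum trajectory probability from a state s0 to every other state could be computed with a Dijkstra-style search that maximizes the product of transition probabilities; the `transitions` dictionary layout and the assumption that states are hashable (and orderable for heap tie-breaking) are made only for this example.

```python
# Illustrative sketch: Pmax(s0, s') for all s' via a best-first search that
# maximizes the product of transition probabilities along a trajectory.
# `transitions[s][a]` is assumed to map each successor s' to P(s'|s, a).
import heapq

def max_trajectory_probs(s0, transitions):
    best = {s0: 1.0}                       # best[s'] = Pmax(s0, s') found so far
    frontier = [(-1.0, s0)]                # max-heap via negated probabilities
    while frontier:
        neg_p, s = heapq.heappop(frontier)
        p_s = -neg_p
        if p_s < best.get(s, 0.0):
            continue                       # stale queue entry
        for outcomes in transitions.get(s, {}).values():
            for s_next, p in outcomes.items():
                cand = p_s * p
                if cand > best.get(s_next, 0.0):
                    best[s_next] = cand
                    heapq.heappush(frontier, (-cand, s_next))
    return best
```

Because every transition probability is at most 1, the product along a trajectory never increases, so this best-first expansion computes Pmax exactly.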

25 An optimal policy π ∗ for an SSP is any policy that always reaches a goal state when followed from s0 and also minimizes the expected cost of Tπ∗ . [sent-73, score-0.418]

26 The optimal value function V∗, i.e., the mapping from states to the minimum expected cost to reach a goal state, is unique. [sent-76, score-0.125]

27 The initial state is s0, the goal state is sG, and C(s, a, s′) = 1, ∀s ∈ S, a ∈ A and s′ ∈ S. [sent-131, score-0.171]

28 Given an SSP S, if a goal state can be reached with positive probability from every state s ∈ S, then the reachability assumption (Definition 1) holds for S and 0 ≤ V ∗ (s) < ∞ [19]. [sent-137, score-0.265]

29 Once V ∗ is known, any optimal policy π ∗ can be extracted from V ∗ by substituting the operator min by argmin in equation (1). [sent-138, score-0.137]
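
As a hedged sketch of this procedure (the interface below is an assumption made for illustration, not the paper's code), equation (1) corresponds to a Bellman backup over expected cost-to-go, and the greedy policy is obtained by swapping min for argmin:

```python
# Minimal value iteration for an SSP and greedy policy extraction.
# Assumed interface: `transitions[s][a]` maps successors to P(s'|s, a),
# `cost(s, a, sp)` returns C(s, a, s'), and `goals` is the goal set G.
def q_value(V, s, a, transitions, cost):
    return sum(p * (cost(s, a, sp) + V[sp]) for sp, p in transitions[s][a].items())

def value_iteration(states, goals, transitions, cost, epsilon=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        residual = 0.0
        for s in states:
            if s in goals:
                continue                                   # V*(goal) = 0
            new_v = min(q_value(V, s, a, transitions, cost) for a in transitions[s])
            residual = max(residual, abs(new_v - V[s]))
            V[s] = new_v
        if residual < epsilon:
            return V

def greedy_policy(V, states, goals, transitions, cost):
    # "Substituting min by argmin": pick the action minimizing the Q-value.
    return {s: min(transitions[s], key=lambda a: q_value(V, s, a, transitions, cost))
            for s in states if s not in goals}
```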

30 Given an SSP S = ⟨S, s0, G, A, P, C⟩, a state s ∈ S, ρ ∈ [0, 1] and a heuristic H, the (s, ρ)-trajectory-based short-sighted SSP Ss,ρ = ⟨Ss,ρ, s, Gs,ρ, A, P, Cs,ρ⟩ associated with S is defined as: • Ss,ρ = {s′ ∈ S | ∃ŝ ∈ S and a ∈ A s.t. Pmax(s, ŝ) ≥ ρ and P(s′|ŝ, a) > 0}. [sent-164, score-0.098]

31 Our novel model, the trajectory-based short-sighted SSPs (Definition 4), addresses the issue of states with low trajectory probability by explicitly defining its state space Ss,ρ based on the maximum probability of a trajectory between s and the candidate states s′ (Pmax(s, s′)). [sent-167, score-0.26]

32 This example shows how trajectory-based short-sighted SSPs can manage uncertainty efficiently: for ρ = 0. [sent-172, score-0.146]

33 Notice that the definition of Ss,ρ cannot be simplified to {ŝ ∈ S | Pmax(s, ŝ) ≥ ρ} since not all the resulting states of actions would be included in Ss,ρ. [sent-174, score-0.132]
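
To make Definition 4 concrete, here is a hedged sketch of the construction (it reuses `max_trajectory_probs` from the earlier sketch; treating the cost of artificial goals as the heuristic value H is one plausible encoding of Gs,ρ and Cs,ρ, not the paper's exact code):

```python
# Sketch of the (s, rho)-trajectory-based short-sighted state space: keep
# every state with Pmax(s, .) >= rho plus the one-step successors of such
# states; successors whose Pmax falls below rho (and are not original goals)
# become artificial goals whose terminal cost is given by the heuristic H.
def trajectory_based_short_sighted(s, rho, transitions, goals, heuristic):
    pmax = max_trajectory_probs(s, transitions)
    core = {st for st, p in pmax.items() if p >= rho}
    space = set(core)
    for st in core:                                  # add one-step successors
        for outcomes in transitions.get(st, {}).values():
            space.update(outcomes)                   # dict keys are successor states
    artificial_goals = {st for st in space if st not in core and st not in goals}
    short_goals = (goals & space) | artificial_goals
    terminal_cost = {st: (0.0 if st in goals else heuristic(st)) for st in short_goals}
    return space, short_goals, terminal_cost
```

This mirrors the remark above: pruning at Pmax ≥ ρ alone would drop some action outcomes, so the one-step successors are added explicitly.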

34 Due to the reduced size of the short-sighted problems, SSiPP solves each of them by computing a closed policy w.r.t. its initial state. [sent-181, score-0.141]

35 Alternatively, an anytime behavior is obtained if the execution of the computed closed policy for the short-sighted SSP is simulated (Algorithm 1 line 4) until an artificial goal sa is reached and this procedure is repeated, starting from sa, until convergence or an interruption. [sent-186, score-0.267]
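
The anytime loop just described can be summarized by the following sketch (function names such as `generate_short_sighted`, `optimal_ssp_solver` and `sample_next_state` are placeholders for this illustration, not the paper's API):

```python
# One simulated execution of SSiPP: repeatedly build a short-sighted SSP at
# the current state, solve it optimally, and follow its closed policy until
# a goal of the sub-problem is reached; stop once an original goal is reached.
def ssipp_episode(s0, goals, rho, generate_short_sighted, optimal_ssp_solver,
                  sample_next_state):
    s = s0
    while s not in goals:                         # original goal not reached yet
        sub_ssp = generate_short_sighted(s, rho)  # e.g. Definition 4
        policy = optimal_ssp_solver(sub_ssp)      # closed policy for the sub-problem
        while s not in sub_ssp.goals:             # simulate until an artificial goal
            a = policy[s]
            s = sample_next_state(s, a)
    return s
```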

36 In [1], we proved that SSiPP always terminates and is asymptotically optimal for depth-based short-sighted SSPs. [sent-187, score-0.132]

37 OPTIMAL-SSP-SOLVER returns an optimal policy π∗ w.r.t. the given SSP. [sent-200, score-0.137]

38 Given an SSP S = ⟨S, s0, G, A, P, C⟩ such that the reachability assumption holds, an admissible heuristic H and a short-sighted problem generator that respects Definition 5, then SSiPP always terminates. [sent-213, score-0.187]

39 Since OPTIMAL-SSP-SOLVER always finishes and the short-sighted SSP is an SSP by definition, a goal state sG of the short-sighted SSP is always reached; therefore the loop in line 3 of Algorithm 1 always finishes. [sent-215, score-0.15]

40 That is, sG differs from the state s used as the initial state for the short-sighted SSP generation. [sent-219, score-0.138]

41 Suppose, for the purpose of contradiction, that every goal state reached during SSiPP execution is an artificial goal, i.e., not a goal state of the original SSP. [sent-221, score-0.186]

42 Thus every execution of SSiPP reaches a goal state s′ ∈ G and therefore terminates. [sent-228, score-0.186]

43 Clearly, S(π ∗ , s0 ) ⊆ S∗ since a partial policy cannot be executed ad infinitum without reaching a state in which it is not defined. [sent-233, score-0.23]

44 Since SSiPP performs Bellman updates in the original SSP space (Lemma 2) and every execution of SSiPP terminates (Theorem 3), we can view the sequence of lower bounds V^0, V^1, · · · , V^t generated by SSiPP as asynchronous value iteration. [sent-234, score-0.114]

45 Figure 2: (a) Map of the triangle tireworld for the sizes 1, 2 and 3. [sent-238, score-0.278]

46 Circles (squares) represent locations in which there is one (no) spare tire. [sent-239, score-0.121]

47 The shades of gray represent, for each location l, maxπ P (car reaches l and the tire is not flat when following the policy π from s0 ). [sent-240, score-0.313]

48 (b) Log-lin plot of the state space size (|S|) and the number of states reachable from s0 when following the optimal policy π∗ (|S(π∗, s0)|) versus the triangle tireworld problem number. [sent-241, score-0.586]

49 5 Experiments We present two sets of experiments using the triangle tireworld problems [9, 11, 20], a series of probabilistically interesting problems [14] in which a car has to travel between locations in order to reach a goal location from its initial location. [sent-242, score-0.594]

50 The roads are represented as a directed graph in the shape of a triangle and, every time the car moves between locations, a flat tire happens with probability 0.5. [sent-243, score-0.255]

51 Some locations have a spare tire and in these locations the car can deterministically replace its flat tire by a new one. [sent-245, score-0.41]

52 When the car has a flat tire, it cannot change its location; therefore, the car can get stuck in locations that do not have a spare tire (dead-ends). [sent-246, score-0.341]

53 Figure 2(a) depicts the map of the triangle tireworld problems 1, 2 and 3 and Figure 2(b) shows the size of S and S(π ∗ , s0 ) for problems up to size 60. [sent-247, score-0.314]

54 With 28 nodes in the corresponding graph in Figure 2(a), its state space has 19562 states and its optimal policy reaches 8190 states. [sent-250, score-0.33]

55 Every triangle tireworld problem is a probabilistically interesting problem [14]: there is only one policy that reaches the goal with probability 1 and all the other policies have probability at most 0.5. [sent-251, score-0.553]

56 This property is illustrated by the shades of gray in Figure 2(a), which represent, for each location l, maxπ P(car reaches l and the tire is not flat when following the policy π from s0). [sent-255, score-0.313]

57 For UCT, we disabled the random rollouts because the probability of any policy other than the optimal policy to reach a dead-end is at least 0.5. [sent-260, score-0.284]

58 The solution computed during one round is simulated by MDPSIM in a client-server loop: MDPSIM sends a state s and requests an action from the planner, and the planner replies by sending the action a to be executed in s. [sent-267, score-0.381]

59 The evaluation is based on the number of rounds simulated by MDPSIM that reached a goal state. [sent-268, score-0.169]

60 The maximum number of actions allowed per round is 2000, and rounds that exceed this limit are stopped by MDPSIM and declared as failures, i.e., as rounds that did not reach a goal state. [sent-269, score-0.184]
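
The evaluation protocol described in the last few sentences can be mirrored locally by a loop of the following shape (a stand-in for the real MDPSIM client-server exchange; `env` and `planner` are assumed interfaces for this sketch):

```python
# Count how many of the 50 rounds reach a goal within the 2000-action limit.
def evaluate(planner, env, rounds=50, max_actions=2000):
    solved = 0
    for _ in range(rounds):
        s = env.reset()                  # server sends the initial state
        for _ in range(max_actions):
            a = planner.act(s)           # planner replies with an action
            s, at_goal = env.step(a)     # environment samples the successor
            if at_goal:
                solved += 1
                break
        # a round that exhausts max_actions is declared a failure
    return solved
```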

61 Table 1: Number of rounds solved out of 50 for the experiment in Section 5.1. [sent-308, score-0.102]

62 Table 2: Number of rounds solved out of 50 for the experiment in Section 5.2. [sent-419, score-0.102]

63 Formally, to decide what action to apply in a given state s, each planner is allowed to use at most B = |Ss,t| search nodes, i.e., the size of the depth-based short-sighted SSP Ss,t. [sent-430, score-0.317]

64 We choose t = 8 since it obtains the best performance in the triangle tireworld problems [1]. [sent-433, score-0.296]

65 Given the search nodes budget B, for UCT we sample the environment until the search tree contains B nodes; and for trajectory-based SSiPP we use ρ = argmaxρ {|Ss,ρ| s.t. |Ss,ρ| ≤ B}. [sent-434, score-0.105]
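
One simple way to realize this budget-matched choice of ρ (an assumption for illustration; the paper does not spell out the search procedure here) is to sweep candidate thresholds from small to large and keep the first one whose short-sighted state space fits within B, since |Ss,ρ| shrinks as ρ grows:

```python
# Pick rho so that the trajectory-based short-sighted SSP is as large as
# possible without exceeding the search-node budget B.  The candidate list
# is illustrative; `trajectory_based_short_sighted` is the earlier sketch.
def pick_rho(s, budget, transitions, goals, heuristic,
             candidates=(0.0625, 0.125, 0.25, 0.5)):
    for rho in candidates:                       # smallest rho => largest space
        space, _, _ = trajectory_based_short_sighted(s, rho, transitions,
                                                     goals, heuristic)
        if len(space) <= budget:
            return rho
    return candidates[-1]                        # fall back to the tightest pruning
```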

66 The methodology for this experiment is as follows: for each problem, 10 runs of 50 rounds are performed for each planner using the search nodes budget B. [sent-437, score-0.354]

67 We set the time and memory cut-offs to 8 hours and 8 GB, respectively, and UCT on problems 35 to 60 was the only planner preempted by the time cut-off. [sent-439, score-0.194]

68 Trajectory-based SSiPP outperforms both depth-based SSiPP and UCT, solving all the 50 rounds in all the 10 runs for all the problems. [sent-440, score-0.111]

69 5.2 Fixed maximum planning time In this experiment, we compare planners by limiting the maximum planning time. [sent-442, score-0.422]

70 The methodology used in this experiment is similar to the one in IPPC’04 and IPPC’06: for each problem, planners need to solve 1 run of 50 rounds in 20 minutes. [sent-443, score-0.246]

71 For this experiment, the planners are allowed to perform internal simulations, for instance, a planner can spend 15 minutes solving rounds using internal simulations and then use the computed policy to solve the required 50 rounds through MDPSIM in the remaining 5 minutes. [sent-444, score-0.611]

72 The winners of IPPC’04, IPPC’06 and IPPC’08 are omitted since their performance on the triangle tireworld problems is strictly dominated by depth-based SSiPP for t = 8. [sent-451, score-0.327]

73 All four parametrizations of trajectory-based SSiPP outperform the other planners for problems of size equal to or greater than 45. [sent-453, score-0.193]

74 That is, it reaches a goal state in all the 50 rounds in all the 10 runs for all the problems. [sent-457, score-0.254]

75 Trajectory-based SSiPP with ρ = 0.125 reaches the 20-minute time cut-off before solving 50 rounds; however, all the solved rounds successfully reach the goal. [sent-461, score-0.164]

76 This interesting behavior of trajectory-based SSiPP for the triangle tireworld can be explained by the following theorem: Theorem 5. [sent-462, score-0.278]

77 For the triangle tireworld, trajectory-based SSiPP using an admissible heuristic never falls in a dead-end for ρ ∈ (0. [sent-463, score-0.183]

78 The optimal policy for the triangle tireworld is to follow the longest path: move from the initial location l0 to the goal location lG passing through location lc , where l0 , lc and lG are the vertices of the triangle formed by the problem’s map. [sent-470, score-0.809]

79 That is, there is only one applicable move-car action for all the locations in this path. [sent-473, score-0.099]

80 Therefore all the decision making to find the optimal policy happens between the locations l0 and lc . [sent-474, score-0.24]

81 Each location l′ in the path from l0 to lc has either two or three applicable move-car actions and we refer to the set of locations l′ with three applicable move-car actions as N. [sent-475, score-0.312]

82 Every location l′ ∈ N is reachable from l0 by applying an even number of move-car actions (Figure 2(a)) and the three applicable move-car actions in l′ are: (i) the optimal action ac, i.e., [sent-476, score-0.339]

83 move the car towards lc; (ii) the action aG that moves the car towards lG; and (iii) the action ap that moves the car parallel to the shortest path from l0 to lG. [sent-478, score-0.384]

84 The location reached by ap does not have a spare tire; therefore ap is never selected by a greedy choice over any admissible heuristic since it reaches a dead-end with probability 0.5. [sent-479, score-0.326]

85 The locations reached by applying either ac or aG have a spare tire and the greedy choice between them depends on the admissible heuristic used, thus aG might be selected instead of ac . [sent-481, score-0.436]

86 However, after applying aG , only one move-car action a is available and it reaches a location that does not have a spare tire. [sent-482, score-0.21]

87 Therefore, the greedy choice between ac and aG considering two or more move-car actions is optimal under any admissible heuristic: every sequence of actions aG, a, . . . [sent-483, score-0.246]

88 At least one sequence of actions starting with ac has probability 0 to reach a dead-end. [sent-487, score-0.145]

89 Given ρ, we denote as Ls,ρ the set of all locations corresponding to states in Ss,ρ and as ls the location corresponding to the state s. [sent-490, score-0.262]

90 Thus, Ls,ρ contains all the locations reachable from ls using up to m = ⌊log0.5 ρ⌋ move-car actions. [sent-491, score-0.145]
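
Assuming, as the formula above suggests, that each move-car action multiplies the maximum trajectory probability by 0.5, the implied look-ahead depth can be checked with a few lines (illustrative only):

```python
# m = floor(log_0.5(rho)): the largest number of move-car actions m whose
# maximum trajectory probability 0.5**m still satisfies 0.5**m >= rho.
for rho in (0.5, 0.25, 0.125, 0.0625):
    m = 0
    while 0.5 ** (m + 1) >= rho:
        m += 1
    print(rho, m)        # prints 0.5 1, 0.25 2, 0.125 3, 0.0625 4
```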

91 If m is even and ls ∈ N, then every location in Ls,ρ ∩ N represents a state either in Gs,ρ or at least two move-car actions away from any state in Gs,ρ . [sent-493, score-0.276]

92 Therefore the solution of the (s, ρ)-trajectory-based short-sighted SSP only chooses the action ac to move the car. [sent-494, score-0.113]

93 Also, since m is even, every state s used by SSiPP for generating (s, ρ)-trajectory-based short-sighted SSPs has ls ∈ N. [sent-495, score-0.108]

94 Hence, trajectory-based SSiPP always chooses the action ac to move the car to lc, thus avoiding all dead-ends. [sent-503, score-0.271]

95 6 Conclusion In this paper, we introduced trajectory-based short-sighted SSPs, a new model to manage uncertainty in probabilistic planning problems. [sent-504, score-0.329]

96 This approach consists of pruning the state space based on the most likely trajectory between states and defining artificial goal states that guide the solution towards the original goals. [sent-505, score-0.292]

97 We empirically compared trajectory-based SSiPP with depth-based SSiPP and other state-of-the-art planners in the triangle tireworld. [sent-507, score-0.246]

98 Trajectory-based SSiPP outperforms all the other planners and is the only planner able to scale up to problem number 60, a problem in which the optimal solution contains approximately 10^70 states, under the IPPC evaluation methodology. [sent-508, score-0.338]

99 The first probabilistic track of the international planning competition. [sent-575, score-0.183]

100 RFF: A robust, FF-based MDP planning algorithm for generating policies with low probability of failure. [sent-581, score-0.168]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ssipp', 0.635), ('ssp', 0.363), ('ssps', 0.306), ('sg', 0.189), ('planner', 0.176), ('tireworld', 0.176), ('planners', 0.144), ('planning', 0.139), ('uct', 0.135), ('policy', 0.119), ('pmax', 0.106), ('triangle', 0.102), ('ss', 0.098), ('manage', 0.096), ('ippc', 0.094), ('tire', 0.086), ('rounds', 0.086), ('mdpsim', 0.071), ('spare', 0.071), ('actions', 0.068), ('car', 0.067), ('states', 0.064), ('reachability', 0.062), ('state', 0.06), ('determinization', 0.059), ('icaps', 0.059), ('ptimal', 0.059), ('bellman', 0.055), ('lc', 0.053), ('olver', 0.052), ('uncertainty', 0.05), ('reached', 0.05), ('reaches', 0.05), ('locations', 0.05), ('action', 0.049), ('ac', 0.049), ('lg', 0.048), ('ls', 0.048), ('reachable', 0.047), ('probabilistic', 0.044), ('admissible', 0.043), ('execution', 0.043), ('location', 0.04), ('ag', 0.039), ('params', 0.038), ('shortest', 0.038), ('terminates', 0.038), ('heuristic', 0.038), ('trajectory', 0.036), ('lrtdp', 0.035), ('shortsighted', 0.035), ('gs', 0.035), ('scheduling', 0.034), ('reaching', 0.034), ('path', 0.033), ('goal', 0.033), ('search', 0.032), ('yoon', 0.031), ('parametrizations', 0.031), ('winners', 0.031), ('round', 0.03), ('competition', 0.029), ('policies', 0.029), ('arti', 0.028), ('reach', 0.028), ('mina', 0.027), ('nition', 0.026), ('cial', 0.025), ('runs', 0.025), ('generator', 0.025), ('bonet', 0.024), ('enerate', 0.024), ('ighted', 0.024), ('rtdp', 0.024), ('trajectorybased', 0.024), ('trevizan', 0.024), ('automated', 0.023), ('heuristically', 0.022), ('environment', 0.022), ('asymptotically', 0.022), ('closed', 0.022), ('nodes', 0.019), ('always', 0.019), ('pruning', 0.019), ('problems', 0.018), ('initial', 0.018), ('shades', 0.018), ('optimal', 0.018), ('admissibility', 0.017), ('ap', 0.017), ('executed', 0.017), ('dence', 0.017), ('updates', 0.017), ('asynchronous', 0.016), ('goals', 0.016), ('guide', 0.016), ('fern', 0.016), ('nishes', 0.016), ('experiment', 0.016), ('move', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

Author: Felipe Trevizan, Manuela Veloso

Abstract: Probabilistic planning captures the uncertainty of plan execution by probabilistically modeling the effects of actions in the environment, and therefore the probability of reaching different states from a given state and action. In order to compute a solution for a probabilistic planning problem, planners need to manage the uncertainty associated with the different paths from the initial state to a goal state. Several approaches to manage uncertainty were proposed, e.g., consider all paths at once, perform determinization of actions, and sampling. In this paper, we introduce trajectory-based short-sighted Stochastic Shortest Path Problems (SSPs), a novel approach to manage uncertainty for probabilistic planning problems in which states reachable with low probability are substituted by artificial goals that heuristically estimate their cost to reach a goal state. We also extend the theoretical results of Short-Sighted Probabilistic Planner (SSiPP) [1] by proving that SSiPP always finishes and is asymptotically optimal under sufficient conditions on the structure of short-sighted SSPs. We empirically compare SSiPP using trajectory-based short-sighted SSPs with the winners of the previous probabilistic planning competitions and other state-of-the-art planners in the triangle tireworld problems. Trajectory-based SSiPP outperforms all the competitors and is the only planner able to scale up to problem number 60, a problem in which the optimal solution contains approximately 10^70 states. 1

2 0.10719752 20 nips-2012-A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

Author: Nicolas L. Roux, Mark Schmidt, Francis R. Bach

Abstract: We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training error and reducing the test error quickly. 1

3 0.10621543 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

Author: Arthur Guez, David Silver, Peter Dayan

Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayes-optimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration. 1

4 0.095932044 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

5 0.089520775 344 nips-2012-Timely Object Recognition

Author: Sergey Karayev, Tobias Baumgartner, Mario Fritz, Trevor Darrell

Abstract: In a large visual multi-class detection framework, the timeliness of results can be crucial. Our method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time. Toward this goal, we formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which detector to deploy next. In contrast to previous work, our method significantly diverges from the predominant greedy strategies, and is able to learn to take actions with deferred values. We evaluate our method with a novel timeliness measure, computed as the area under an Average Precision vs. Time curve. Experiments are conducted on the PASCAL VOC object detection dataset. If execution is stopped when only half the detectors have been run, our method obtains 66% better AP than a random ordering, and 14% better performance than an intelligent baseline. On the timeliness measure, our method obtains at least 11% better performance. Our method is easily extensible, as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning. 1

6 0.088210776 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

7 0.085683063 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

8 0.084899686 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

9 0.077732049 348 nips-2012-Tractable Objectives for Robust Policy Optimization

10 0.072514065 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

11 0.071544677 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

12 0.071088567 160 nips-2012-Imitation Learning by Coaching

13 0.058677621 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

14 0.056761645 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

15 0.050321493 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

16 0.049410179 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

17 0.048329435 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

18 0.045648638 205 nips-2012-MCMC for continuous-time discrete-state systems

19 0.04280366 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

20 0.041970007 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.088), (1, -0.169), (2, -0.012), (3, -0.026), (4, -0.018), (5, -0.006), (6, -0.001), (7, -0.037), (8, -0.003), (9, -0.003), (10, -0.014), (11, -0.028), (12, 0.01), (13, 0.004), (14, -0.006), (15, -0.0), (16, -0.032), (17, -0.006), (18, -0.025), (19, 0.05), (20, -0.018), (21, 0.021), (22, -0.004), (23, -0.027), (24, 0.019), (25, 0.003), (26, -0.003), (27, 0.0), (28, 0.008), (29, 0.024), (30, 0.006), (31, -0.026), (32, -0.013), (33, 0.01), (34, -0.009), (35, 0.002), (36, 0.004), (37, 0.034), (38, -0.034), (39, 0.043), (40, 0.058), (41, -0.064), (42, 0.033), (43, -0.004), (44, -0.036), (45, -0.04), (46, -0.021), (47, 0.061), (48, 0.015), (49, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90440643 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

Author: Felipe Trevizan, Manuela Veloso

Abstract: Probabilistic planning captures the uncertainty of plan execution by probabilistically modeling the effects of actions in the environment, and therefore the probability of reaching different states from a given state and action. In order to compute a solution for a probabilistic planning problem, planners need to manage the uncertainty associated with the different paths from the initial state to a goal state. Several approaches to manage uncertainty were proposed, e.g., consider all paths at once, perform determinization of actions, and sampling. In this paper, we introduce trajectory-based short-sighted Stochastic Shortest Path Problems (SSPs), a novel approach to manage uncertainty for probabilistic planning problems in which states reachable with low probability are substituted by artificial goals that heuristically estimate their cost to reach a goal state. We also extend the theoretical results of Short-Sighted Probabilistic Planner (SSiPP) [1] by proving that SSiPP always finishes and is asymptotically optimal under sufficient conditions on the structure of short-sighted SSPs. We empirically compare SSiPP using trajectory-based short-sighted SSPs with the winners of the previous probabilistic planning competitions and other state-of-the-art planners in the triangle tireworld problems. Trajectory-based SSiPP outperforms all the competitors and is the only planner able to scale up to problem number 60, a problem in which the optimal solution contains approximately 10^70 states. 1

2 0.70145422 348 nips-2012-Tractable Objectives for Robust Policy Optimization

Author: Katherine Chen, Michael Bowling

Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1

3 0.6983484 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

Author: Arthur Guez, David Silver, Peter Dayan

Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayes-optimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration. 1

4 0.6809839 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Author: Bruno Scherrer, Boris Lesner

Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one can compute stationary policies that are 2γ/(1−γ)²-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to 2γ/(1−γ)-optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1

5 0.67952263 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume

Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1

6 0.67253739 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

7 0.66277373 38 nips-2012-Algorithms for Learning Markov Field Policies

8 0.63039118 160 nips-2012-Imitation Learning by Coaching

9 0.60663068 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

10 0.59858328 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

11 0.58497411 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

12 0.56536072 344 nips-2012-Timely Object Recognition

13 0.56497037 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

14 0.56473976 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

15 0.5588485 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

16 0.55630672 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

17 0.55028409 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

18 0.53450364 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

19 0.53447509 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

20 0.52949429 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.019), (17, 0.016), (20, 0.368), (21, 0.012), (38, 0.079), (42, 0.031), (54, 0.071), (55, 0.028), (74, 0.046), (76, 0.101), (80, 0.087), (92, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.67705601 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

Author: Felipe Trevizan, Manuela Veloso

Abstract: Probabilistic planning captures the uncertainty of plan execution by probabilistically modeling the effects of actions in the environment, and therefore the probability of reaching different states from a given state and action. In order to compute a solution for a probabilistic planning problem, planners need to manage the uncertainty associated with the different paths from the initial state to a goal state. Several approaches to manage uncertainty were proposed, e.g., consider all paths at once, perform determinization of actions, and sampling. In this paper, we introduce trajectory-based short-sighted Stochastic Shortest Path Problems (SSPs), a novel approach to manage uncertainty for probabilistic planning problems in which states reachable with low probability are substituted by artificial goals that heuristically estimate their cost to reach a goal state. We also extend the theoretical results of Short-Sighted Probabilistic Planner (SSiPP) [1] by proving that SSiPP always finishes and is asymptotically optimal under sufficient conditions on the structure of short-sighted SSPs. We empirically compare SSiPP using trajectory-based short-sighted SSPs with the winners of the previous probabilistic planning competitions and other state-of-the-art planners in the triangle tireworld problems. Trajectory-based SSiPP outperforms all the competitors and is the only planner able to scale up to problem number 60, a problem in which the optimal solution contains approximately 10^70 states. 1

2 0.54323977 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion

Author: Thanh Ngo, Yousef Saad

Abstract: This paper describes gradient methods based on a scaled metric on the Grassmann manifold for low-rank matrix completion. The proposed methods significantly improve canonical gradient methods, especially on ill-conditioned matrices, while maintaining established global convergence and exact recovery guarantees. A connection between a form of subspace iteration for matrix completion and the scaled gradient descent procedure is also established. The proposed conjugate gradient method based on the scaled gradient outperforms several existing algorithms for matrix completion and is competitive with recently proposed methods. 1

3 0.51681513 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

Author: Will Zou, Shenghuo Zhu, Kai Yu, Andrew Y. Ng

Abstract: We apply salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision. With tracked sequences as input, a hierarchical network of modules learns invariant features using a temporal slowness constraint. The network encodes invariance which are increasingly complex with hierarchy. Although learned from videos, our features are spatial instead of spatial-temporal, and well suited for extracting features from still images. We applied our features to four datasets (COIL-100, Caltech 101, STL-10, PubFig), and observe a consistent improvement of 4% to 5% in classification accuracy. With this approach, we achieve state-of-the-art recognition accuracy 61% on STL-10 dataset. 1

4 0.46308556 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

Author: Arnak Dalalyan, Yin Chen

Abstract: In this paper, we develop a novel approach to the problem of learning sparse representations in the context of fused sparsity and unknown noise level. We propose an algorithm, termed Scaled Fused Dantzig Selector (SFDS), that accomplishes the aforementioned learning task by means of a second-order cone program. A special emphasis is put on the particular instance of fused sparsity corresponding to learning in the presence of outliers. We establish finite sample risk bounds and carry out an experimental evaluation on both synthetic and real data. 1

5 0.4382309 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

Author: James Lloyd, Peter Orbanz, Zoubin Ghahramani, Daniel M. Roy

Abstract: A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases. 1

6 0.43402877 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

7 0.43240437 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

8 0.43047744 344 nips-2012-Timely Object Recognition

9 0.42807978 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

10 0.42267692 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

11 0.42090115 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

12 0.41923076 279 nips-2012-Projection Retrieval for Classification

13 0.41921735 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

14 0.41826147 38 nips-2012-Algorithms for Learning Markov Field Policies

15 0.41703418 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

16 0.41572261 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

17 0.41558 348 nips-2012-Tractable Objectives for Robust Policy Optimization

18 0.41549093 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

19 0.41544932 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

20 0.41522399 160 nips-2012-Imitation Learning by Coaching