nips nips2013 nips2013-250 knowledge-graph by maker-knowledge-mining

250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning


Source: pdf

Author: Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles Isbell, Andrea L. Thomaz

Abstract: A long term goal of Interactive Reinforcement Learning is to incorporate nonexpert human feedback to solve complex tasks. Some state-of-the-art methods have approached this problem by mapping human information to rewards and values and iterating over them to compute better control policies. In this paper we argue for an alternate, more effective characterization of human feedback: Policy Shaping. We introduce Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct policy labels. We compare Advise to state-of-the-art approaches and show that it can outperform them and is robust to infrequent and inconsistent human feedback.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 A long term goal of Interactive Reinforcement Learning is to incorporate nonexpert human feedback to solve complex tasks. [sent-5, score-0.623]

2 We introduce Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct policy labels. [sent-8, score-0.797]

3 This paper focuses specifically on integrating human feedback with Reinforcement Learning. [sent-11, score-0.65]

4 One way to address this problem is to treat human feedback as a shaping reward [1–5]. [sent-12, score-1.147]

5 Yet, recent papers have observed that a more effective use of human feedback is as direct information about policies [6, 7]. [sent-13, score-0.623]

6 Most techniques for learning from human feedback still, however, convert feedback signals into a reward or a value. [sent-14, score-1.371]

7 In this paper we introduce Policy Shaping, which formalizes the meaning of human feedback as policy feedback, and demonstrates how to use it directly as policy advice. [sent-15, score-0.971]

8 We also introduce Advise, an algorithm for estimating a human’s Bayes optimal feedback policy and a technique for combining this with the policy formed from the agent’s direct experience in the environment (Bayesian Q-Learning). [sent-16, score-0.838]

9 The results demonstrate two advantages of Advise: 1) it is able to outperform state-of-the-art techniques for integrating human feedback with Reinforcement Learning; and 2) by formalizing human feedback, we avoid ad hoc parameter settings and are robust to infrequent and inconsistent feedback. [sent-19, score-0.99]

10 An MDP is specified by the tuple (S, A, T, R), which defines the set of possible world states, S, the set of actions available to the agent in each state, A, the transition function T : S × A → Pr[S], a reward function R : S × A → R, and a discount factor 0 ≤ γ ≤ 1. [sent-21, score-0.442]
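
For orientation, the sketch below packages this tuple into a small Python structure; the class and field names are ours and purely illustrative, not something defined in the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MDP:
    # Container for the tuple (S, A, T, R) plus the discount factor gamma.
    states: List[str]                                    # S
    actions: List[str]                                   # A
    transition: Callable[[str, str], Dict[str, float]]   # T : S x A -> Pr[S]
    reward: Callable[[str, str], float]                  # R : S x A -> reals
    gamma: float = 0.95                                  # discount, 0 <= gamma <= 1
```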

11 Thus, the reward function acts as a single source of information that tells an agent what is the best policy for this MDP. [sent-23, score-0.591]

12 Q-learning is one way to find an optimal policy from the environment reward signal. [sent-25, score-0.516]

13 A specific Q-value, Q[s, a], represents a point estimate of the long-term expected discounted reward for taking action a in state s. [sent-27, score-0.518]

14 These values are updated each time an agent performs an action a in state s, accumulates reward r, and transitions to a new state s′. [sent-31, score-0.643]
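
The update cycle described above is the standard tabular one. The sketch below shows the classic point-estimate Q-learning rule for that cycle; note that the paper's agent actually uses Bayesian Q-Learning, which maintains a distribution over each Q[s, a], so this simpler rule (with an assumed learning rate alpha) is only for orientation.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    # Q maps (state, action) -> point estimate of long-term discounted reward.
    # Classic point-estimate update; the paper's agent uses Bayesian Q-Learning
    # (distributions over Q-values), so treat this only as an illustration.
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q
```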

15 The reward signal can be modified to suit the addition of a new information source (this is known as reward shaping [11]). [sent-38, score-0.824]

16 This is the most common way human feedback has been applied to RL [1–5]. [sent-39, score-0.623]

17 However, several difficulties arise when integrating human feedback signals that may be infrequent or occasionally inconsistent with the optimal policy, violating the necessary and sufficient condition that a shaping function be potential-based [11]. [sent-40, score-0.954]
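
For reference, the potential-based condition cited here [11] means the shaping term has the form F(s, s′) = γφ(s′) − φ(s) for some potential function φ. The sketch below is a minimal rendering of that idea; the potential function phi is a user-supplied placeholder, not anything specified in this paper.

```python
def shaped_reward(r_env, s, s_next, phi, gamma=0.95):
    # Potential-based reward shaping [11]: adding F(s, s') = gamma*phi(s') - phi(s)
    # to the environment reward leaves the optimal policy unchanged.
    # phi is an arbitrary user-chosen potential function.
    return r_env + gamma * phi(s_next) - phi(s)
```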

18 Researchers have also extended reward shaping to account for idiosyncrasies in human input. [sent-43, score-0.699]

19 For example, adding a drift parameter to account for the human tendency to give less feedback over time [1, 12]. [sent-44, score-0.623]

20 Advancements in recent work sidestep some of these issues by showing human feedback can instead be used as policy feedback. [sent-45, score-0.797]

21 For example, Thomaz and Breazeal [6] added an UNDO function to the negative feedback signal, which forced an agent to backtrack to the previous state after its value update. [sent-46, score-0.589]

22 Work by Knox and Stone [7, 13] has shown that a general improvement to learning from human feedback is possible if it is used to directly modify the action selection mechanism of the Reinforcement Learning algorithm. [sent-47, score-0.834]

23 Although both approaches use human feedback to modify an agent’s exploration policy, they still treat human feedback as either a reward or a value. [sent-48, score-1.589]

24 In our work, we assume human feedback is not an evaluative reward, but is a label on the optimality of actions. [sent-49, score-0.656]

25 Thus the human’s feedback is making a direct statement about the policy itself, rather than influencing the policy through a reward. [sent-50, score-0.796]

26 In other works, rather than providing a reward shaping input, the human provides demonstrations of the optimal policy. [sent-51, score-0.929]

27 Several papers have shown how the policy information in human demonstrations can be used for inverse optimal control [14, 15], to seed an agent’s exploration [16, 17], and in some cases be used entirely in place of exploration [18, 19]. [sent-52, score-0.494]

28 Our position that human feedback be used as direct policy advice is related to work in transfer learning [20, 21], in which an agent learns with “advice” about how it should behave. [sent-54, score-0.978]

29 Our approach only requires very high-level feedback (right/wrong) and is provided interactively. [sent-56, score-0.448]

30 4 Policy Shaping In this section, we formulate human feedback as policy advice, and derive a Bayes optimal algorithm for converting that feedback into a policy. [sent-57, score-1.292]

31 We also describe how to combine the feedback policy with the policy of an underlying Reinforcement Learning algorithm. [sent-58, score-0.796]

32 Using feedback in this way is not a trivial matter of pruning actions from the search tree. [sent-67, score-0.473]

33 Here, we assume a human providing feedback knows the right answer, but noise in the feedback channel introduces inconsistencies between what the human intends to communicate and what the agent observes. [sent-69, score-1.363]

34 Thus, feedback is consistent with the optimal policy with probability C, where 0 < C < 1. [sent-70, score-0.647]

35 We also assume that a human watching an agent learn may not provide feedback after every single action; thus the likelihood, L, of receiving feedback satisfies 0 < L < 1. [sent-71, score-1.188]

36 In the event feedback is received, it is interpreted as a comment on the optimality of the action just performed. [sent-72, score-0.654]

37 The issue of credit assignment that naturally arises with learning from real human feedback is left for future work (see [13] for an implementation of credit assignment in a different framework for learning from human feedback). [sent-73, score-0.84]

38 More formally: πF (s, a) ∝ C^∆s,a · ∏j≠a (1 − C)^∆s,j (Equation 2). We take Equation 2 to be the probability of performing s, a according to the feedback policy, πF. [sent-82, score-0.448]

39 This is the Bayes optimal feedback policy given the “right” and “wrong” labels seen, the value for C, and that only one action is optimal per state. [sent-85, score-0.85]
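
A minimal sketch of Equation 2, assuming delta_s[a] holds the number of "right" labels minus the number of "wrong" labels received for action a in the current state, and C is the assumed feedback consistency; the variable names are ours.

```python
def feedback_policy(delta_s, C):
    # delta_s[a] = (# "right" labels) - (# "wrong" labels) for action a in the
    # current state; 0 < C < 1 is the assumed feedback consistency.
    # Unnormalized Equation 2: C**delta_s[a] * prod over j != a of (1 - C)**delta_s[j]
    weights = {}
    for a in delta_s:
        w = C ** delta_s[a]
        for j, d in delta_s.items():
            if j != a:
                w *= (1.0 - C) ** d
        weights[a] = w
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}
```

For example, with C = 0.8 and delta_s = {'left': 2, 'right': -1}, the returned distribution places nearly all of its mass on 'left'.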

40 Provided C ≠ 0.5, the total amount of information received from the human should be enough for the agent to choose the optimal policy with probability 1. [sent-96, score-0.522]

41 Before an agent is completely confident in either policy, however, it has to determine what action to perform using the policy information each provides. [sent-99, score-0.469]

42 Note that the consistency of feedback is not the same as the human’s or the agent’s confidence that the feedback is correct. [sent-100, score-0.97]

43 Note that BQL can only approximately estimate the uncertainty that each action is optimal from the environment reward signal. [sent-108, score-0.52]
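
The excerpt does not reproduce the exact rule for combining the feedback policy with the BQL policy; the sketch below is one plausible reading, multiplying the two action distributions and renormalizing, and should be read as our assumption rather than the paper's stated mechanism.

```python
def combine_policies(pi_rl, pi_feedback):
    # pi_rl and pi_feedback map action -> probability for the current state.
    # Multiplying and renormalizing lets each source's certainty weight the
    # result; this combination rule is an assumption made for this sketch.
    combined = {a: pi_rl[a] * pi_feedback.get(a, 0.0) for a in pi_rl}
    total = sum(combined.values())
    if total == 0.0:
        return dict(pi_rl)  # fall back to the RL policy if the product vanishes
    return {a: p / total for a, p in combined.items()}
```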

44 Constructing an Oracle: We used a simulated oracle in place of human feedback, because this allows us to systematically vary the parameters of feedback likelihood, L, and consistency, C, and test different learning settings in which human feedback is less than ideal. [sent-133, score-1.265]

45 For states with multiple optimal actions, a small negative reward (-10) was added to the environment reward signal of the extra optimal state-action pairs to preserve the assumption that only one action be optimal in each state. [sent-135, score-0.87]
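
A minimal sketch of such a simulated oracle, parameterized by the feedback likelihood L and consistency C described above; the function name, return convention (+1 / -1 / None), and default parameter values are our own placeholders.

```python
import random

def oracle_feedback(action, optimal_action, L=0.5, C=0.8, rng=random):
    # With probability L the oracle comments on the action just taken; the
    # comment agrees with the optimal policy with probability C.
    # Returns +1 ("right"), -1 ("wrong"), or None (no feedback given).
    if rng.random() >= L:
        return None
    truthful = 1 if action == optimal_action else -1
    return truthful if rng.random() < C else -truthful
```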

46 Knox and Stone [7, 13] tried eight different strategies for combining feedback with an environmental reward signal. [sent-138, score-0.748]

47 Table 1: Comparing the learning rates of BQL + Advise to BQL + Action Biasing, BQL + Control Sharing, and BQL + Reward Shaping for four different combinations of feedback likelihood, L, and consistency, C, across two domains. [sent-209, score-0.625]

48 Each entry represents the average and standard deviation of the cumulative reward in 300 episodes, expressed as the percent of the maximum possible cumulative reward for the domain with respect to the BQL baseline. [sent-210, score-0.625]

49 Both of these methods use human feedback rewards to modify the policy, rather than shape the MDP reward function. [sent-214, score-0.955]

50 Thus, they still convert human feedback to a value but recognize that the information contained in that value is policy information. [sent-215, score-0.797]

51 Action Biasing uses human feedback to bias the action selection mechanism of the underlying RL algorithm. [sent-217, score-0.815]

52 Positive and negative feedback are assigned rewards rh and −rh, respectively. [sent-218, score-0.933]

53 A table of values, H[s, a], stores the feedback signal for s, a. [sent-219, score-0.448]

54 The modified action selection mechanism is given as argmax_a Q̂(s, a) + B[s, a] ∗ H[s, a], where Q̂(s, a) is an estimate of the long-term expected discounted reward for s, a from BQL, and B[s, a] controls the influence of feedback on learning. [sent-220, score-0.97]

55 The value of B[s, a] is incremented by a constant b when feedback is received for s, a, and is decayed by a constant d at all other time steps. [sent-221, score-0.488]
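
A sketch of Action Biasing as described in the two sentences above: selection maximizes Q̂(s, a) + B[s, a] ∗ H[s, a], and B[s, a] grows by b when feedback arrives and is decayed by d otherwise. The default value of d, the subtractive form of the decay, and the clamp at zero are placeholders, since the excerpt truncates the actual settings.

```python
def action_biasing_select(s, actions, Q, H, B):
    # Pick the action maximizing Q_hat(s, a) + B[s, a] * H[s, a], where H holds
    # the (+rh / -rh) feedback rewards and B controls the feedback's influence.
    return max(actions,
               key=lambda a: Q.get((s, a), 0.0) + B.get((s, a), 0.0) * H.get((s, a), 0.0))

def update_bias(B, s, a, got_feedback, b=1.0, d=0.001):
    # B[s, a] is incremented by b when feedback arrives for (s, a) and decayed
    # by d at every other step; the decay form and the value of d are placeholders.
    if got_feedback:
        B[(s, a)] = B.get((s, a), 0.0) + b
    else:
        B[(s, a)] = max(0.0, B.get((s, a), 0.0) - d)
    return B
```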

56 Control Sharing modifies the action selection mechanism directly with the addition of a transition between 1) the action that gains an agent the maximum known reward according to feedback, and 2) the policy produced using the original action selection method. [sent-222, score-1.139]

57 An agent transfers control to a feedback policy as feedback is received, and begins to switch control to the underlying RL algorithm as B[s, a] decays. [sent-225, score-1.271]
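
A sketch of Control Sharing as described above; the specific gating rule (following the feedback-preferred action with probability min(B[s, a], 1)) is our assumption, since the excerpt only says that control shifts between the feedback policy and the underlying RL policy as B[s, a] decays.

```python
import random

def control_sharing_select(s, actions, rl_select, H, B, rng=random):
    # a_star is the action the accumulated feedback favors most.  With a
    # probability tied to B[s, a_star] the agent follows that feedback choice;
    # otherwise it defers to the underlying RL action-selection mechanism.
    a_star = max(actions, key=lambda a: H.get((s, a), 0.0))
    p_feedback = min(B.get((s, a_star), 0.0), 1.0)
    if H.get((s, a_star), 0.0) > 0.0 and rng.random() < p_feedback:
        return a_star
    return rl_select(s, actions)
```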

58 Although feedback is initially interpreted as a reward, Control Sharing does not use that information, and thus is unaffected if the value of rh is changed. [sent-226, score-0.648]

59 Action Biasing, Control Sharing, and Reward Shaping used a feedback influence of b = 1 and a decay factor of d = 0. [sent-240, score-0.448]

60 For Reward Shaping we set rh = 100 in Pac-Man and rh = 1 in Frogger. We compared the methods using four different combinations of feedback likelihood, L, and consistency, C, in Pac-Man and Frogger, for a total of eight experiments. [sent-243, score-0.818]

61 In the ideal case of frequent and correct feedback (L = 1. [sent-247, score-0.469]

62 A human reward that does not match both the feedback consistency and the domain may fail to eliminate unnecessary exploration and produce learning rates similar to or worse than the baseline. [sent-251, score-1.046]

63 Advise avoided these issues by not converting feedback into a reward. [sent-252, score-0.47]

64 The learning curves show one example from each of the non-ideal conditions that we tested: reduced feedback consistency (L = 1.0 with reduced C). [sent-254, score-0.567]

65 We used the conversion rh = 1, 10, 100, or 1000 that maximized MDP reward in the ideal case to also evaluate the three cases of non-ideal feedback. [sent-257, score-0.52]

66 Action Biasing does better than Advise in one case in part because the feedback likelihood is high enough to counter Action Biasing’s overly influential feedback policy. [sent-274, score-0.909]

67 The cumulative reward numbers in Table 1 show that Advise always performed near or above the BQL baseline, which indicates robustness to reduced feedback frequency and consistency. [sent-279, score-0.793]

68 Thus having a prior estimate of the feedback consistency (the value of C) allows Advise to balance what it learns from the human appropriately with its own learned policy. [sent-283, score-0.697]

69 How the Reward Parameter Affects Action Biasing: In contrast to Advise, Action Biasing and Control Sharing do not use an explicit model of the feedback consistency. [sent-290, score-0.448]

70 The optimal values of rh, b, and d for learning with consistent feedback may be the wrong values to use for learning with inconsistent feedback. [sent-291, score-0.75]

71 Here, we test how Action Biasing performed with a range of values for rh for the case of moderate feedback (L = 0. [sent-292, score-0.693]

72 The conversion from feedback into reward was set to either rh = 500 or 1000. [sent-300, score-0.947]

73 The results in Figure 3 show that a large value for rh is appropriate for more consistent feedback; a small value for rh is best for reduced consistency. [sent-303, score-0.415]

74 This is clear in Pac-Man when a reward of rh = 1000 led to better-than-baseline learning performance in the moderate feedback case, but decreased learning rates dramatically below the baseline in the reduced consistency case. [sent-304, score-1.134]

75 A reward of zero produced the best results in the reduced consistency case. [sent-305, score-0.419]

76 A suitable setting of rh (together with b and d) can produce good results with moderate feedback in both Pac-Man and Frogger. [sent-310, score-0.508]

77 The use of a human influence parameter B[s, a] to modulate the value for rh is presumably meant to help make Action Biasing more robust to reduced consistency. [sent-311, score-0.418]

78 The value for B[s, a] is, however, increased by b whenever feedback is received. (The results with Reward Shaping are misleading because it can end up in infinite loops when feedback is infrequent or inconsistent with the optimal policy.) [sent-312, score-0.599]

80 Figure 3: How different feedback reward values affected BQL + Action Biasing. [sent-325, score-0.748]

81 Results were computed for the moderate feedback case (L = 0. [sent-328, score-0.508]

82 B[s, a] is increased by b when feedback is received and reduced by d over time; b and d are more a function of the domain than of the information in accumulated feedback. [sent-333, score-0.542]

83 In contrast, the estimated feedback consistency, Ĉ, used by Advise depends only on the true feedback consistency, C. [sent-349, score-0.896]

84 Using an Inaccurate Estimate of Feedback Consistency: Interactions with a real human will mean that in most cases Advise will not have an exact estimate, Ĉ, of the true feedback consistency, C. [sent-352, score-0.623]

85 In contrast, using an overestimate of C boosted learning rates for these particular domains and this case of feedback quality. [sent-370, score-0.463]

86 In general, however, overestimating C can lead to a suboptimal policy especially if feedback is provided very infrequently. [sent-371, score-0.622]

87 Discussion: Overall, our experiments indicate that it is useful to interpret feedback as a direct comment on the optimality of an action, without converting it into a reward or a value. [sent-373, score-0.798]

88 The performance of Action Biasing and Control Sharing was not as good as Advise in many cases (as shown in Table 1) because they use feedback as policy information only after it has been converted into a reward. [sent-375, score-0.638]

89 Figure 4: The larger Frogger domain and the corresponding learning results for the case of moderate feedback. [sent-390, score-0.533]

90 Control Sharing is especially sensitive to how well the value of the feedback influence parameter, B[s, a], approximates the amount of information in both policies. [sent-402, score-0.463]

91 Its performance bottomed out in some cases with infrequent and inconsistent feedback because B[s, a] overestimated the amount of information in the feedback policy. [sent-403, score-1.023]

92 Advise has only one input parameter, the estimated feedback consistency, Ĉ, in contrast to three. [sent-407, score-0.448]

93 Ĉ is a fundamental parameter that depends only on the true feedback consistency, C, and does not change if the domain size is increased. [sent-408, score-0.473]

94 When it has the right value for Ĉ, Advise represents the exact amount of information in the accumulated feedback in each state, and then combines it with the BQL policy using an amount of influence equivalent to the amount of information in each policy. [sent-409, score-0.691]

95 We are also interested in addressing other aspects of human feedback like errors in credit assignment. [sent-414, score-0.644]

96 Conclusion: This paper defined the Policy Shaping paradigm for integrating feedback with Reinforcement Learning. [sent-419, score-0.475]

97 We introduced Advise, which tries to maximize the utility of feedback using a Bayesian approach to learning. [sent-420, score-0.448]

98 With these advancements this paper may help to make learning from human feedback an increasingly viable option for intelligent systems. [sent-422, score-0.64]

99 W. B. Knox and P. Stone, “Combining manual feedback with subsequent MDP reward signals for reinforcement learning,” in Proc. AAMAS, 2010. [sent-474, score-0.866]

100 A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in Proc. ICML, 1999. [sent-502, score-0.6]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('feedback', 0.448), ('bql', 0.404), ('advise', 0.372), ('reward', 0.3), ('frogger', 0.266), ('shaping', 0.224), ('biasing', 0.19), ('rh', 0.185), ('action', 0.178), ('human', 0.175), ('policy', 0.174), ('reinforcement', 0.118), ('agent', 0.117), ('episodes', 0.096), ('sharing', 0.08), ('consistency', 0.074), ('moderate', 0.06), ('infrequent', 0.057), ('inconsistent', 0.055), ('knox', 0.047), ('reduced', 0.045), ('stone', 0.045), ('food', 0.043), ('control', 0.042), ('mdp', 0.04), ('wrong', 0.037), ('isbell', 0.035), ('advice', 0.033), ('pellets', 0.032), ('thomaz', 0.032), ('water', 0.031), ('demonstrations', 0.03), ('manually', 0.028), ('aamas', 0.028), ('rl', 0.028), ('cars', 0.027), ('integrating', 0.027), ('ghost', 0.026), ('domain', 0.025), ('actions', 0.025), ('optimal', 0.025), ('decayed', 0.024), ('exploration', 0.024), ('state', 0.024), ('accumulated', 0.024), ('consisted', 0.022), ('converting', 0.022), ('baseline', 0.022), ('credit', 0.021), ('avatar', 0.021), ('breazeal', 0.021), ('hazards', 0.021), ('maclin', 0.021), ('ideal', 0.021), ('uence', 0.02), ('duration', 0.02), ('ng', 0.02), ('curves', 0.02), ('grid', 0.02), ('modify', 0.019), ('oracle', 0.019), ('shelton', 0.019), ('ghosts', 0.019), ('kaelbling', 0.019), ('shavlik', 0.019), ('hazard', 0.019), ('evaluative', 0.019), ('russell', 0.018), ('advancements', 0.017), ('interactive', 0.017), ('environment', 0.017), ('sources', 0.016), ('transfer', 0.016), ('interactively', 0.016), ('watkins', 0.016), ('tuned', 0.016), ('discounted', 0.016), ('inaccurate', 0.016), ('received', 0.016), ('converted', 0.016), ('car', 0.016), ('art', 0.016), ('amount', 0.015), ('bayes', 0.015), ('unaffected', 0.015), ('cartesian', 0.015), ('overestimate', 0.015), ('position', 0.015), ('comment', 0.014), ('argmaxa', 0.014), ('mechanism', 0.014), ('loops', 0.014), ('conversion', 0.014), ('optimality', 0.014), ('detrimental', 0.013), ('road', 0.013), ('presumably', 0.013), ('likelihood', 0.013), ('hoc', 0.013), ('rewards', 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning

Author: Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles Isbell, Andrea L. Thomaz

Abstract: A long term goal of Interactive Reinforcement Learning is to incorporate nonexpert human feedback to solve complex tasks. Some state-of-the-art methods have approached this problem by mapping human information to rewards and values and iterating over them to compute better control policies. In this paper we argue for an alternate, more effective characterization of human feedback: Policy Shaping. We introduce Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct policy labels. We compare Advise to state-of-the-art approaches and show that it can outperform them and is robust to infrequent and inconsistent human feedback.

2 0.30647078 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents

Author: Xiaoxiao Guo, Satinder Singh, Richard L. Lewis

Abstract: We consider how to transfer knowledge from previous tasks (MDPs) to a current task in long-lived and bounded agents that must solve a sequence of tasks over a finite lifetime. A novel aspect of our transfer approach is that we reuse reward functions. While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent’s behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent’s performance relative to other approaches, including an approach that transfers policies. 1

3 0.18403704 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

Author: David J. Weiss, Ben Taskar

Abstract: Discriminative methods for learning structured models have enabled wide-spread use of very rich feature representations. However, the computational cost of feature extraction is prohibitive for large-scale or time-sensitive applications, often dominating the cost of inference in the models. Significant efforts have been devoted to sparsity-based model selection to decrease this cost. Such feature selection methods control computation statically and miss the opportunity to finetune feature extraction to each input at run-time. We address the key challenge of learning to control fine-grained feature extraction adaptively, exploiting nonhomogeneity of the data. We propose an architecture that uses a rich feedback loop between extraction and prediction. The run-time control policy is learned using efficient value-function approximation, which adaptively determines the value of information of features at the level of individual variables for each input. We demonstrate significant speedups over state-of-the-art methods on two challenging datasets. For articulated pose estimation in video, we achieve a more accurate state-of-the-art model that is also faster, with similar results on an OCR task. 1

4 0.16915143 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods

Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta

Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1

5 0.16440254 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs

Author: Prashanth L.A., Mohammad Ghavamzadeh

Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application. 1

6 0.1633082 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

7 0.1536918 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems

8 0.1469453 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search

9 0.14312734 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting

10 0.14086185 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

11 0.13984436 162 nips-2013-Learning Trajectory Preferences for Manipulators via Iterative Improvement

12 0.13578795 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

13 0.13504077 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration

14 0.12057952 348 nips-2013-Variational Policy Search via Trajectory Optimization

15 0.12017155 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs

16 0.11994146 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling

17 0.11851304 257 nips-2013-Projected Natural Actor-Critic

18 0.11251782 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries

19 0.10500304 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes

20 0.10147592 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.153), (1, -0.287), (2, -0.142), (3, 0.059), (4, -0.009), (5, -0.067), (6, 0.038), (7, -0.01), (8, -0.084), (9, 0.027), (10, -0.03), (11, 0.039), (12, 0.02), (13, -0.048), (14, -0.123), (15, 0.131), (16, 0.014), (17, 0.061), (18, 0.085), (19, -0.084), (20, -0.043), (21, 0.001), (22, -0.01), (23, 0.002), (24, -0.048), (25, -0.009), (26, 0.018), (27, -0.018), (28, -0.032), (29, 0.028), (30, 0.042), (31, -0.047), (32, -0.009), (33, 0.012), (34, 0.002), (35, -0.067), (36, -0.066), (37, 0.01), (38, -0.082), (39, -0.004), (40, 0.084), (41, 0.02), (42, 0.076), (43, 0.023), (44, 0.059), (45, 0.022), (46, -0.172), (47, 0.022), (48, 0.046), (49, -0.088)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98137546 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning

Author: Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles Isbell, Andrea L. Thomaz

Abstract: A long term goal of Interactive Reinforcement Learning is to incorporate nonexpert human feedback to solve complex tasks. Some state-of-the-art methods have approached this problem by mapping human information to rewards and values and iterating over them to compute better control policies. In this paper we argue for an alternate, more effective characterization of human feedback: Policy Shaping. We introduce Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct policy labels. We compare Advise to state-of-the-art approaches and show that it can outperform them and is robust to infrequent and inconsistent human feedback.

2 0.82665724 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents

Author: Xiaoxiao Guo, Satinder Singh, Richard L. Lewis

Abstract: We consider how to transfer knowledge from previous tasks (MDPs) to a current task in long-lived and bounded agents that must solve a sequence of tasks over a finite lifetime. A novel aspect of our transfer approach is that we reuse reward functions. While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent’s behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent’s performance relative to other approaches, including an approach that transfers policies. 1

3 0.63651311 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting

Author: Shunan Zhang, Angela J. Yu

Abstract: How humans achieve long-term goals in an uncertain environment, via repeated trials and noisy observations, is an important problem in cognitive science. We investigate this behavior in the context of a multi-armed bandit task. We compare human behavior to a variety of models that vary in their representational and computational complexity. Our result shows that subjects’ choices, on a trial-totrial basis, are best captured by a “forgetful” Bayesian iterative learning model [21] in combination with a partially myopic decision policy known as Knowledge Gradient [7]. This model accounts for subjects’ trial-by-trial choice better than a number of other previously proposed models, including optimal Bayesian learning and risk minimization, ε-greedy and win-stay-lose-shift. It has the added benefit of being closest in performance to the optimal Bayesian model than all the other heuristic models that have the same computational complexity (all are significantly less complex than the optimal model). These results constitute an advancement in the theoretical understanding of how humans negotiate the tension between exploration and exploitation in a noisy, imperfectly known environment. 1

4 0.59809178 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

Author: Adhiraj Somani, Nan Ye, David Hsu, Wee Sun Lee

Abstract: POMDPs provide a principled framework for planning under uncertainty, but are computationally intractable, due to the “curse of dimensionality” and the “curse of history”. This paper presents an online POMDP algorithm that alleviates these difficulties by focusing the search on a set of randomly sampled scenarios. A Determinized Sparse Partially Observable Tree (DESPOT) compactly captures the execution of all policies on these scenarios. Our Regularized DESPOT (R-DESPOT) algorithm searches the DESPOT for a policy, while optimally balancing the size of the policy and its estimated value obtained under the sampled scenarios. We give an output-sensitive performance bound for all policies derived from a DESPOT, and show that R-DESPOT works well if a small optimal policy exists. We also give an anytime algorithm that approximates R-DESPOT. Experiments show strong results, compared with two of the fastest online POMDP algorithms. Source code along with experimental settings are available at http://bigbird.comp. nus.edu.sg/pmwiki/farm/appl/. 1

5 0.58589917 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search

Author: Aijun Bai, Feng Wu, Xiaoping Chen

Abstract: Monte-Carlo tree search (MCTS) has been drawing great interest in recent years for planning and learning under uncertainty. One of the key challenges is the trade-off between exploration and exploitation. To address this, we present a novel approach for MCTS using Bayesian mixture modeling and inference based Thompson sampling and apply it to the problem of online planning in MDPs. Our algorithm, named Dirichlet-NormalGamma MCTS (DNG-MCTS), models the uncertainty of the accumulated reward for actions in the search tree as a mixture of Normal distributions. We perform inferences on the mixture in Bayesian settings by choosing conjugate priors in the form of combinations of Dirichlet and NormalGamma distributions and select the best action at each decision node using Thompson sampling. Experimental results confirm that our algorithm advances the state-of-the-art UCT approach with better values on several benchmark problems. 1

6 0.57577527 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems

7 0.57210726 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes

8 0.5619539 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs

9 0.54558879 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs

10 0.54074442 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

11 0.53759205 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs

12 0.53482777 165 nips-2013-Learning from Limited Demonstrations

13 0.53349292 21 nips-2013-Action from Still Image Dataset and Inverse Optimal Control to Learn Task Specific Visual Scanpaths

14 0.51360518 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling

15 0.4709354 323 nips-2013-Synthesizing Robust Plans under Incomplete Domain Models

16 0.46982384 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration

17 0.45197055 69 nips-2013-Context-sensitive active sensing in humans

18 0.45048514 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

19 0.44363391 347 nips-2013-Variational Planning for Graph-based MDPs

20 0.43776339 7 nips-2013-A Gang of Bandits


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.018), (16, 0.04), (33, 0.08), (34, 0.121), (41, 0.054), (49, 0.024), (56, 0.162), (70, 0.025), (76, 0.241), (85, 0.044), (89, 0.02), (93, 0.034), (95, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.81316644 261 nips-2013-Rapid Distance-Based Outlier Detection via Sampling

Author: Mahito Sugiyama, Karsten Borgwardt

Abstract: Distance-based approaches to outlier detection are popular in data mining, as they do not require to model the underlying probability distribution, which is particularly challenging for high-dimensional data. We present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. We report the surprising observation that a simple, sampling-based scheme outperforms state-of-the-art techniques in terms of both efficiency and effectiveness. To better understand this phenomenon, we provide a theoretical analysis why the sampling-based approach outperforms alternative methods based on k-nearest neighbor search. 1

same-paper 2 0.78955024 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning

Author: Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles Isbell, Andrea L. Thomaz

Abstract: A long term goal of Interactive Reinforcement Learning is to incorporate nonexpert human feedback to solve complex tasks. Some state-of-the-art methods have approached this problem by mapping human information to rewards and values and iterating over them to compute better control policies. In this paper we argue for an alternate, more effective characterization of human feedback: Policy Shaping. We introduce Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct policy labels. We compare Advise to state-of-the-art approaches and show that it can outperform them and is robust to infrequent and inconsistent human feedback.

3 0.78816491 359 nips-2013-Σ-Optimality for Active Learning on Gaussian Random Fields

Author: Yifei Ma, Roman Garnett, Jeff Schneider

Abstract: A common classifier for unlabeled nodes on undirected graphs uses label propagation from the labeled nodes, equivalent to the harmonic predictor on Gaussian random fields (GRFs). For active learning on GRFs, the commonly used V-optimality criterion queries nodes that reduce the L2 (regression) loss. V-optimality satisfies a submodularity property showing that greedy reduction produces a (1 − 1/e) globally optimal solution. However, L2 loss may not characterise the true nature of 0/1 loss in classification problems and thus may not be the best choice for active learning. We consider a new criterion we call Σ-optimality, which queries the node that minimizes the sum of the elements in the predictive covariance. Σ-optimality directly optimizes the risk of the surveying problem, which is to determine the proportion of nodes belonging to one class. In this paper we extend submodularity guarantees from V-optimality to Σ-optimality using properties specific to GRFs. We further show that GRFs satisfy the suppressor-free condition in addition to the conditional independence inherited from Markov random fields. We test Σoptimality on real-world graphs with both synthetic and real data and show that it outperforms V-optimality and other related methods on classification. 1

4 0.76334488 14 nips-2013-A Stability-based Validation Procedure for Differentially Private Machine Learning

Author: Kamalika Chaudhuri, Staal A. Vinterbo

Abstract: Differential privacy is a cryptographically motivated definition of privacy which has gained considerable attention in the algorithms, machine-learning and datamining communities. While there has been an explosion of work on differentially private machine learning algorithms, a major barrier to achieving end-to-end differential privacy in practical machine learning applications is the lack of an effective procedure for differentially private parameter tuning, or, determining the parameter value, such as a bin size in a histogram, or a regularization parameter, that is suitable for a particular application. In this paper, we introduce a generic validation procedure for differentially private machine learning algorithms that apply when a certain stability condition holds on the training algorithm and the validation performance metric. The training data size and the privacy budget used for training in our procedure is independent of the number of parameter values searched over. We apply our generic procedure to two fundamental tasks in statistics and machine-learning – training a regularized linear classifier and building a histogram density estimator that result in end-toend differentially private solutions for these problems. 1

5 0.75781524 115 nips-2013-Factorized Asymptotic Bayesian Inference for Latent Feature Models

Author: Kohei Hayashi, Ryohei Fujimaki

Abstract: This paper extends factorized asymptotic Bayesian (FAB) inference for latent feature models (LFMs). FAB inference has not been applicable to models, including LFMs, without a specific condition on the Hessian matrix of a complete loglikelihood, which is required to derive a “factorized information criterion” (FIC). Our asymptotic analysis of the Hessian matrix of LFMs shows that FIC of LFMs has the same form as those of mixture models. FAB/LFMs have several desirable properties (e.g., automatic hidden states selection and parameter identifiability) and empirically perform better than state-of-the-art Indian Buffet processes in terms of model selection, prediction, and computational efficiency. 1

6 0.69060177 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries

7 0.67719615 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

8 0.67499679 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings

9 0.67119372 249 nips-2013-Polar Operators for Structured Sparse Estimation

10 0.67068237 149 nips-2013-Latent Structured Active Learning

11 0.6684615 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting

12 0.66483635 184 nips-2013-Marginals-to-Models Reducibility

13 0.66461241 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes

14 0.6641925 63 nips-2013-Cluster Trees on Manifolds

15 0.6633206 357 nips-2013-k-Prototype Learning for 3D Rigid Structures

16 0.66321534 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems

17 0.66169274 319 nips-2013-Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints

18 0.66102862 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

19 0.66100395 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies

20 0.66074109 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion