nips nips2013 nips2013-278 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Xiaoxiao Guo, Satinder Singh, Richard L. Lewis
Abstract: We consider how to transfer knowledge from previous tasks (MDPs) to a current task in long-lived and bounded agents that must solve a sequence of tasks over a finite lifetime. A novel aspect of our transfer approach is that we reuse reward functions. While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent’s behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent’s performance relative to other approaches, including an approach that transfers policies.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider how to transfer knowledge from previous tasks (MDPs) to a current task in long-lived and bounded agents that must solve a sequence of tasks over a finite lifetime. [sent-6, score-0.54]
2 A novel aspect of our transfer approach is that we reuse reward functions. [sent-7, score-0.969]
3 While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent’s behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. [sent-8, score-1.512]
4 Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. [sent-9, score-3.224]
5 Other transfer approaches have considered parameter transfer [19], selective reuse of sample trajectories from previous tasks [7], as well as reuse of learned abstract representations such as options [12, 6]. [sent-15, score-0.612]
6 A novel aspect of our transfer approach in long-lived agents is that we will reuse reward functions. [sent-16, score-1.094]
7 At first blush, it may seem odd to consider using a reward function different from the one specifying the current task in the sequence (indeed, in most RL research rewards are considered an immutable part of the task description). [sent-17, score-0.926]
8 But there is now considerable work on designing good reward functions, including reward-shaping [10], inverse RL [11], optimal rewards [13] and preference-elicitation [3]. [sent-18, score-0.765]
9 Specifically, we build on the approach of Sorg et al. [14], which learns good guidance reward functions incrementally in a single-task setting. [sent-22, score-0.829]
10 In the grid world domain only the task-specifying reward function changes with tasks, while in the networking domain both the reward function and the state transition function change with tasks. [sent-25, score-1.587]
11 A task in such a CMP is defined via a reward function R that maps state-action pairs to scalar values. [sent-27, score-0.78]
12 The objective of the agent in a task is to execute the optimal policy, i.e. [sent-28, score-0.57]
13 to choose actions in such a way as to optimize utility, defined as the expected value of cumulative reward over some lifetime. [sent-30, score-0.758]
14 A CMP and reward function together define a Markov decision process or MDP; hence tasks in this paper are MDPs. [sent-31, score-0.776]
15 It simulates a number of trajectories from the current state up to some maximum depth, choosing actions at each point based on the sum of an estimated action-value that encourages exploitation and a reward bonus that encourages exploration. [sent-34, score-0.867]
16 We use UCT in this paper because it is one of the state-of-the-art algorithms in RL planning and because a good optimal-reward-finding algorithm exists for it [14]. [sent-36, score-0.795]
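As a concrete illustration of the action-selection rule just described (an estimated action-value plus an exploration bonus), here is a minimal UCB1-style selection step of the kind used inside UCT rollouts. It is a generic sketch, not the authors' implementation; the dictionary arguments, the exploration constant c, and the tie-handling for untried actions are illustrative assumptions.

```python
import math
import random

def uct_select_action(counts, values, c=1.0):
    """UCB1-style selection inside a UCT rollout: exploit the current
    action-value estimate and add an exploration bonus that shrinks as an
    action is simulated more often.

    counts: dict mapping action -> number of simulations through this node
    values: dict mapping action -> current action-value estimate
    c: exploration constant (illustrative choice)
    """
    untried = [a for a, n in counts.items() if n == 0]
    if untried:
        # Simulate every action at least once before trusting the bonus term.
        return random.choice(untried)
    total = sum(counts.values())
    return max(counts, key=lambda a: values[a]
               + c * math.sqrt(math.log(total) / counts[a]))

# Example: action "b" has a lower estimate but far fewer simulations,
# so the exploration bonus can still select it.
print(uct_select_action({"a": 20, "b": 2}, {"a": 0.5, "b": 0.4}, c=1.0))
```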
17 In almost all of RL research, the reward function is considered part of the task specification and thus unchangeable. [sent-38, score-0.78]
18 The optimal rewards problem (ORP) [13] stems from the observation that a reward function plays two roles simultaneously in RL problems. [sent-40, score-0.706]
19 The first role is that of evaluation in that the task-specifying reward function is used by the agent designer to evaluate the actual behavior of the agent. [sent-41, score-1.149]
20 The second is that of guidance, in that the reward function is also used by the RL algorithm implemented by the agent to determine its behavior. [sent-42, score-1.178]
21 The optimal rewards problem separates these two roles into two separate reward functions, the task-specifying objective reward function used to evaluate performance, and an internal reward function used to guide agent behavior. [sent-45, score-2.837]
22 The optimal internal reward function will depend on the agent A’s architecture and its limitations, and this distinguishes ORP from other reward-design approaches such as inverse-RL. [sent-47, score-1.311]
23 When would the optimal internal reward function be different from the objective reward function? [sent-48, score-1.679]
24 If an agent is unbounded in its capabilities with respect to the CMP then the objective reward function is always an optimal internal reward function. [sent-49, score-2.088]
25 More crucially though, in the realistic setting of bounded agents, optimal internal reward functions may be quite different from objective reward functions. [sent-50, score-1.705]
26 Sorg et al. [14] provide many examples and some theory of when a good choice of internal reward can mitigate agent bounds, including bounds corresponding to limited lifetime to learn [13], limited memory [14], and limited resources for planning (the specific bound of interest in this paper). [sent-53, score-1.351]
27 We use Sorg et al.’s [14, 15] policy gradient reward design (PGRD) method, which is based on the insight that any planning algorithm can be viewed as procedurally translating the internal reward function R^i into behavior; that is, the parameters of R^i are indirect parameters of the agent’s policy. [sent-58, score-1.722]
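The PGRD idea, then, is that the internal reward parameters are indirect policy parameters, so their gradient with respect to the objective return can be estimated online and ascended. The sketch below shows only a generic eligibility-trace ascent step; computing grad_log_pi by differentiating the planner's output policy with respect to the internal reward parameters is the core of PGRD and is assumed here to be supplied by the caller.

```python
import numpy as np

def pgrd_style_step(theta_i, grad_log_pi, objective_reward, trace,
                    step_size=0.01, trace_decay=0.9):
    """One eligibility-trace policy-gradient step on the internal reward
    parameters theta_i (treated as indirect policy parameters).

    grad_log_pi: gradient of log pi(a_t | s_t; theta_i) w.r.t. theta_i,
                 obtained by differentiating through the planner (not shown).
    objective_reward: the task-specifying reward received at this step.
    trace: the running eligibility trace (same shape as theta_i).
    """
    trace = trace_decay * trace + np.asarray(grad_log_pi)
    theta_i = np.asarray(theta_i) + step_size * objective_reward * trace
    return theta_i, trace

# Example with a 3-dimensional internal reward parameterization.
theta_i, trace = np.zeros(3), np.zeros(3)
theta_i, trace = pgrd_style_step(theta_i, np.array([0.1, -0.2, 0.0]), 1.0, trace)
```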
28 The sequence of objective reward parameters and task durations for n tasks are shown in the environment portion of each figure. [sent-63, score-0.982]
29 The critic-agent translates the objective reward parameters θ^o into the internal reward parameters θ^i. [sent-65, score-1.663]
30 We will consider the case where objective rewards are linear functions of objective reward features. [sent-69, score-0.917]
31 Formally, the j-th task is defined by objective reward function R^o_j(s, a) = θ^o_j · ψ^o(s, a), where θ^o_j is the parameter vector for the j-th task, ψ^o are the task-independent objective reward features of state and action, and ‘·’ denotes the inner product. [sent-70, score-1.659]
32 Given some agent A, the expected objective utility achieved for a particular task sequence {θ^o_j, t_j}_{j=1}^K is E_{h ∼ ⟨A, M⟩}[ Σ_{j=1}^K U^{θ^o_j}(h_j) ], where for ease of exposition we denote the history during task j simply as h_j. [sent-73, score-0.69]
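A minimal sketch of the linear objective reward and the cumulative utility it induces over a task's history; the psi_o features, the feature dimension, and the undiscounted sum are illustrative assumptions consistent with the definitions above.

```python
import numpy as np

def objective_reward(theta_o, psi_o, s, a):
    """Linear objective reward R^o_j(s, a) = theta_o . psi_o(s, a)."""
    return float(np.dot(theta_o, psi_o(s, a)))

def task_utility(theta_o, psi_o, history):
    """Objective utility U(h_j): cumulative objective reward over the
    state-action history h_j of task j (undiscounted, finite lifetime)."""
    return sum(objective_reward(theta_o, psi_o, s, a) for s, a in history)

# Example with two hand-made features of (s, a).
psi_o = lambda s, a: np.array([1.0 if s == "at_food" else 0.0,
                               1.0 if s == "shelter_broken" else 0.0])
theta_o = np.array([0.8, -0.5])   # eat-bonus / broken-cost style parameters
print(task_utility(theta_o, psi_o, [("at_food", "eat"), ("shelter_broken", "move")]))
```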
33 In some transfer or other long-lived agent research, the emphasis is on learning in that the agent is assumed to lack complete knowledge of the CMP and the task specifications. [sent-75, score-1.1]
34 Our emphasis here is on planning in that the agent is assumed to know the CMP perfectly as well as the task specifications as they change. [sent-76, score-0.539]
35 If the agent were unbounded in planning capacity, there would be nothing interesting left to consider because the agent could simply find the optimal policy for each new task and execute it. [sent-77, score-1.038]
36 Note that basic UCT does use a reward function but does not use an initial value function or policy and hence changing a reward function is a natural and consequential way to influence UCT. [sent-80, score-1.514]
37 In addition, in our setting a model of the CMP is available to the agent and so there is no scope for transfer by reuse of model knowledge. [sent-82, score-0.672]
38 Thus, our reuse of reward functions may well be the most consequential option available in UCT. [sent-83, score-0.815]
39 Next we discuss four different agent architectures, represented graphically in Figure 1, starting with a conventional agent that ignores both the potential of transfer and that of ORP, followed by three different agents that do exploit them, to varying degrees. [sent-84, score-1.259]
40 Figure 1(a) shows the baseline conventional UCT-based agent that ignores the possibility of transfer and treats each task separately. [sent-86, score-0.766]
41 It also ignores ORP and treats each task’s objective reward as the internal reward for UCT planning during that task. [sent-87, score-1.739]
42 The remaining three agents will all consider the ORP, and share the following details: the space of internal reward functions R is the space of all linear functions of the internal reward features ψ^i(s, a), i.e., functions of the form θ^i · ψ^i(s, a). [sent-88, score-1.963]
43 Note that the internal reward features ψ^i and the objective reward features ψ^o do not have to be identical. [sent-91, score-1.691]
44 Figure 1(b) shows the non-transfer agent that ignores the possibility of transfer but exploits ORP. [sent-93, score-0.637]
45 It initializes the internal reward function to the objective reward function of each new task as it starts and then uses PGRD to adapt the internal reward function while acting in that task. [sent-94, score-2.623]
46 This agent was designed to help separate the contributions of ORP and transfer to performance gains. [sent-96, score-0.617]
47 It exploits both transfer and ORP via incrementally learning a reward mapping function. [sent-99, score-1.044]
48 A reward mapping function f maps objective reward function parameters to internal reward function parameters: ∀j, θ^i_j = f(θ^o_j). [sent-100, score-2.285]
49 The reward mapping function is used to initialize the internal reward function at the beginning of each new task. [sent-101, score-1.713]
50 PGRD is used to continually adapt the initialized internal reward function throughout each task. [sent-102, score-0.903]
51 The reward mapping function is incrementally trained as follows: when task j ends, the objective reward function parameters θ^o_j and the adapted internal reward function parameters θ̂^i_j are used as an input-output pair to update the reward mapping function. [sent-103, score-3.375]
52 In our work, we use nonparametric kernel-regression to learn the reward mapping function. [sent-104, score-0.802]
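The text states that the reward mapping function is learned by nonparametric kernel regression on (objective-parameter, adapted internal-parameter) pairs. Below is a minimal Nadaraya-Watson sketch of that idea; the Gaussian kernel, the bandwidth, and the zero-vector fallback before any task has finished are illustrative assumptions, not the authors' settings.

```python
import numpy as np

class RewardMapping:
    """Nadaraya-Watson kernel regression from objective reward parameters
    theta_o to initial internal reward parameters theta_i."""

    def __init__(self, theta_i_dim, bandwidth=0.25):
        self.theta_i_dim = theta_i_dim
        self.bandwidth = bandwidth
        self.inputs, self.outputs = [], []

    def update(self, theta_o, theta_i_adapted):
        """Store the (theta_o, adapted theta_i) pair produced when a task ends."""
        self.inputs.append(np.asarray(theta_o, dtype=float))
        self.outputs.append(np.asarray(theta_i_adapted, dtype=float))

    def __call__(self, theta_o):
        """Predict initial internal reward parameters for a new task."""
        if not self.inputs:
            # No training pairs yet: zero bonus parameters (assumption).
            return np.zeros(self.theta_i_dim)
        x = np.asarray(theta_o, dtype=float)
        d2 = np.array([np.sum((x - xi) ** 2) for xi in self.inputs])
        w = np.exp(-d2 / (2.0 * self.bandwidth ** 2))
        w /= w.sum()
        return np.sum(w[:, None] * np.asarray(self.outputs), axis=0)

# Example: one finished task with eat-bonus / broken-cost parameters.
f = RewardMapping(theta_i_dim=1)
f.update(theta_o=[0.9, -0.1], theta_i_adapted=[0.4])
print(f([0.8, -0.2]))   # -> roughly [0.4]
```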
53 Pseudocode for a general reward mapping agent is presented in Algorithm 1. [sent-105, score-1.211]
54 However, it does not use a reward mapping function but instead continually updates the internal reward function across task boundaries using PGRD. [sent-109, score-1.779]
55 The internal reward function at the end of a task becomes the initial internal reward function at the start of the next task achieving a simple form of sequential transfer. [sent-110, score-1.95]
56 Algorithm 1: General pseudocode for the Reward Mapping Agent (Figure 1(c)). 1: Input: {θ^o_j, t_j}_{j=1}^k, where j is the task indicator, t_j is the task duration, and θ^o_j are the objective reward function parameters specifying task j. [sent-114, score-1.041]
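Algorithm 1 is only referenced at a high level in the extracted text, so the following loop is a sketch of how the pieces described above fit together: initialize the internal reward from the mapping, act and adapt with PGRD for the task's duration, then train the mapping on the resulting pair. The pgrd_adapt interface is a placeholder assumption, not the paper's interface.

```python
def reward_mapping_agent(task_sequence, reward_mapping, pgrd_adapt):
    """Sketch of Algorithm 1 (Reward Mapping Agent, Figure 1(c)).

    task_sequence : iterable of (theta_o_j, t_j) pairs, the objective reward
                    parameters and duration of each task j.
    reward_mapping: callable theta_o -> initial theta_i, with an update()
                    method taking (theta_o, adapted theta_i) pairs.
    pgrd_adapt    : placeholder that runs UCT + PGRD for t_j steps and returns
                    (adapted theta_i, objective return obtained).
    """
    total_return = 0.0
    for theta_o, t_j in task_sequence:
        theta_i = reward_mapping(theta_o)                     # initialize internal reward
        theta_i_hat, ret = pgrd_adapt(theta_o, theta_i, t_j)  # act and adapt within the task
        reward_mapping.update(theta_o, theta_i_hat)           # train mapping when task ends
        total_return += ret
    return total_return
```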
57 Food appears at the rightmost column of one of the three corridors and can be eaten by the agent when the agent is at the same location as the food. [sent-128, score-0.852]
58 The agent eats the food and repairs the shelter automatically whenever it is collocated with the food or the shelter, respectively. [sent-132, score-0.805]
59 At each time step, the agent receives a positive reward of e (the eat-bonus) for eating food and a negative reward of b (the broken-cost) if the shelter is broken. [sent-138, score-2.043]
60 Thus, the objective reward function’s parameters are θ^o_j = (e_j, b_j), where e_j ∈ [0, 1] and b_j ∈ [−1, 0]. [sent-139, score-0.839]
61 If (e_j, b_j) = (0, −1), the agent should remain at the shelter’s location in order to repair the shelter as it breaks. [sent-142, score-0.571]
62 The internal reward function is R^i_j(s) = R^o_j(s) + θ^i_j ψ^i(s), where R^o_j(s) is the objective reward function and ψ^i(s) = 1 − 1/n_l(s) is the inverse-recency feature. [sent-144, score-1.663]
63 (Middle and Right) Comparing performance while accounting for computational overhead of learning and using the reward mapping function. [sent-167, score-0.825]
64 Since there is exactly one internal reward parameter, θ^i_j is a scalar. [sent-170, score-0.886]
65 A positive θ^i_j encourages the agent to visit locations not visited recently, and a negative θ^i_j encourages the agent to visit locations visited recently. [sent-171, score-0.908]
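A sketch of the single-feature internal reward just described; the exact form of the inverse-recency feature is reconstructed from the garbled extraction (the name n_l(s), the never-visited default, and the 1 − 1/n form are assumptions), so treat it as illustrative.

```python
def inverse_recency_feature(location, last_visit, t):
    """Approaches 1 for locations not visited recently and 0 for locations
    visited on the previous step (reconstructed form, see lead-in)."""
    steps_since = max(1, t - last_visit.get(location, -10**9))
    return 1.0 - 1.0 / steps_since

def grid_internal_reward(r_objective, theta_i, location, last_visit, t):
    """R^i_j(s) = R^o_j(s) + theta_i * psi_i(s): a positive theta_i rewards
    exploring rarely visited locations, a negative theta_i rewards staying
    near recently visited ones (e.g. remaining at the shelter)."""
    return r_objective + theta_i * inverse_recency_feature(location, last_visit, t)

# Example: a location last visited 20 steps ago gets nearly the full bonus.
print(grid_internal_reward(0.0, 0.3, (2, 5), {(2, 5): 80}, t=100))
```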
66 100 sequences of 200 tasks were generated, with Poisson distributions for task durations, and with objective reward function parameters sampled uniformly from their ranges. [sent-173, score-0.903]
67 The agents used UCT with depth 2 and 500 trajectories; the conventional agent is thereby bounded as evidenced in its poor performance (see Figure 3). [sent-174, score-0.604]
68 The left panel in Figure 3 shows average objective reward per time step (with standard error bars). [sent-338, score-0.846]
69 For each task duration the reward mapping agent does best and the conventional agent does the worst. [sent-340, score-1.803]
70 These results demonstrate that transfer helps performance and that transfer via the new reward mapping approach can substantially improve a bounded long-lived agent’s performance relative to transfer via the competing method of sequential transfer. [sent-341, score-1.485]
71 This is expected because the longer the task duration the more time PGRD has to adapt to the task, and thus the less the better initialization provided by the reward mapping function matters. [sent-343, score-0.93]
72 In addition, the sequential transfer agent does better than the non-transfer agent for the shortest task duration of 50, [sent-377, score-0.647]
73 while the non-transfer agent does better for the longer task durations, since the sequential transfer agent must spend time correcting the internal reward function carried over from the previous task. [sent-388, score-0.537]
74 Figure 4 (bottom) shows the reward mapping function learned by the Reward Mapping agent after 50 tasks. [sent-422, score-1.295]
75 These effects are exacerbated by longer task durations because the agent then has longer to adapt its internal reward function to each task. [sent-424, score-1.4]
76 In general, as task duration increases, the non-transfer agent improves but the sequential transfer agent worsens. [sent-425, score-1.184]
77 The above results ignore the computational overhead incurred by learning and using the reward mapping function. [sent-427, score-0.825]
78 The two rightmost plots in the bottom row of Figure 3 show the average objective reward per time step as a function of milliseconds per decision for the four agent architectures, for a range of UCT planning depths. [sent-428, score-1.278]
79 This can be seen by observing that the highest dot at any vertical column on the x-axis belongs to the reward mapping agent. [sent-436, score-0.802]
80 Thus, the overhead of the reward mapping function in the reward mapping agent is insignificant relative to the computational cost of UCT (this last cost is all that the conventional agent incurs). [sent-437, score-2.5]
81 Using a fixed set of tasks (as described above) with mean duration of 500, we estimated the optimal internal reward parameter (the coefficient of the inverse-recency feature) for UCT by a brute-force grid search. [sent-439, score-1.008]
82 The optimal internal reward parameter is visualized as a function of the two parameters of the objective reward function (broken cost and eat bonus) in Figure 4, top. [sent-440, score-1.716]
83 As would be expected the top right corner (high penalty for broken shelter and low reward for eating) discourages exploration while the bottom left corner (high reward for eating and low cost for broken shelter) encourages exploration. [sent-442, score-1.674]
84 Figure 4, bottom, visualizes the learned reward mapping function after training on 50 tasks. [sent-443, score-0.802]
85 The weights of these three features are the parameters of the objective reward function. [sent-472, score-0.791]
86 Different choices of these weights correspond to different objective reward functions. [sent-474, score-0.777]
87 The internal reward function for the agent at node k is R^i_{j,k}(s, a) = R^o_j(s, a) + θ^i_{j,k} · ψ^i_k(s, a), where R^o_j(s, a) is the objective reward function and ψ^i_k(s, a) is a binary feature vector with one binary feature for each (packet destination, action) pair. [sent-476, score-2.096]
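A sketch of this per-node internal reward in the routing domain, with one indicator feature per (packet destination, action) pair; the list-based indexing and the argument names are illustrative assumptions.

```python
import numpy as np

def dest_action_features(destination, action, destinations, actions):
    """Binary feature vector psi_k(s, a): a single 1 at the index of the
    (packet destination, action) pair, zeros elsewhere."""
    psi = np.zeros(len(destinations) * len(actions))
    psi[destinations.index(destination) * len(actions) + actions.index(action)] = 1.0
    return psi

def node_internal_reward(r_objective, theta_i_k, destination, action,
                         destinations, actions):
    """R^i_{j,k}(s, a) = R^o_j(s, a) + theta_i_k . psi_k(s, a) for node k."""
    psi = dest_action_features(destination, action, destinations, actions)
    return r_objective + float(np.dot(theta_i_k, psi))

# Example: 3 destinations x 2 outgoing links at this node.
dests, acts = ["n1", "n2", "n3"], ["link_a", "link_b"]
theta = np.zeros(6); theta[2] = 0.5   # bonus for sending n2-packets on link_a
print(node_internal_reward(-0.1, theta, "n2", "link_a", dests, acts))
```

Because every (destination, action) pair gets its own weight, such features can express arbitrary per-destination routing preferences, which is the representational point made in the next sentence.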
88 The internal reward features are capable of representing arbitrary policies (and thus we also implemented classical policy gradient with these features using OLPOMDP [2] but found it to be far slower than the use of PGRD with UCT and hence don’t present those results here). [sent-478, score-1.015]
89 The parameters describing the transition function are concatenated with the parameters defining the objective reward function and used as input to the reward mapping function (whose output remains the initial internal reward function). [sent-480, score-2.521]
90 The competing policy transfer agent from [9] reuses policy knowledge across tasks based on a model-based average-reward RL algorithm. [sent-482, score-0.846]
91 Their policy selection criterion was designed for the case when only the linear reward parameters change. [sent-484, score-0.78]
92 However, in our experiments, tasks could differ in three different ways: (1) only reward functions change, (2) only transition functions change, and (3) both reward functions and transition functions change. [sent-485, score-1.694]
93 (Left) tasks differ in objective reward functions (R) only. [sent-498, score-0.869]
94 (Right) tasks differ in both objective reward and transition (R and T) functions. [sent-500, score-0.899]
95 Three sets of 100 task sequences were generated: one in which the tasks differed in objective reward function only, another in which they differed in state transition function only, and a third in which they differed in both. [sent-503, score-1.045]
96 Figure 5 compares the average objective reward per time step for all four agents defined above as well as the competing policy transfer agent on the three sets. [sent-504, score-1.638]
97 In all cases, the reward-mapping agent works best and the conventional agent worst. [sent-505, score-0.873]
98 The competing policy transfer agent is second best when only the reward-function changes—just the setting for which it was designed. [sent-506, score-0.72]
99 We demonstrated that our reward mapping approach can outperform alternate approaches; current and future work is focused on greater theoretical understanding of the general conditions under which this is true. [sent-509, score-0.817]
100 Policy invariance under reward transformations: Theory and application to reward shaping. [sent-545, score-1.412]
wordName wordTfidf (topN-words)
[('reward', 0.706), ('agent', 0.409), ('transfer', 0.208), ('internal', 0.18), ('orp', 0.18), ('packet', 0.137), ('uct', 0.137), ('shelter', 0.127), ('agents', 0.125), ('cmp', 0.106), ('mapping', 0.096), ('pgrd', 0.085), ('routing', 0.077), ('task', 0.074), ('policy', 0.074), ('objective', 0.071), ('food', 0.071), ('guidance', 0.063), ('destination', 0.06), ('reinforcement', 0.056), ('transition', 0.056), ('planning', 0.056), ('conventional', 0.055), ('reuse', 0.055), ('duration', 0.054), ('actoragent', 0.053), ('sorg', 0.053), ('tasks', 0.052), ('queue', 0.046), ('rl', 0.044), ('packets', 0.043), ('rewards', 0.043), ('broken', 0.042), ('ri', 0.039), ('satinder', 0.037), ('eat', 0.037), ('networking', 0.034), ('rj', 0.034), ('domain', 0.034), ('environment', 0.034), ('incrementally', 0.034), ('bonus', 0.032), ('durations', 0.031), ('action', 0.031), ('sequential', 0.03), ('competing', 0.029), ('consequential', 0.028), ('singh', 0.028), ('utility', 0.027), ('encourages', 0.027), ('policies', 0.027), ('functions', 0.026), ('milliseconds', 0.026), ('initialize', 0.025), ('actions', 0.025), ('node', 0.024), ('eating', 0.024), ('nodes', 0.024), ('overhead', 0.023), ('differed', 0.023), ('peter', 0.023), ('tn', 0.022), ('ej', 0.022), ('maze', 0.021), ('tj', 0.021), ('consumption', 0.02), ('ignores', 0.02), ('bj', 0.02), ('jonathan', 0.019), ('andrew', 0.019), ('eaten', 0.019), ('designer', 0.019), ('michigan', 0.018), ('decision', 0.018), ('trajectories', 0.018), ('st', 0.018), ('visit', 0.018), ('queues', 0.017), ('ro', 0.017), ('konidaris', 0.017), ('continually', 0.017), ('architectures', 0.017), ('delay', 0.017), ('richard', 0.017), ('state', 0.017), ('four', 0.016), ('optimal', 0.016), ('options', 0.016), ('location', 0.015), ('depth', 0.015), ('current', 0.015), ('eh', 0.015), ('matthew', 0.015), ('autonomous', 0.015), ('evaluation', 0.015), ('guiding', 0.015), ('seventeenth', 0.015), ('sequence', 0.014), ('features', 0.014), ('differ', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
Author: Xiaoxiao Guo, Satinder Singh, Richard L. Lewis
Abstract: We consider how to transfer knowledge from previous tasks (MDPs) to a current task in long-lived and bounded agents that must solve a sequence of tasks over a finite lifetime. A novel aspect of our transfer approach is that we reuse reward functions. While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent’s behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent’s performance relative to other approaches, including an approach that transfers policies. 1
2 0.30647078 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
Author: Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles Isbell, Andrea L. Thomaz
Abstract: A long term goal of Interactive Reinforcement Learning is to incorporate nonexpert human feedback to solve complex tasks. Some state-of-the-art methods have approached this problem by mapping human information to rewards and values and iterating over them to compute better control policies. In this paper we argue for an alternate, more effective characterization of human feedback: Policy Shaping. We introduce Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct policy labels. We compare Advise to state-of-the-art approaches and show that it can outperform them and is robust to infrequent and inconsistent human feedback.
3 0.24352929 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
Author: Prashanth L.A., Mohammad Ghavamzadeh
Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application. 1
4 0.23271704 129 nips-2013-Generalized Random Utility Models with Multiple Types
Author: Hossein Azari Soufiani, Hansheng Diao, Zhenyu Lai, David C. Parkes
Abstract: We propose a model for demand estimation in multi-agent, differentiated product settings and present an estimation algorithm that uses reversible jump MCMC techniques to classify agents’ types. Our model extends the popular setup in Berry, Levinsohn and Pakes (1995) to allow for the data-driven classification of agents’ types using agent-level data. We focus on applications involving data on agents’ ranking over alternatives, and present theoretical conditions that establish the identifiability of the model and uni-modality of the likelihood/posterior. Results on both real and simulated data provide support for the scalability of our approach. 1
5 0.22454648 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
Author: Shunan Zhang, Angela J. Yu
Abstract: How humans achieve long-term goals in an uncertain environment, via repeated trials and noisy observations, is an important problem in cognitive science. We investigate this behavior in the context of a multi-armed bandit task. We compare human behavior to a variety of models that vary in their representational and computational complexity. Our result shows that subjects’ choices, on a trial-totrial basis, are best captured by a “forgetful” Bayesian iterative learning model [21] in combination with a partially myopic decision policy known as Knowledge Gradient [7]. This model accounts for subjects’ trial-by-trial choice better than a number of other previously proposed models, including optimal Bayesian learning and risk minimization, ε-greedy and win-stay-lose-shift. It has the added benefit of being closest in performance to the optimal Bayesian model than all the other heuristic models that have the same computational complexity (all are significantly less complex than the optimal model). These results constitute an advancement in the theoretical understanding of how humans negotiate the tension between exploration and exploitation in a noisy, imperfectly known environment. 1
7 0.21046102 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
8 0.20709695 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
9 0.19178092 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs
10 0.18397853 7 nips-2013-A Gang of Bandits
11 0.1635557 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
12 0.15877421 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
13 0.15611775 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
14 0.1490777 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
15 0.13972318 16 nips-2013-A message-passing algorithm for multi-agent trajectory planning
16 0.13716681 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
17 0.13641123 347 nips-2013-Variational Planning for Graph-based MDPs
18 0.12865008 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
19 0.12365501 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
20 0.098875836 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks
topicId topicWeight
[(0, 0.171), (1, -0.276), (2, -0.127), (3, 0.041), (4, 0.007), (5, -0.071), (6, 0.115), (7, -0.075), (8, -0.122), (9, 0.011), (10, 0.026), (11, 0.116), (12, 0.05), (13, -0.054), (14, -0.213), (15, 0.282), (16, 0.048), (17, 0.203), (18, 0.168), (19, -0.082), (20, -0.079), (21, -0.001), (22, -0.001), (23, -0.031), (24, -0.047), (25, 0.03), (26, -0.003), (27, 0.015), (28, 0.021), (29, 0.033), (30, 0.067), (31, 0.011), (32, -0.072), (33, 0.047), (34, -0.052), (35, -0.032), (36, 0.05), (37, -0.018), (38, -0.08), (39, 0.039), (40, 0.085), (41, -0.025), (42, 0.09), (43, 0.114), (44, -0.04), (45, 0.034), (46, -0.065), (47, 0.106), (48, 0.112), (49, -0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.98681533 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
Author: Xiaoxiao Guo, Satinder Singh, Richard L. Lewis
Abstract: We consider how to transfer knowledge from previous tasks (MDPs) to a current task in long-lived and bounded agents that must solve a sequence of tasks over a finite lifetime. A novel aspect of our transfer approach is that we reuse reward functions. While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent’s behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent’s performance relative to other approaches, including an approach that transfers policies. 1
2 0.78432691 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
Author: Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles Isbell, Andrea L. Thomaz
Abstract: A long term goal of Interactive Reinforcement Learning is to incorporate nonexpert human feedback to solve complex tasks. Some state-of-the-art methods have approached this problem by mapping human information to rewards and values and iterating over them to compute better control policies. In this paper we argue for an alternate, more effective characterization of human feedback: Policy Shaping. We introduce Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct policy labels. We compare Advise to state-of-the-art approaches and show that it can outperform them and is robust to infrequent and inconsistent human feedback.
3 0.66997069 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs
Author: Liam C. MacDermed, Charles Isbell
Abstract: We present four major results towards solving decentralized partially observable Markov decision problems (DecPOMDPs) culminating in an algorithm that outperforms all existing algorithms on all but one standard infinite-horizon benchmark problems. (1) We give an integer program that solves collaborative Bayesian games (CBGs). The program is notable because its linear relaxation is very often integral. (2) We show that a DecPOMDP with bounded belief can be converted to a POMDP (albeit with actions exponential in the number of beliefs). These actions correspond to strategies of a CBG. (3) We present a method to transform any DecPOMDP into a DecPOMDP with bounded beliefs (the number of beliefs is a free parameter) using optimal (not lossless) belief compression. (4) We show that the combination of these results opens the door for new classes of DecPOMDP algorithms based on previous POMDP algorithms. We choose one such algorithm, point-based valued iteration, and modify it to produce the first tractable value iteration method for DecPOMDPs that outperforms existing algorithms. 1
4 0.58131355 129 nips-2013-Generalized Random Utility Models with Multiple Types
Author: Hossein Azari Soufiani, Hansheng Diao, Zhenyu Lai, David C. Parkes
Abstract: We propose a model for demand estimation in multi-agent, differentiated product settings and present an estimation algorithm that uses reversible jump MCMC techniques to classify agents’ types. Our model extends the popular setup in Berry, Levinsohn and Pakes (1995) to allow for the data-driven classification of agents’ types using agent-level data. We focus on applications involving data on agents’ ranking over alternatives, and present theoretical conditions that establish the identifiability of the model and uni-modality of the likelihood/posterior. Results on both real and simulated data provide support for the scalability of our approach. 1
5 0.56758082 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
Author: Shunan Zhang, Angela J. Yu
Abstract: How humans achieve long-term goals in an uncertain environment, via repeated trials and noisy observations, is an important problem in cognitive science. We investigate this behavior in the context of a multi-armed bandit task. We compare human behavior to a variety of models that vary in their representational and computational complexity. Our result shows that subjects’ choices, on a trial-totrial basis, are best captured by a “forgetful” Bayesian iterative learning model [21] in combination with a partially myopic decision policy known as Knowledge Gradient [7]. This model accounts for subjects’ trial-by-trial choice better than a number of other previously proposed models, including optimal Bayesian learning and risk minimization, ε-greedy and win-stay-lose-shift. It has the added benefit of being closest in performance to the optimal Bayesian model than all the other heuristic models that have the same computational complexity (all are significantly less complex than the optimal model). These results constitute an advancement in the theoretical understanding of how humans negotiate the tension between exploration and exploitation in a noisy, imperfectly known environment. 1
6 0.55898863 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
7 0.52637184 71 nips-2013-Convergence of Monte Carlo Tree Search in Simultaneous Move Games
8 0.50681585 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
9 0.50457484 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
10 0.49973065 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
11 0.48525837 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
12 0.48465949 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
13 0.47715169 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
14 0.46369201 7 nips-2013-A Gang of Bandits
15 0.45643049 16 nips-2013-A message-passing algorithm for multi-agent trajectory planning
16 0.39497092 347 nips-2013-Variational Planning for Graph-based MDPs
17 0.39320245 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks
18 0.39090148 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
19 0.38780272 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
20 0.38454339 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes
topicId topicWeight
[(16, 0.031), (33, 0.085), (34, 0.16), (41, 0.03), (49, 0.039), (56, 0.147), (70, 0.043), (74, 0.167), (85, 0.059), (88, 0.018), (89, 0.033), (93, 0.057), (95, 0.015)]
simIndex simValue paperId paperTitle
1 0.88366455 133 nips-2013-Global Solver and Its Efficient Approximation for Variational Bayesian Low-rank Subspace Clustering
Author: Shinichi Nakajima, Akiko Takeda, S. Derin Babacan, Masashi Sugiyama, Ichiro Takeuchi
Abstract: When a probabilistic model and its prior are given, Bayesian learning offers inference with automatic parameter tuning. However, Bayesian learning is often obstructed by computational difficulty: the rigorous Bayesian learning is intractable in many models, and its variational Bayesian (VB) approximation is prone to suffer from local minima. In this paper, we overcome this difficulty for low-rank subspace clustering (LRSC) by providing an exact global solver and its efficient approximation. LRSC extracts a low-dimensional structure of data by embedding samples into the union of low-dimensional subspaces, and its variational Bayesian variant has shown good performance. We first prove a key property that the VBLRSC model is highly redundant. Thanks to this property, the optimization problem of VB-LRSC can be separated into small subproblems, each of which has only a small number of unknown variables. Our exact global solver relies on another key property that the stationary condition of each subproblem consists of a set of polynomial equations, which is solvable with the homotopy method. For further computational efficiency, we also propose an efficient approximate variant, of which the stationary condition can be written as a polynomial equation with a single variable. Experimental results show the usefulness of our approach. 1
same-paper 2 0.87695211 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
Author: Xiaoxiao Guo, Satinder Singh, Richard L. Lewis
Abstract: We consider how to transfer knowledge from previous tasks (MDPs) to a current task in long-lived and bounded agents that must solve a sequence of tasks over a finite lifetime. A novel aspect of our transfer approach is that we reuse reward functions. While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent’s behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent’s performance relative to other approaches, including an approach that transfers policies. 1
3 0.84869623 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
Author: Michalis Titsias, Miguel Lazaro-Gredilla
Abstract: We introduce a novel variational method that allows to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. This approach consists of a novel variant of the variational framework that has been recently developed for the Gaussian process latent variable model which additionally makes use of a standardised representation of the Gaussian process. We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods by considering datasets with high-dimensional inputs. 1
4 0.84731627 293 nips-2013-Sign Cauchy Projections and Chi-Square Kernel
Author: Ping Li, Gennady Samorodnitsk, John Hopcroft
Abstract: The method of stable random projections is useful for efficiently approximating the lα distance (0 < α ≤ 2) in high dimension and it is naturally suitable for data streams. In this paper, we propose to use only the signs of the projected data and we analyze the probability of collision (i.e., when the two signs differ). Interestingly, when α = 1 (i.e., Cauchy random projections), we show that the probability of collision can be accurately approximated as functions of the chi-square (χ2 ) similarity. In text and vision applications, the χ2 similarity is a popular measure when the features are generated from histograms (which are a typical example of data streams). Experiments confirm that the proposed method is promising for large-scale learning applications. The full paper is available at arXiv:1308.1009. There are many future research problems. For example, when α → 0, the collision probability is a function of the resemblance (of the binary-quantized data). This provides an effective mechanism for resemblance estimation in data streams. 1
5 0.83877158 336 nips-2013-Translating Embeddings for Modeling Multi-relational Data
Author: Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, Oksana Yakhnenko
Abstract: We consider the problem of embedding entities and relationships of multirelational data in low-dimensional vector spaces. Our objective is to propose a canonical model which is easy to train, contains a reduced number of parameters and can scale up to very large databases. Hence, we propose TransE, a method which models relationships by interpreting them as translations operating on the low-dimensional embeddings of the entities. Despite its simplicity, this assumption proves to be powerful since extensive experiments show that TransE significantly outperforms state-of-the-art methods in link prediction on two knowledge bases. Besides, it can be successfully trained on a large scale data set with 1M entities, 25k relationships and more than 17M training samples. 1
6 0.83553344 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
7 0.80481565 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
8 0.79194772 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
9 0.78458017 347 nips-2013-Variational Planning for Graph-based MDPs
10 0.78431374 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
11 0.78358114 357 nips-2013-k-Prototype Learning for 3D Rigid Structures
12 0.78299159 249 nips-2013-Polar Operators for Structured Sparse Estimation
13 0.78268707 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
14 0.78218877 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
15 0.7821821 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
16 0.77923912 63 nips-2013-Cluster Trees on Manifolds
17 0.77878708 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
18 0.77679181 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries
19 0.77607983 184 nips-2013-Marginals-to-Models Reducibility
20 0.77535135 348 nips-2013-Variational Policy Search via Trajectory Optimization