nips nips2013 nips2013-162 knowledge-graph by maker-knowledge-mining

162 nips-2013-Learning Trajectory Preferences for Manipulators via Iterative Improvement


Source: pdf

Author: Ashesh Jain, Brian Wojcik, Thorsten Joachims, Ashutosh Saxena

Abstract: We consider the problem of learning good trajectories for manipulation tasks. This is challenging because the criterion defining a good trajectory varies with users, tasks and environments. In this paper, we propose a co-active online learning framework for teaching robots the preferences of their users for object manipulation tasks. The key novelty of our approach lies in the type of feedback expected from the user: the human user does not need to demonstrate optimal trajectories as training data, but merely needs to iteratively provide trajectories that slightly improve over the trajectory currently proposed by the system. We argue that this co-active preference feedback can be more easily elicited from the user than demonstrations of optimal trajectories, which are often challenging and non-intuitive to provide on high degrees of freedom manipulators. Nevertheless, theoretical regret bounds of our algorithm match the asymptotic rates of optimal trajectory algorithms. We demonstrate the generalizability of our algorithm on a variety of grocery checkout tasks, for which the preferences were influenced not only by the object being manipulated but also by the surrounding environment.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This is challenging because the criterion defining a good trajectory varies with users, tasks and environments. [sent-5, score-0.399]

2 In this paper, we propose a co-active online learning framework for teaching robots the preferences of their users for object manipulation tasks. [sent-6, score-0.534]

3 The key novelty of our approach lies in the type of feedback expected from the user: the human user does not need to demonstrate optimal trajectories as training data, but merely needs to iteratively provide trajectories that slightly improve over the trajectory currently proposed by the system. [sent-7, score-1.36]

4 We argue that this co-active preference feedback can be more easily elicited from the user than demonstrations of optimal trajectories, which are often challenging and non-intuitive to provide on high degrees of freedom manipulators. [sent-8, score-0.629]

5 Nevertheless, theoretical regret bounds of our algorithm match the asymptotic rates of optimal trajectory algorithms. [sent-9, score-0.375]

6 We demonstrate the generalizability of our algorithm on a variety of grocery checkout tasks, for which the preferences were influenced not only by the object being manipulated but also by the surrounding environment. [sent-10, score-0.657]

7 An appropriate trajectory not only needs to be valid from a geometric standpoint (i. [sent-17, score-0.334]

8 Such user’s preferences over trajectories vary between users, between tasks, and between the environments the trajectory is performed in. [sent-20, score-0.755]

9 For example, a household robot should move a glass of water in an upright position without jerks while maintaining a safe distance from nearby electronic devices. [sent-21, score-0.559]

10 In another example, a robot checking out a kitchen knife at a grocery store should strictly move it at a safe distance from nearby humans. [sent-22, score-0.653]

11 For example, trajectories of heavy items should not pass over fragile items but rather move around them. [sent-24, score-0.405]

12 These preferences are often hard to describe and anticipate without knowing where and how the robot is deployed. [sent-25, score-0.598]

13 In this work we propose an algorithm for learning user preferences over trajectories through interactive feedback from the user in a co-active learning setting [31]. [sent-29, score-1.214]

14 Unlike in other learning settings, where a human first demonstrates optimal trajectories for a task to the robot, our learning model does not rely on the user’s ability to demonstrate optimal trajectories a priori. [sent-30, score-0.508]

15 From these interactive improvements the robot learns a general model of the user’s preferences in an online fashion. [sent-32, score-0.598]

16 We show empirically that a small number of such interactions is sufficient to adapt a robot to a changed task. [sent-33, score-0.408]

17 Since the user does not have to demonstrate a (near) optimal trajectory to the robot, we argue that our feedback is easier to provide and more widely applicable. [sent-34, score-0.852]

18 Figure 1: Zero-G feedback: Learning trajectory preferences from sub-optimal zero-G feedback. [sent-39, score-0.524]

19 (Left) Robot plans a bad trajectory (waypoints 1-2-4) with knife close to flower. [sent-40, score-0.376]

20 As feedback, user corrects waypoint 2 and moves it to waypoint 3. [sent-41, score-0.413]

21 In our empirical evaluation, we learn preferences for a high DoF Baxter robot on a variety of grocery checkout tasks. [sent-43, score-0.85]

22 By designing expressive trajectory features, we show how our algorithm learns preferences from online user feedback on a broad range of tasks for which object properties are of particular importance (e. [sent-44, score-1.236]

23 We extensively evaluate our approach on a set of 16 grocery checkout tasks, both in batch experiments as well as through robotic experiments wherein users provide their preferences on the robot. [sent-47, score-0.586]

24 Our results show that a robot trained using our algorithm not only quickly learns good trajectories on individual tasks, but also generalizes well to tasks that it has not seen before. [sent-48, score-0.704]

25 2 Related Work Teaching a robot to produce desired motions has been a long standing goal and several approaches have been studied. [sent-49, score-0.408]

26 In our setting, the user never discloses the optimal trajectory (or provides optimal feedback) to the robot, but instead, the robot learns preferences from sub-optimal suggestions on how the trajectory can be improved. [sent-54, score-1.541]

27 A recent work [37] leverages user feedback to learn rewards of a Markov decision process. [sent-58, score-0.55]

28 [5] in that it models sub-optimality in user feedback and theoretically converges to user’s hidden score function. [sent-61, score-0.572]

29 Our application scenario of learning trajectories for high DoF manipulations performing tasks in presence of different objects and environmental constraints goes beyond the application scenarios that previous works have considered. [sent-63, score-0.36]

30 We design appropriate features that consider robot configurations, object-object relations, and temporal behavior, and use them to learn a score function representing the preferences in trajectories. [sent-64, score-0.744]

31 [23] planned trajectories satisfying user-specified preferences in the form of constraints on the distance of the robot from the user, the visibility of the robot, and the user's arm comfort. [sent-70, score-1.894]

32 [8] used functional gradients [29] to optimize for legibility of robot trajectories. [sent-73, score-0.408]

33 We differ from these in that we learn score functions reflecting user preferences from implicit feedback. [sent-74, score-0.551]

34 For a given task, the robot is given a context x that describes the environment, the objects, and any other input relevant to the problem. [sent-76, score-0.408]

35 The robot has to figure out what is a good trajectory y for this context. [sent-77, score-0.742]

36 Formally, we assume that the user has a scoring function s∗ (x, y) that reflects how much he values each trajectory y for context x. [sent-78, score-0.682]

37 Instead, we merely assume that the user can provide us with preferences that reflect this scoring function. [sent-82, score-0.538]

38 The learning process proceeds through the following repeated cycle of interactions between robot and user. [sent-85, score-0.408]

39 Step 2: The user either lets the robot execute the top-ranked trajectory, or corrects the robot by providing an improved trajectory y . [sent-88, score-1.425]

40 Step 3: The robot now updates the parameter w of s(x, y; w) based on this preference feedback and returns to step 1. [sent-90, score-0.704]

41 The robot's performance will be measured in terms of regret, REG_T = (1/T) Σ_{t=1}^{T} [s*(x_t, y_t*) − s*(x_t, y_t)], which compares the robot's trajectory y_t at each time step t against the optimal trajectory y_t* maximizing the user's unknown scoring function s*(x, y), y_t* = argmax_y s*(x_t, y). [sent-92, score-1.424]
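As an illustration of this regret measure, the following sketch accumulates the per-step gap between the user's hidden score of the optimal trajectory and of the presented one; the scoring function and candidate sets are hypothetical placeholders, since in practice s* is never observed directly.

```python
import numpy as np

def average_regret(s_star, contexts, presented, candidates):
    """Average regret REG_T = (1/T) * sum_t [s*(x_t, y_t*) - s*(x_t, y_t)].

    s_star     -- the user's hidden scoring function s*(x, y) (assumed callable
                  here purely for illustration; it is unknown in practice)
    contexts   -- contexts x_1, ..., x_T
    presented  -- trajectories y_t presented by the robot at each step
    candidates -- candidate trajectory set for each step (used to find y_t*)
    """
    gaps = []
    for x_t, y_t, ys in zip(contexts, presented, candidates):
        best = max(s_star(x_t, y) for y in ys)   # s*(x_t, y_t*)
        gaps.append(best - s_star(x_t, y_t))     # per-step loss
    return float(np.mean(gaps))
```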

42 Regret characterizes the performance of the robot over its whole lifetime, therefore reflecting how well it performs throughout the learning process. [sent-94, score-0.408]

43 Since the ability to easily give preference feedback in Step 2 is crucial for making the robot learning system easy to use for humans, we designed two feedback mechanisms that enable the user to easily provide improved trajectories. [sent-97, score-1.222]

44 (a) Re-ranking: The user observes trajectories sequentially and clicks on the first trajectory that is better than the top-ranked trajectory. [sent-99, score-0.606]
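A toy model of this re-ranking feedback, where a simulated user with a hidden score function scans the ranked list and returns the first trajectory that beats the current top one; the names here are illustrative only.

```python
def rerank_feedback(ranked_trajs, user_score):
    """Return the first trajectory in the ranked list that the (simulated)
    user scores higher than the top-ranked one; otherwise keep the top one."""
    top = ranked_trajs[0]
    for y in ranked_trajs[1:]:
        if user_score(y) > user_score(top):
            return y
    return top
```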

45 (b) Zero-G: This feedback allows users to improve trajectory waypoints by physically changing the robot's arm configuration as shown in Figure 1. [sent-100, score-0.856]

46 This feedback is useful (i) for bootstrapping the robot, (ii) for avoiding local maxima where the top trajectories in the ranked list are all bad but ordered correctly, and (iii) when the user is satisfied with the top ranked trajectory except for minor errors. [sent-103, score-1.165]

47 A counterpart of this feedback is keyframe based LfD [2] where an expert demonstrates a sequence of optimal waypoints instead of the complete trajectory. [sent-104, score-0.401]

48 Note that in both re-ranking and zero-G feedback, the user never reveals the optimal trajectory to the algorithm but just provides a slightly improved trajectory. [sent-105, score-0.609]

49 s(x, y; w) = w · φ(x, y) (1), where w is a weight vector that needs to be learned and φ(·) are features describing trajectory y for context x. [sent-107, score-0.394]

50 We further decompose the score function into two parts, one only concerned with the objects the trajectory is interacting with, and the other with the object being manipulated and the environment. [sent-108, score-0.629]
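A minimal sketch of this decomposed linear score, with phi_O and phi_E standing in as hypothetical feature maps for the object-interaction part and the object-environment part:

```python
import numpy as np

def score(x, y, w_O, w_E, phi_O, phi_E):
    """s(x, y; w_O, w_E) = w_O . phi_O(x, y) + w_E . phi_E(x, y).

    phi_O and phi_E are placeholders for the two feature maps described in
    the text; each must return a vector matching its weight vector.
    """
    return float(np.dot(w_O, phi_O(x, y)) + np.dot(w_E, phi_E(x, y)))
```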

51 A few of the trajectory waypoints would be affected by the other objects in the environment. [sent-118, score-0.509]

52 Specifically, we connect an object o_k to a trajectory waypoint if the minimum distance to collision is less than a threshold or if o_k lies below the manipulated object ō. [sent-120, score-0.909]

53 The edge connecting y_j and o_k is denoted as (y_j, o_k) ∈ E. [sent-121, score-0.364]

54 Since it is the attributes [19] of the object that really matter in determining the trajectory quality, we represent each object with its attributes. [sent-122, score-0.592]

55 Each object o_k is represented by a binary attribute vector [l^1_k, ..., l^M_k], with each l^p_k ∈ {0, 1} indicating whether object o_k possesses property p. Figure 2: (Left) A grocery checkout environment with a few objects, where the robot was asked to check out the flowervase on the left to the right. [sent-125, score-1.277]

56 Then, for every (y_j, o_k) edge, we extract the following four features φ_oo(y_j, o_k): the projection of the minimum distance to collision along the x, y and z (vertical) axes, and a binary variable that is 1 if o_k lies vertically below ō and 0 otherwise. [sent-138, score-0.596]

57 We now define the score s_O(·) over this graph as follows: s_O(x, y; w_O) = Σ_{(y_j, o_k) ∈ E} Σ_{p,q=1}^{M} l^p_k l̄^q [w_pq · φ_oo(y_j, o_k)] (3). Here, the weight vector w_pq captures interaction between objects with properties p and q. [sent-139, score-0.389]
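One way to read equation (3) as code, assuming each object carries a binary attribute vector and phi_oo returns the four edge features; the pairing of the two attribute vectors (manipulated object vs. o_k) is our reading of the garbled sum, and all names are illustrative.

```python
import numpy as np

def score_objects(edges, attrs, l_manip, W, phi_oo):
    """s_O = sum over edges (y_j, o_k) and attribute pairs (p, q) of
    l_k^p * l_bar^q * (w_pq . phi_oo(y_j, o_k)).

    edges   -- list of (waypoint, object_id) pairs in the graph
    attrs   -- dict object_id -> binary attribute vector of length M
    l_manip -- binary attribute vector of the manipulated object (length M)
    W       -- array of shape (M, M, d); W[p, q] is the weight vector w_pq
    phi_oo  -- returns the length-d feature vector for an edge
    """
    total = 0.0
    for y_j, o_k in edges:
        feats = phi_oo(y_j, o_k)
        l_k = attrs[o_k]
        for p in range(len(l_k)):
            for q in range(len(l_manip)):
                total += l_k[p] * l_manip[q] * np.dot(W[p, q], feats)
    return total
```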

58 While a robot can reach the same operational space configuration for its wrist with different configurations of the arm, not all of them are preferred [38]. [sent-146, score-0.442]

59 Furthermore, humans like to anticipate a robot's moves, and to gain users' confidence the robot should produce predictable and legible motion [8]. [sent-150, score-0.987]

60 We divide a trajectory into three parts in time and compute 9 features for each of the parts. [sent-155, score-0.394]

61 Object orientation during the trajectory is crucial in deciding its quality. [sent-160, score-0.37]

62 In detail, we divide the trajectory into three equal parts, and for each part we compute the object's: (i) minimum vertical distance from the nearest surface below it. [sent-181, score-0.401]
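A sketch of this per-part computation, assuming a trajectory is a list of waypoints and vertical_clearance is a hypothetical helper returning the object's vertical distance from the nearest surface below it at a waypoint:

```python
def part_features(waypoints, vertical_clearance):
    """Split a trajectory into three equal parts in time and, for each part,
    compute the minimum vertical clearance of the manipulated object."""
    n = len(waypoints)
    bounds = [0, n // 3, 2 * n // 3, n]
    return [min(vertical_clearance(w) for w in waypoints[a:b])
            for a, b in zip(bounds, bounds[1:])]
```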

63 To capture the temporal variation of the object's distance from its surroundings ... Figure 3: (Top) A good and a bad trajectory for moving a mug. [sent-184, score-0.4]

64 3 Computing Trajectory Rankings: For obtaining the top trajectory (or a top few) for a given task with context x, we would like to maximize the current scoring function s(x, y; w_O, w_E). [sent-192, score-0.407]

65 Fortunately, the latter problem is easy to solve and simply amounts to sorting the trajectories by their trajectory scores s(x, y (i) ; wO , wE ). [sent-200, score-0.565]

66 Since our primary goal is to learn a score function on a sampled set of trajectories, we now describe our learning algorithm; for more literature on sampling trajectories we refer the readers to [9]. [sent-206, score-0.548]

67 4 Learning the Scoring Function: The goal is to learn the parameters w_O and w_E of the scoring function s(x, y; w_O, w_E) so that it can be used to rank trajectories according to the user's preferences. [sent-208, score-0.336]

68 Given a context x_t, the top-ranked trajectory y_t under the current parameters w_O and w_E, and the user's feedback trajectory ȳ_t, the TPP updates the weights in the directions φ_O(x_t, ȳ_t) − φ_O(x_t, y_t) and φ_E(x_t, ȳ_t) − φ_E(x_t, y_t), respectively. [sent-211, score-1.765]

69 We merely need to characterize by how much the feedback improves on the presented ranking, using the following definition of expected α-informative feedback: E_t[s*(x_t, ȳ_t)] ≥ s*(x_t, y_t) + α(s*(x_t, y_t*) − s*(x_t, y_t)). (Footnote 4: We query the PQP collision-checker plugin of OpenRave for these distances.) [sent-213, score-0.533]

70 The cost function (or its approximation) we learn can be fed to trajectory optimizers like CHOMP [29] or optimal planners like RRT* [15] to produce reasonably good trajectories. [sent-215, score-0.415]

71 This definition states that the user feedback ȳ_t should have a score that is, in expectation over the user's choices, higher than that of y_t by a fraction α ∈ (0, 1] of the maximum possible range s*(x_t, y_t*) − s*(x_t, y_t). [sent-217, score-1.188]
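A small helper for checking this expected α-informative condition at a single step, treating the expectation over user choices as already taken; all scores here would come from the hidden s*, so this is for analysis or simulation only.

```python
def is_alpha_informative(s_feedback, s_shown, s_optimal, alpha):
    """Check E[s*(x_t, ybar_t)] >= s*(x_t, y_t) + alpha*(s*(x_t, y_t*) - s*(x_t, y_t)).

    s_feedback -- (expected) score of the user's improved trajectory ybar_t
    s_shown    -- score of the trajectory y_t the robot presented
    s_optimal  -- score of the optimal trajectory y_t*
    alpha      -- fraction in (0, 1]
    """
    return s_feedback >= s_shown + alpha * (s_optimal - s_shown)
```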

72 Trajectory Preference Perceptron (TPP): for t = 1 to T, sample candidate trajectories {y^(1), ..., y^(n)}; present y_t = argmax_y s(x_t, y; w_O^(t), w_E^(t)); obtain user feedback ȳ_t; update w_O^(t+1) ← w_O^(t) + φ_O(x_t, ȳ_t) − φ_O(x_t, y_t) and w_E^(t+1) ← w_E^(t) + φ_E(x_t, ȳ_t) − φ_E(x_t, y_t); end for. [sent-229, score-1.332]
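A compact sketch of this update loop; trajectory sampling, the feature maps, and the feedback oracle are left as hypothetical callables.

```python
import numpy as np

def tpp(contexts, sample_trajs, phi_O, phi_E, get_feedback, d_O, d_E):
    """Online TPP updates, mirroring the algorithm above.

    contexts     -- iterable of task contexts x_t
    sample_trajs -- x -> list of candidate trajectories {y^(1), ..., y^(n)}
    phi_O, phi_E -- feature maps returning vectors of length d_O and d_E
    get_feedback -- (x, y) -> slightly improved trajectory ybar from the user
    """
    w_O, w_E = np.zeros(d_O), np.zeros(d_E)
    for x in contexts:
        candidates = sample_trajs(x)
        # present the highest-scoring sampled trajectory
        y = max(candidates,
                key=lambda traj: w_O @ phi_O(x, traj) + w_E @ phi_E(x, traj))
        y_bar = get_feedback(x, y)
        # perceptron-style updates toward the user's improved trajectory
        w_O = w_O + phi_O(x, y_bar) - phi_O(x, y)
        w_E = w_E + phi_E(x, y_bar) - phi_E(x, y)
    return w_O, w_E
```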

73 We evaluate our approach on 16 pick-and-place robotic tasks in a grocery store checkout setting. [sent-231, score-0.369]

74 We evaluate the quality of trajectories after the robot has grasped the items and while it moves them for checkout. [sent-233, score-0.68]

75 Our work complements previous works on grasping items [30, 21], pick and place tasks [11], and detecting bar code for grocery checkout [16]. [sent-234, score-0.36]

76 Hence the object's properties and the way the robot moves it in the environment are more relevant. [sent-236, score-0.513]

77 Our object-object interaction features allow the algorithm to learn preferences on trajectories for moving fragile objects like glasses and egg cartons, Figure 4 (middle). [sent-239, score-0.744]

78 3) Human centric: Sudden movements by the robot put the human in danger of getting hurt. [sent-240, score-0.454]

79 We consider activities where a robot manipulates sharp objects, e. [sent-241, score-0.442]

80 MMP attempts to make an expert's trajectory better than any other trajectory by a margin. Figure 4: (Left) Manipulation centric: a box of cornflakes doesn't interact much with surrounding items and is indifferent to orientation. [sent-255, score-0.779]

81 We therefore train MMP from online user feedback observed on a set of trajectories. [sent-261, score-0.518]

82 At every iteration we train a structural support vector machine (SSVM) [14] using all previous feedback as training examples, and use the learned weights to predict trajectory scores for the next iteration. [sent-263, score-0.577]

83 In addition to performing a user study on Baxter robot (Section 5. [sent-267, score-0.683]

84 While nDCG@1 is a suitable metric for autonomous robots that execute the top ranked trajectory, nDCG@3 is suitable for scenarios where the robot is supervised by humans. [sent-272, score-0.552]
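For reference, a standard nDCG@k computation of the kind used here, assuming graded relevance labels for the trajectories in ranked order; this follows the common linear-gain variant from information retrieval rather than any implementation detail of the paper.

```python
import numpy as np

def ndcg_at_k(relevances, k):
    """nDCG@k for one ranked list; relevances are graded labels in the order
    the system ranked the trajectories."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float(np.sum(rel * discounts))
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal * discounts))
    return dcg / idcg if idcg > 0 else 0.0
```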

85 Table 1: Comparison of algorithms (Geometric and other baselines) on manipulation-centric, environment-centric, and human-centric tasks, with per-category and mean scores; our algorithm performs better than the baseline algorithms. [sent-291, score-0.375]

86 Object trajectory features capture preferences related to the orientation of the object. [sent-370, score-0.62]

87 Robot arm configuration and object environment features capture preferences by detecting undesirable contorted arm configurations and maintaining safe distance from surrounding surfaces, respectively. [sent-371, score-0.823]

88 Figure 5: Study of generalization (nDCG@3) with change in object, environment and both: (a) same environment, different object; (b) new environment, same object; (c) new environment, different object. [sent-378, score-0.492]

89 3 Robotic Experiment: User Study in learning trajectories. We perform a user study of our system on the Baxter robot on a variety of tasks of varying difficulty. [sent-381, score-0.979]

90 This shows that our approach is practically realizable, and that the combination of re-ranking and zero-G feedback allows users to train the robot with only a few feedback interactions. [sent-382, score-0.545]

91 A set of 10 tasks of varying difficulty level was presented to users one at a time, and they were instructed to provide feedback until they were satisfied with the top ranked trajectory. [sent-385, score-0.441]

92 To quantify the quality of learning, each user evaluated their own trajectories (self score), the trajectories learned from the other users (cross score), and those predicted by Oracle-svm, on a Likert scale of 1-5 (where 5 is the best). [sent-386, score-0.829]

93 We also recorded the time a user took for each task—from start of training till the user was satisfied. [sent-387, score-0.55]

94 Table 2: Shows learning statistics for each user averaged over all tasks (# re-ranking feedbacks, # zero-G feedbacks, average time in min., trajectory quality). [sent-389, score-0.598]

95 The study shows each user on average took 3 re-ranking and 2 zero-G feedbacks to train Baxter. [sent-391, score-0.531]

96 (Right) Bar chart showing the average number of feedback and time required for each task. Future research in human-computer interaction, visualization and better user inter- [sent-453, score-0.564]

97 Unlike in standard learning from demonstration approaches, our framework does not require the user to provide optimal trajectories as training data, but can learn from iterative improvements. [sent-462, score-0.538]

98 In particular, we propose a set of trajectory features for which the TPP generalizes well on tasks which the robot has not seen before. [sent-464, score-0.867]

99 A bayesian approach for policy learning from trajectory preference queries. [sent-716, score-0.387]

100 Making planned paths look more human-like in humanoid robot manipulation planning. [sent-725, score-0.461]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('robot', 0.408), ('trajectory', 0.334), ('user', 0.275), ('tpp', 0.249), ('feedback', 0.243), ('trajectories', 0.231), ('preferences', 0.19), ('wo', 0.18), ('ok', 0.159), ('yt', 0.131), ('object', 0.129), ('centric', 0.125), ('ndcg', 0.113), ('waypoints', 0.111), ('checkout', 0.11), ('grocery', 0.11), ('environment', 0.105), ('users', 0.092), ('robotic', 0.084), ('dof', 0.083), ('mmp', 0.083), ('rss', 0.079), ('arm', 0.076), ('untrained', 0.074), ('scoring', 0.073), ('surrounding', 0.07), ('robots', 0.07), ('waypoint', 0.069), ('xt', 0.068), ('baxter', 0.067), ('tasks', 0.065), ('objects', 0.064), ('fragile', 0.061), ('features', 0.06), ('demonstrations', 0.058), ('contorted', 0.055), ('ijrr', 0.055), ('ratliff', 0.055), ('sisbot', 0.055), ('score', 0.054), ('manipulation', 0.053), ('preference', 0.053), ('lfd', 0.049), ('planners', 0.049), ('manipulated', 0.048), ('expert', 0.047), ('human', 0.046), ('yj', 0.046), ('feedbacks', 0.045), ('knife', 0.042), ('egg', 0.042), ('obj', 0.042), ('motion', 0.042), ('guration', 0.042), ('carton', 0.042), ('koppula', 0.042), ('oo', 0.042), ('owervase', 0.042), ('saxena', 0.042), ('wpq', 0.042), ('ranked', 0.041), ('items', 0.041), ('lk', 0.041), ('regret', 0.041), ('bagnell', 0.04), ('icra', 0.04), ('manipulators', 0.037), ('ssvm', 0.037), ('rrt', 0.037), ('orientation', 0.036), ('vertical', 0.036), ('moving', 0.035), ('activities', 0.034), ('grasping', 0.034), ('manipulator', 0.034), ('wrist', 0.034), ('elbow', 0.034), ('planning', 0.034), ('autonomous', 0.033), ('learn', 0.032), ('self', 0.032), ('move', 0.031), ('distance', 0.031), ('safe', 0.031), ('household', 0.03), ('imitation', 0.03), ('interaction', 0.029), ('jiang', 0.029), ('manual', 0.029), ('collision', 0.028), ('surfaces', 0.028), ('alami', 0.028), ('argmaxy', 0.028), ('chomp', 0.028), ('dragan', 0.028), ('env', 0.028), ('jerks', 0.028), ('legible', 0.028), ('likert', 0.028), ('mainprice', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 162 nips-2013-Learning Trajectory Preferences for Manipulators via Iterative Improvement

Author: Ashesh Jain, Brian Wojcik, Thorsten Joachims, Ashutosh Saxena

Abstract: We consider the problem of learning good trajectories for manipulation tasks. This is challenging because the criterion defining a good trajectory varies with users, tasks and environments. In this paper, we propose a co-active online learning framework for teaching robots the preferences of their users for object manipulation tasks. The key novelty of our approach lies in the type of feedback expected from the user: the human user does not need to demonstrate optimal trajectories as training data, but merely needs to iteratively provide trajectories that slightly improve over the trajectory currently proposed by the system. We argue that this co-active preference feedback can be more easily elicited from the user than demonstrations of optimal trajectories, which are often challenging and non-intuitive to provide on high degrees of freedom manipulators. Nevertheless, theoretical regret bounds of our algorithm match the asymptotic rates of optimal trajectory algorithms. We demonstrate the generalizability of our algorithm on a variety of grocery checkout tasks, for which the preferences were influenced not only by the object being manipulated but also by the surrounding environment.

2 0.26970196 255 nips-2013-Probabilistic Movement Primitives

Author: Alexandros Paraschos, Christian Daniel, Jan Peters, Gerhard Neumann

Abstract: Movement Primitives (MP) are a well-established approach for representing modular and re-usable robot movement generators. Many state-of-the-art robot learning successes are based MPs, due to their compact representation of the inherently continuous and high dimensional robot movements. A major goal in robot learning is to combine multiple MPs as building blocks in a modular control architecture to solve complex tasks. To this effect, a MP representation has to allow for blending between motions, adapting to altered task variables, and co-activating multiple MPs in parallel. We present a probabilistic formulation of the MP concept that maintains a distribution over trajectories. Our probabilistic approach allows for the derivation of new operations which are essential for implementing all aforementioned properties in one framework. In order to use such a trajectory distribution for robot movement control, we analytically derive a stochastic feedback controller which reproduces the given trajectory distribution. We evaluate and compare our approach to existing methods on several simulated as well as real robot scenarios. 1

3 0.14700136 165 nips-2013-Learning from Limited Demonstrations

Author: Beomjoon Kim, Amir massoud Farahmand, Joelle Pineau, Doina Precup

Abstract: We propose a Learning from Demonstration (LfD) algorithm which leverages expert data, even if they are very few or inaccurate. We achieve this by using both expert data, as well as reinforcement signals gathered through trial-and-error interactions with the environment. The key idea of our approach, Approximate Policy Iteration with Demonstration (APID), is that expert’s suggestions are used to define linear constraints which guide the optimization performed by Approximate Policy Iteration. We prove an upper bound on the Bellman error of the estimate computed by APID at each iteration. Moreover, we show empirically that APID outperforms pure Approximate Policy Iteration, a state-of-the-art LfD algorithm, and supervised learning in a variety of scenarios, including when very few and/or suboptimal demonstrations are available. Our experiments include simulations as well as a real robot path-finding task. 1

4 0.13984436 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning

Author: Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles Isbell, Andrea L. Thomaz

Abstract: A long term goal of Interactive Reinforcement Learning is to incorporate nonexpert human feedback to solve complex tasks. Some state-of-the-art methods have approached this problem by mapping human information to rewards and values and iterating over them to compute better control policies. In this paper we argue for an alternate, more effective characterization of human feedback: Policy Shaping. We introduce Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct policy labels. We compare Advise to state-of-the-art approaches and show that it can outperform them and is robust to infrequent and inconsistent human feedback.

5 0.13436376 16 nips-2013-A message-passing algorithm for multi-agent trajectory planning

Author: Jose Bento, Nate Derbinsky, Javier Alonso-Mora, Jonathan S. Yedidia

Abstract: We describe a novel approach for computing collision-free global trajectories for p agents with specified initial and final configurations, based on an improved version of the alternating direction method of multipliers (ADMM). Compared with existing methods, our approach is naturally parallelizable and allows for incorporating different cost functionals with only minor adjustments. We apply our method to classical challenging instances and observe that its computational requirements scale well with p for several cost functionals. We also show that a specialization of our algorithm can be used for local motion planning by solving the problem of joint optimization in velocity space. 1

6 0.12143077 348 nips-2013-Variational Policy Search via Trajectory Optimization

7 0.11820601 17 nips-2013-A multi-agent control framework for co-adaptation in brain-computer interfaces

8 0.11284514 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

9 0.09911298 7 nips-2013-A Gang of Bandits

10 0.097563073 266 nips-2013-Recurrent linear models of simultaneously-recorded neural populations

11 0.084499769 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

12 0.079716474 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards

13 0.077952527 230 nips-2013-Online Learning with Costly Features and Labels

14 0.072671928 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries

15 0.065468684 84 nips-2013-Deep Neural Networks for Object Detection

16 0.060482066 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models

17 0.058475994 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles

18 0.057419028 95 nips-2013-Distributed Exploration in Multi-Armed Bandits

19 0.056956705 85 nips-2013-Deep content-based music recommendation

20 0.055206958 136 nips-2013-Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.158), (1, -0.071), (2, -0.043), (3, -0.051), (4, 0.004), (5, -0.045), (6, 0.031), (7, 0.029), (8, -0.032), (9, -0.052), (10, -0.105), (11, -0.103), (12, -0.037), (13, 0.022), (14, -0.149), (15, 0.06), (16, -0.071), (17, 0.056), (18, 0.048), (19, 0.054), (20, -0.006), (21, -0.12), (22, 0.055), (23, 0.034), (24, -0.068), (25, -0.061), (26, 0.103), (27, 0.048), (28, 0.029), (29, -0.072), (30, 0.035), (31, 0.045), (32, 0.141), (33, -0.101), (34, -0.067), (35, 0.034), (36, -0.131), (37, -0.079), (38, -0.026), (39, -0.168), (40, -0.062), (41, 0.138), (42, 0.015), (43, -0.15), (44, 0.214), (45, -0.089), (46, -0.211), (47, -0.134), (48, 0.048), (49, -0.094)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97203732 162 nips-2013-Learning Trajectory Preferences for Manipulators via Iterative Improvement

Author: Ashesh Jain, Brian Wojcik, Thorsten Joachims, Ashutosh Saxena

Abstract: We consider the problem of learning good trajectories for manipulation tasks. This is challenging because the criterion defining a good trajectory varies with users, tasks and environments. In this paper, we propose a co-active online learning framework for teaching robots the preferences of its users for object manipulation tasks. The key novelty of our approach lies in the type of feedback expected from the user: the human user does not need to demonstrate optimal trajectories as training data, but merely needs to iteratively provide trajectories that slightly improve over the trajectory currently proposed by the system. We argue that this co-active preference feedback can be more easily elicited from the user than demonstrations of optimal trajectories, which are often challenging and non-intuitive to provide on high degrees of freedom manipulators. Nevertheless, theoretical regret bounds of our algorithm match the asymptotic rates of optimal trajectory algorithms. We demonstrate the generalizability of our algorithm on a variety of grocery checkout tasks, for whom, the preferences were not only influenced by the object being manipulated but also by the surrounding environment.1 1

2 0.86471099 255 nips-2013-Probabilistic Movement Primitives

Author: Alexandros Paraschos, Christian Daniel, Jan Peters, Gerhard Neumann

Abstract: Movement Primitives (MP) are a well-established approach for representing modular and re-usable robot movement generators. Many state-of-the-art robot learning successes are based MPs, due to their compact representation of the inherently continuous and high dimensional robot movements. A major goal in robot learning is to combine multiple MPs as building blocks in a modular control architecture to solve complex tasks. To this effect, a MP representation has to allow for blending between motions, adapting to altered task variables, and co-activating multiple MPs in parallel. We present a probabilistic formulation of the MP concept that maintains a distribution over trajectories. Our probabilistic approach allows for the derivation of new operations which are essential for implementing all aforementioned properties in one framework. In order to use such a trajectory distribution for robot movement control, we analytically derive a stochastic feedback controller which reproduces the given trajectory distribution. We evaluate and compare our approach to existing methods on several simulated as well as real robot scenarios. 1

3 0.5102284 165 nips-2013-Learning from Limited Demonstrations

Author: Beomjoon Kim, Amir massoud Farahmand, Joelle Pineau, Doina Precup

Abstract: We propose a Learning from Demonstration (LfD) algorithm which leverages expert data, even if they are very few or inaccurate. We achieve this by using both expert data, as well as reinforcement signals gathered through trial-and-error interactions with the environment. The key idea of our approach, Approximate Policy Iteration with Demonstration (APID), is that expert’s suggestions are used to define linear constraints which guide the optimization performed by Approximate Policy Iteration. We prove an upper bound on the Bellman error of the estimate computed by APID at each iteration. Moreover, we show empirically that APID outperforms pure Approximate Policy Iteration, a state-of-the-art LfD algorithm, and supervised learning in a variety of scenarios, including when very few and/or suboptimal demonstrations are available. Our experiments include simulations as well as a real robot path-finding task. 1

4 0.47524881 17 nips-2013-A multi-agent control framework for co-adaptation in brain-computer interfaces

Author: Josh S. Merel, Roy Fox, Tony Jebara, Liam Paninski

Abstract: In a closed-loop brain-computer interface (BCI), adaptive decoders are used to learn parameters suited to decoding the user’s neural response. Feedback to the user provides information which permits the neural tuning to also adapt. We present an approach to model this process of co-adaptation between the encoding model of the neural signal and the decoding algorithm as a multi-agent formulation of the linear quadratic Gaussian (LQG) control problem. In simulation we characterize how decoding performance improves as the neural encoding and adaptive decoder optimize, qualitatively resembling experimentally demonstrated closed-loop improvement. We then propose a novel, modified decoder update rule which is aware of the fact that the encoder is also changing and show it can improve simulated co-adaptation dynamics. Our modeling approach offers promise for gaining insights into co-adaptation as well as improving user learning of BCI control in practical settings.

5 0.42724064 16 nips-2013-A message-passing algorithm for multi-agent trajectory planning

Author: Jose Bento, Nate Derbinsky, Javier Alonso-Mora, Jonathan S. Yedidia

Abstract: We describe a novel approach for computing collision-free global trajectories for p agents with specified initial and final configurations, based on an improved version of the alternating direction method of multipliers (ADMM). Compared with existing methods, our approach is naturally parallelizable and allows for incorporating different cost functionals with only minor adjustments. We apply our method to classical challenging instances and observe that its computational requirements scale well with p for several cost functionals. We also show that a specialization of our algorithm can be used for local motion planning by solving the problem of joint optimization in velocity space. 1

6 0.39843953 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning

7 0.3347117 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

8 0.33135957 348 nips-2013-Variational Policy Search via Trajectory Optimization

9 0.32369679 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking

10 0.3084814 85 nips-2013-Deep content-based music recommendation

11 0.3079432 181 nips-2013-Machine Teaching for Bayesian Learners in the Exponential Family

12 0.30726144 266 nips-2013-Recurrent linear models of simultaneously-recorded neural populations

13 0.30723998 230 nips-2013-Online Learning with Costly Features and Labels

14 0.29816914 7 nips-2013-A Gang of Bandits

15 0.28544649 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

16 0.28263831 154 nips-2013-Learning Gaussian Graphical Models with Observed or Latent FVSs

17 0.28111371 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

18 0.27495524 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models

19 0.2707901 226 nips-2013-One-shot learning by inverting a compositional causal process

20 0.27067545 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.01), (16, 0.025), (33, 0.09), (34, 0.077), (41, 0.023), (49, 0.061), (56, 0.121), (70, 0.095), (77, 0.013), (83, 0.257), (85, 0.047), (89, 0.015), (93, 0.049), (95, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81101775 162 nips-2013-Learning Trajectory Preferences for Manipulators via Iterative Improvement

Author: Ashesh Jain, Brian Wojcik, Thorsten Joachims, Ashutosh Saxena

Abstract: We consider the problem of learning good trajectories for manipulation tasks. This is challenging because the criterion defining a good trajectory varies with users, tasks and environments. In this paper, we propose a co-active online learning framework for teaching robots the preferences of its users for object manipulation tasks. The key novelty of our approach lies in the type of feedback expected from the user: the human user does not need to demonstrate optimal trajectories as training data, but merely needs to iteratively provide trajectories that slightly improve over the trajectory currently proposed by the system. We argue that this co-active preference feedback can be more easily elicited from the user than demonstrations of optimal trajectories, which are often challenging and non-intuitive to provide on high degrees of freedom manipulators. Nevertheless, theoretical regret bounds of our algorithm match the asymptotic rates of optimal trajectory algorithms. We demonstrate the generalizability of our algorithm on a variety of grocery checkout tasks, for whom, the preferences were not only influenced by the object being manipulated but also by the surrounding environment.1 1

2 0.70739442 37 nips-2013-Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs

Author: Vikash Mansinghka, Tejas D. Kulkarni, Yura N. Perov, Josh Tenenbaum

Abstract: The idea of computer vision as the Bayesian inverse problem to computer graphics has a long history and an appealing elegance, but it has proved difficult to directly implement. Instead, most vision tasks are approached via complex bottom-up processing pipelines. Here we show that it is possible to write short, simple probabilistic graphics programs that define flexible generative models and to automatically invert them to interpret real-world images. Generative probabilistic graphics programs (GPGP) consist of a stochastic scene generator, a renderer based on graphics software, a stochastic likelihood model linking the renderer’s output and the data, and latent variables that adjust the fidelity of the renderer and the tolerance of the likelihood. Representations and algorithms from computer graphics are used as the deterministic backbone for highly approximate and stochastic generative models. This formulation combines probabilistic programming, computer graphics, and approximate Bayesian computation, and depends only on generalpurpose, automatic inference techniques. We describe two applications: reading sequences of degraded and adversarially obscured characters, and inferring 3D road models from vehicle-mounted camera images. Each of the probabilistic graphics programs we present relies on under 20 lines of probabilistic code, and yields accurate, approximately Bayesian inferences about real-world images. 1

3 0.68694699 5 nips-2013-A Deep Architecture for Matching Short Texts

Author: Zhengdong Lu, Hang Li

Abstract: Many machine learning problems can be interpreted as learning for matching two types of objects (e.g., images and captions, users and products, queries and documents, etc.). The matching level of two objects is usually measured as the inner product in a certain feature space, while the modeling effort focuses on mapping of objects from the original space to the feature space. This schema, although proven successful on a range of matching tasks, is insufficient for capturing the rich structure in the matching process of more complicated objects. In this paper, we propose a new deep architecture to more effectively model the complicated matching relations between two objects from heterogeneous domains. More specifically, we apply this model to matching tasks in natural language, e.g., finding sensible responses for a tweet, or relevant answers to a given question. This new architecture naturally combines the localness and hierarchy intrinsic to the natural language problems, and therefore greatly improves upon the state-of-the-art models. 1

4 0.61533391 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average

Author: Yao-Liang Yu

Abstract: It is a common practice to approximate “complicated” functions with more friendly ones. In large-scale machine learning applications, nonsmooth losses/regularizers that entail great computational challenges are usually approximated by smooth functions. We re-examine this powerful methodology and point out a nonsmooth approximation which simply pretends the linearity of the proximal map. The new approximation is justified using a recent convex analysis tool— proximal average, and yields a novel proximal gradient algorithm that is strictly better than the one based on smoothing, without incurring any extra overhead. Numerical experiments conducted on two important applications, overlapping group lasso and graph-guided fused lasso, corroborate the theoretical claims. 1

5 0.61232078 121 nips-2013-Firing rate predictions in optimal balanced networks

Author: David G. Barrett, Sophie Denève, Christian K. Machens

Abstract: How are firing rates in a spiking network related to neural input, connectivity and network function? This is an important problem because firing rates are a key measure of network activity, in both the study of neural computation and neural network dynamics. However, it is a difficult problem, because the spiking mechanism of individual neurons is highly non-linear, and these individual neurons interact strongly through connectivity. We develop a new technique for calculating firing rates in optimal balanced networks. These are particularly interesting networks because they provide an optimal spike-based signal representation while producing cortex-like spiking activity through a dynamic balance of excitation and inhibition. We can calculate firing rates by treating balanced network dynamics as an algorithm for optimising signal representation. We identify this algorithm and then calculate firing rates by finding the solution to the algorithm. Our firing rate calculation relates network firing rates directly to network input, connectivity and function. This allows us to explain the function and underlying mechanism of tuning curves in a variety of systems. 1

6 0.60840636 157 nips-2013-Learning Multi-level Sparse Representations

7 0.60509717 16 nips-2013-A message-passing algorithm for multi-agent trajectory planning

8 0.59862828 15 nips-2013-A memory frontier for complex synapses

9 0.59426224 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

10 0.59238869 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit

11 0.58167404 249 nips-2013-Polar Operators for Structured Sparse Estimation

12 0.5808965 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval

13 0.5781731 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

14 0.57619345 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks

15 0.57488012 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines

16 0.57411331 267 nips-2013-Recurrent networks of coupled Winner-Take-All oscillators for solving constraint satisfaction problems

17 0.57250291 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

18 0.57077432 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models

19 0.5705052 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables

20 0.56995279 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching