nips nips2008 nips2008-173 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Paul L. Ruvolo, Ian Fasel, Javier R. Movellan
Abstract: Many popular optimization algorithms, like the Levenberg-Marquardt algorithm (LMA), use heuristic-based “controllers” that modulate the behavior of the optimizer during the optimization process. For example, in the LMA a damping parameter λ is dynamically modified based on a set of rules that were developed using heuristic arguments. Reinforcement learning (RL) is a machine learning approach to learn optimal controllers from examples and thus is an obvious candidate to improve the heuristic-based controllers implicit in the most popular and heavily used optimization algorithms. Improving the performance of off-the-shelf optimizers is particularly important for time-constrained optimization problems. For example the LMA algorithm has become popular for many real-time computer vision problems, including object tracking from video, where only a small amount of time can be allocated to the optimizer on each incoming video frame. Here we show that a popular modern reinforcement learning technique using a very simple state space can dramatically improve the performance of general purpose optimizers, like the LMA. Surprisingly the controllers learned for a particular domain also work well in very different optimization domains. For example we used RL methods to train a new controller for the damping parameter of the LMA. This controller was trained on a collection of classic, relatively small, non-linear regression problems. The modified LMA performed better than the standard LMA on these problems. This controller also dramatically outperformed the standard LMA on a difficult computer vision problem for which it had not been trained. Thus the controller appeared to have extracted control rules that were not just domain specific but generalized across a range of optimization domains. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Many popular optimization algorithms, like the Levenberg-Marquardt algorithm (LMA), use heuristic-based “controllers” that modulate the behavior of the optimizer during the optimization process. [sent-7, score-0.545]
2 For example, in the LMA a damping parameter λ is dynamically modified based on a set of rules that were developed using heuristic arguments. [sent-8, score-0.255]
3 Reinforcement learning (RL) is a machine learning approach to learn optimal controllers from examples and thus is an obvious candidate to improve the heuristic-based controllers implicit in the most popular and heavily used optimization algorithms. [sent-9, score-0.723]
4 Improving the performance of off-the-shelf optimizers is particularly important for time-constrained optimization problems. [sent-10, score-0.305]
5 For example the LMA algorithm has become popular for many real-time computer vision problems, including object tracking from video, where only a small amount of time can be allocated to the optimizer on each incoming video frame. [sent-11, score-0.299]
6 Here we show that a popular modern reinforcement learning technique using a very simple state space can dramatically improve the performance of general purpose optimizers, like the LMA. [sent-12, score-0.225]
7 Surprisingly the controllers learned for a particular domain also work well in very different optimization domains. [sent-13, score-0.472]
8 For example we used RL methods to train a new controller for the damping parameter of the LMA. [sent-14, score-0.409]
9 This controller was trained on a collection of classic, relatively small, non-linear regression problems. [sent-15, score-0.23]
10 This controller also dramatically outperformed the standard LMA on a difficult computer vision problem for which it had not been trained. [sent-17, score-0.244]
11 Thus the controller appeared to have extracted control rules that were not just domain specific but generalized across a range of optimization domains. [sent-18, score-0.432]
12 1 Introduction Most popular optimization algorithms, like the Levenberg-Marquardt algorithm (LMA), use simple “controllers” that modulate the behavior of the optimization algorithm based on the state of the optimization process. [sent-19, score-0.735]
13 For example, in the LMA a damping factor λ modifies the descent step to behave more like Gradient Descent or more like the Gauss-Newton optimization algorithm [1, 2]. [sent-20, score-0.522]
14 The LMA uses the following heuristic for controlling λ: If an iteration of the LMA with the current damping factor λt reduces the error, then the new parameters produced by the LMA iteration are accepted and the damping factor is divided by a constant term η > 0, i.e., λt+1 = λt/η. [sent-21, score-0.7]
15 Otherwise, if the error is not reduced, the new parameters are not accepted, the damping factor is multiplied by η, and the LMA iteration is repeated with the new damping parameter. [sent-24, score-0.508]
16 While various heuristic arguments have been used to justify this particular way of controlling the damping factor, it is not clear whether this “controller” is optimal in any way or whether it can be significantly improved. [sent-25, score-0.255]
17 Improving the performance of off-the-shelf optimizers is particularly important for time-constrained optimization problems. [sent-26, score-0.305]
18 For example the LMA algorithm has become popular for many real-time computer vision problems, including object tracking from video, where only a small amount of time can be allocated to the optimizer on each incoming video frame. [sent-27, score-0.299]
19 Time constrained optimization is in fact becoming an increasingly important problem in applications such as operations research, robotics, and machine perception. [sent-28, score-0.191]
20 Given the special properties of time constrained optimization problems it is likely that the heuristic-based controllers used in off-the-shelf optimizers may not be particularly efficient. [sent-30, score-0.532]
21 Additionally, standard techniques for non-linear optimization like the LMA do not address issues such as when to stop a fruitless local search or when to revisit a previously visited part of the parameter space. [sent-31, score-0.234]
22 Reinforcement learning (RL) is a machine learning approach to learn optimal controllers by examples and thus is an obvious candidate to improve the heuristic-based controllers used in the most popular and heavily used optimization algorithms. [sent-32, score-0.723]
23 An advantage of RL methods over other approaches to optimal control is that they do not require prior knowledge of the underlying system dynamics and the system designer is free to choose reward metrics that best match the desiderata for controller performance. [sent-33, score-0.309]
24 For example, in the case of optimization under time constraints a suitable reward could be to achieve the minimum loss within a fixed amount of time. [sent-34, score-0.355]
25 2 Related Work The idea of using RL in optimization problems is not new [3, 4, 5, 6, 7]. [sent-35, score-0.216]
26 Here our focus is on using RL methods to modify the controllers implicit in the most popular and heavily used optimization algorithms. [sent-37, score-0.49]
27 In particular our goal is to make these algorithms more efficient for optimization on time budget problems. [sent-38, score-0.406]
28 As we will soon show, a simple RL approach can result in dramatic improvements in performance of these popular optimization packages. [sent-39, score-0.254]
29 There has also been some work on empirical evaluations of the LMA algorithm versus other nonlinear optimization methods in the computer vision community. [sent-40, score-0.331]
30 For example, on each iteration of gradient descent, parameters are changed in the opposite direction of the gradient of the loss function, e.g., [sent-46, score-0.223]
31 xk+1 = xk − η ∇x f(xk) (1). Steepest Descent has convergence guarantees provided the value of η is reduced over the course of the optimization, and in general it is robust but quite slow. [sent-48, score-0.336]
32 Figure 1 (a heuristic algorithm for updating lambda during Levenberg-Marquardt non-linear least squares optimization): if f(xk)^T f(xk) > f(xk−1)^T f(xk−1), then set xk ← xk−1 and λ ← η × λ; otherwise set λ ← (1/η) × λ. [sent-51, score-0.264]
33 Each iteration of the Gauss-Newton method is of the following form: xk+1 = xk − H^{-1} d (2). The Gauss-Newton method has a much faster convergence rate than gradient descent; however, it is not as robust as gradient descent. [sent-55, score-0.296]
34 Levenberg-Marquardt [1] is a popular optimization algorithm that attempts to blend gradient descent and Gauss-Newton in order to obtain both the fast convergence rate of Gauss-Newton and the convergence guarantees of gradient descent. [sent-57, score-0.408]
35 The algorithm has the following update rule: xk+1 = xk − (H + λdiag(H))−1 d (3) This update rule is also known as damped Gauss-Newton because the λ parameter serves to dampen the Gauss-Newton step by blending it with the gradient descent step. [sent-58, score-0.279]
36 Marquardt proposed a heuristic-based control law to dynamically modify λ during the optimization process (see Figure 1). [sent-59, score-0.278]
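To make equations (2)–(3) and the Figure 1 heuristic concrete, here is a minimal sketch of one LMA iteration. It assumes the standard Gauss-Newton choices H = J^T J and d = J^T f(x) for a residual vector f and its Jacobian J; the function names and the value η = 10 are our assumptions, not specifics from the paper.

```python
import numpy as np

def lma_step(x, lam, residual_fn, jacobian_fn, eta=10.0):
    # One damped Gauss-Newton step (equation 3) plus the Figure 1
    # heuristic for updating the damping factor lambda.
    r = residual_fn(x)          # residual vector f(x)
    J = jacobian_fn(x)          # Jacobian of the residuals at x
    H = J.T @ J                 # Gauss-Newton approximation to the Hessian
    d = J.T @ r                 # gradient of 0.5 * ||f(x)||^2
    x_new = x - np.linalg.solve(H + lam * np.diag(np.diag(H)), d)

    r_new = residual_fn(x_new)
    if r_new @ r_new > r @ r:   # error increased: reject the step, damp more
        return x, lam * eta
    return x_new, lam / eta     # error decreased: accept the step, damp less
```

Dividing λ by η pushes subsequent steps toward Gauss-Newton behavior, while multiplying by η pushes them toward gradient descent, which is exactly the blending described above.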
37 The LMA algorithm has recently become a very popular approach to solve real-time problems in computer vision [9, 10, 11], such as object tracking and feature tracking in video. [sent-61, score-0.247]
38 Due to the special nature of this problem it is unclear whether the heuristic-based controller embedded in the algorithm is optimal or could be significantly improved upon. [sent-62, score-0.207]
39 In the remainder of this document we explore whether reinforcement learning methods can help improve the performance of LMA by developing an empirically learned controller of the damping factor rather than the commonly used heuristic controller. [sent-63, score-0.648]
40 4 Learning Control Policies for Optimization Algorithms An optimizer is an algorithm that uses some statistics about the current progress of the optimization in order to produce a next iterate to evaluate. [sent-64, score-0.378]
41 It is natural to frame optimization in the language of control theory by thinking of the statistics of the optimization progress used by the controller to choose the next iterate as the control state and the next iterate to visit as the control action. [sent-65, score-0.901]
42 In this work we choose to restrict our state space to a few statistics that capture both the current time constraints and the recent progress of the optimization procedure. [sent-66, score-0.344]
43 The action space is restricted by making the observation that current methods for non-linear optimization provide good suggestions for the next point to visit. [sent-67, score-0.328]
44 In order to define the optimality of a controller we define a reward function that indicates the desirability of the solution found during optimization. [sent-69, score-0.317]
45 In the context of optimization with semi-rigid time constraints an appropriate reward function balances reduction in loss of the objective function with the number of steps needed to achieve that reduction. [sent-70, score-0.409]
46 In the case of optimization with a fixed budget, a more natural choice might be the overall reduction in the loss function within the allotted budget of function evaluations. [sent-71, score-0.532]
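As a hedged illustration of these two reward notions (the exact definitions are not spelled out here), one might write:

```python
def fixed_budget_reward(loss_trace):
    # Reward for the fixed-budget setting: overall reduction in loss
    # achieved within the allotted budget of function evaluations.
    return loss_trace[0] - min(loss_trace)

def time_constrained_reward(loss_before, loss_after, step_penalty=0.01):
    # One possible per-step reward for semi-rigid time constraints:
    # loss reduction traded off against the cost of taking another step.
    # step_penalty is an illustrative constant, not a value from the paper.
    return (loss_before - loss_after) - step_penalty
```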
47 The construction of the approximate action-value function and the policy improvement step are performed using the techniques outlined in [12]. [sent-73, score-0.218]
48 The reward function could be modified to include features of intermediate solutions that are likely to indicate the desirability of the current point. [sent-74, score-0.17]
49 Given a state space, action space, and reward function for a given optimization problem, reinforcement learning methods provide an appropriate set of techniques for learning an optimal optimization controller. [sent-75, score-0.662]
50 LSPI is an iterative procedure that repeatedly applies the following two steps until convergence: approximating the action-value function as a linear combination of a fixed set of basis functions and then improving the current policy greedily over the approximate value function. [sent-78, score-0.231]
51 The bases are functions of the state and action and can be non-linear. [sent-79, score-0.165]
52 The output of the LSPI procedure is a weight vector that defines the action-value function of the optimal policy as a linear combination of the basis vectors. [sent-81, score-0.173]
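A minimal sketch of this loop as we read it (essentially LSTDQ-based policy iteration in the style of [12]); the discount factor, iteration cap, ridge term, and tolerance are our assumptions:

```python
import numpy as np

def lspi(samples, phi, actions, gamma=0.95, n_iter=20, ridge=1e-6, tol=1e-6):
    # samples: list of (s, a, r, s_next); phi(s, a) returns a feature vector.
    k = phi(samples[0][0], samples[0][1]).shape[0]
    w = np.zeros(k)
    for _ in range(n_iter):
        A = ridge * np.eye(k)
        b = np.zeros(k)
        for s, a, r, s_next in samples:
            f = phi(s, a)
            # greedy action of the current policy in the successor state
            a_next = max(actions, key=lambda ap: phi(s_next, ap) @ w)
            A += np.outer(f, f - gamma * phi(s_next, a_next))
            b += r * f
        w_new = np.linalg.solve(A, b)
        if np.linalg.norm(w_new - w) < tol:
            break
        w = w_new
    return w  # weights of the approximate action-value function
```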
53 Our method for learning an optimization controller consists of two phases. [sent-82, score-0.398]
54 In the first phase samples are collected through interactions between a random optimization controller and an optimization problem in a series of fixed length optimization episodes. [sent-83, score-0.802]
55 These samples are tuples of the form (s, a, r, s′), where s′ denotes the state arrived at when action a was executed starting from state s and reward r was received. [sent-84, score-0.276]
56 The second phase of our algorithm applies LSPI to learn an action-value function and implicitly an optimal policy (which is given by the greedy maximization of the action-value function over actions for a given state). [sent-85, score-0.206]
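Schematically, the two phases might be wired together as below; the `env` object with `reset`/`step` methods, the episode length, and the seed are placeholder assumptions for pieces the text does not specify (`lspi` is the sketch above, and `phi` is a state–action feature map of the kind described in Section 5.1).

```python
import random

def collect_samples(env, actions, n_episodes=100, episode_len=20, seed=0):
    # Phase 1: a random controller interacts with the optimization problem
    # for a series of fixed-length episodes, recording (s, a, r, s') tuples.
    rng = random.Random(seed)
    samples = []
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(episode_len):
            a = rng.choice(actions)
            s_next, r = env.step(a)   # assumed interface: next state and reward
            samples.append((s, a, r, s_next))
            s = s_next
    return samples

def greedy_policy(w, phi, actions):
    # Phase 2 output: act greedily with respect to Q(s, a) = w . phi(s, a).
    return lambda s: max(actions, key=lambda a: phi(s, a) @ w)
```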
57 5 Experiments We demonstrate the ability of our method both to achieve performance superior to off-the-shelf nonlinear optimization techniques and to provide insight into the specific policies and action-value functions learned. [sent-87, score-0.363]
58 1 Optimizing Nonlinear Least-Squares Functions with a Fixed Budget Both the classical non-linear problems and the facial expression recognition task were formulated in terms of optimization given a fixed budget of function evaluations. [sent-89, score-0.554]
59 Each optimization problem takes the form of minimizing a sum of squares of non-linear functions and thus is well-suited to Levenberg-Marquardt style optimization. [sent-91, score-0.28]
60 These additional actions include moving to a new random point in the domain of the objective function and also returning to the best point found so far and performing one descent step using the LMA (using the current damping factor). [sent-93, score-0.396]
61 The number of actions available at each step is 8 (6 for various combinations of adjustments to λ and returning to the previous iterate, along with the 2 additional actions just described). [sent-94, score-0.219]
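The text does not enumerate the six λ/iterate combinations, so the encoding below is only a plausible guess (three λ adjustments crossed with whether to return to the previous iterate), with the two extra actions added as described above:

```python
from itertools import product

LAMBDA_MOVES = ("increase", "decrease", "keep")   # assumed set of lambda adjustments
REVERT_CHOICES = (False, True)                    # return to the previous iterate?

ACTIONS = [("adjust", mv, revert) for mv, revert in product(LAMBDA_MOVES, REVERT_CHOICES)]
ACTIONS += [("random_restart",),   # jump to a new random point in the domain
            ("lma_from_best",)]    # return to the best point so far, take one LMA step

assert len(ACTIONS) == 8
```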
62 The state space used to make the action decision includes a fixed-length window of history that encodes whether a particular step in the past increased or decreased the residual error from the previous iterate. [sent-95, score-0.349]
63 This window is set to size 2 for most of our experiments; however, we did evaluate the relative improvement of using a window size of 1 versus 2 (see Figure 4). [sent-96, score-0.178]
64 Also included in the state space are the number of function evaluations left in our budget and a problem-specific state feature described in Section 5. [sent-97, score-0.488]
65 The state and action space are mapped through a collection of fixed basis functions which the LSPI algorithm combines linearly to approximate the optimal action-value function. [sent-99, score-0.255]
66 For most applications of LSPI these functions consist of radial-basis functions distributed throughout the continuous state and action space. [sent-100, score-0.188]
67 The basis we use in our problem treats each action independently and thus constructs a tuple of basis functions for each action. [sent-101, score-0.205]
68 To encode the number of evaluations left in the optimization episode, we use a collection of radial-basis functions centered at different values of budget remaining (specifically we use basis functions spaced at 4 step intervals with a bandwidth of . [sent-102, score-0.613]
69 The history window recording whether the loss went up or down during recent iterations of the algorithm is represented as a d-dimensional binary vector, where d is the length of the history window considered. [sent-104, score-0.342]
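A sketch of how such a basis could be assembled; the maximum budget and the RBF bandwidth are assumed values (the bandwidth reported in the text above is truncated), and the per-action block layout follows the description of treating each action independently:

```python
import numpy as np

def state_features(budget_left, history, max_budget=40, bandwidth=4.0):
    # Radial-basis features on the remaining budget, spaced at 4-step intervals.
    centers = np.arange(0, max_budget + 1, 4)
    rbf = np.exp(-((budget_left - centers) ** 2) / (2.0 * bandwidth ** 2))
    # Binary history window: one entry per recent step, 1 if the loss increased.
    return np.concatenate([rbf, np.asarray(history, dtype=float)])

def make_phi(actions, featurize=state_features):
    # Per-action block basis: each action owns its own copy of the state
    # features, so the action-value weights are learned independently per action.
    def phi(state, action):
        f = featurize(*state)
        out = np.zeros(len(actions) * f.shape[0])
        i = actions.index(action)
        out[i * f.shape[0]:(i + 1) * f.shape[0]] = f
        return out
    return phi
```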
70 2 Classical Nonlinear Least Squares Problems In order to validate our approach we apply it to a dataset of classical non-linear optimization problems [13]. [sent-108, score-0.261]
71 This dataset of problems includes famous optimization problems that cover a wide variety of non-linear behavior. [sent-109, score-0.241]
72 When restricted to a budget of 5 function evaluations, our method is able to learn a policy which results in a 6% gain in performance (measured in total reduction in loss from the starting point) when compared to the LMA. [sent-111, score-0.504]
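For reference, the comparison metric as we read it (total reduction in loss from the starting point, relative to the LMA baseline under the same budget) could be computed as:

```python
def relative_gain(loss_trace_learned, loss_trace_lma):
    # Relative improvement in total loss reduction versus the standard LMA,
    # both controllers given the same budget of function evaluations.
    reduction_learned = loss_trace_learned[0] - min(loss_trace_learned)
    reduction_lma = loss_trace_lma[0] - min(loss_trace_lma)
    return (reduction_learned - reduction_lma) / reduction_lma
```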
73 In our work we frame the problem of feature selection as an optimization procedure over a continuous parameter space. [sent-115, score-0.231]
74 We learn a detector for the presence or absence of a smile using the pixel intensities of an image patch containing a face. [sent-118, score-0.299]
75 We test our algorithm on the task of smile detection using a subset of 1,000 images from the GENKI dataset (which is a collection of 60,000 faces from the web). [sent-130, score-0.294]
76 In this experiment our goal is to predict the human smile labels using the L2boost procedure outlined above. [sent-132, score-0.216]
77 After collecting samples from 100 episodes of optimization on the GENKI dataset, LSPI is able to learn a policy that achieves a 2. [sent-136, score-0.354]
78 Figure 4 shows that the policies learned using our method not only achieve a greater reduction in loss on the training set, but also that this reduction in loss translates to a significant gain in classification performance on a validation set of test images. [sent-140, score-0.451]
79 083 better classification performance (as measured by area under the ROC curve) depending on the optimization budget. [sent-143, score-0.191]
80 Note that given the relatively high baseline performance of the LMA on the smile detection task, an improvement of . [sent-144, score-0.272]
81 A policy that uses a history window of error changes over the last two time steps achieves a 16% greater reduction in total loss than a policy learned with a history window of size 1. [sent-147, score-0.763]
82 Also of interest is the nature of the policies learned for smile detection on a fixed budget. [sent-148, score-0.418]
83 Figure 4: Top: The performance on detecting smile versus non-smile is substantially better when using an optimization controller learned with our algorithm than when using the default LMA. [sent-158, score-0.882]
84 Bottom: This table describes the relative improvement in total loss reduction for policies learned using our method. [sent-161, score-0.332]
85 However, if the error is decreasing, it is best to continue to apply local optimization methods. [sent-162, score-0.191]
86 Later in the optimization, the policy always performs a Levenberg-Marquardt step on the current best point no matter what the change in error was. [sent-163, score-0.189]
87 This strategy makes sense since once a few different parts of the state space have been investigated the utility of sampling a new part of the state space is reduced. [sent-164, score-0.184]
88 The first trend is that the learned policy favors discarding the last iterate versus keeping it (similar to the LMA). [sent-166, score-0.331]
89 The second trend is that the policy favors increasing the damping parameter when the error has increased on the last iteration and decreasing the damping factor when the error has decreased (also similar to the LMA). [sent-167, score-0.725]
90 4 Cross Generalization A property of choosing a general state space for our method is that the policies learned on one class of optimization problem are applicable to other classes of optimization. [sent-169, score-0.459]
91 The optimization controllers learned in the classical least squares minimization task achieve a 19% improvement over the standard LMA on the smile detection task. [sent-170, score-0.879]
92 Applying the controllers learned on the smile detection task to the classical least squares problem yields a more modest 5% improvement. [sent-171, score-0.634]
93 These results support the claim that our method is extracting useful structure for optimizing under a fixed budget and not simply learning a controller that is amenable to a particular problem domain. [sent-172, score-0.474]
94 6 Conclusion We have presented a novel approach to the problem of learning optimization procedures for optimization on a fixed budget. [sent-173, score-0.382]
95 We have shown that our approach achieves better performance than ubiquitous methods for non-linear least squares optimization on the task of optimizing within a fixed budget of function evaluations for both classical non-linear functions and a difficult computer vision task. [sent-174, score-0.676]
96 We have also provided an analysis of the patterns learned by our method and how they make sense in the context of optimization under a fixed budget. [sent-175, score-0.27]
97 For instance, by incorporating domain-specific features into the state space, richer policies might be learned. [sent-179, score-0.214]
98 III, “Global search in combinatorial optimization using reinforcement learning algorithms,” in Proceedings of the Congress on Evolutionary Computation, vol. [sent-192, score-0.261]
99 Moore, “Learning evaluation functions for global optimization and boolean satisfiability,” in AAAI/IAAI, 1998, pp. [sent-209, score-0.214]
100 Argyros, “Is levenberg-marquardt the most efficient optimization algorithm for implementing bundle adjustment? [sent-224, score-0.218]
wordName wordTfidf (topN-words)
[('lma', 0.665), ('budget', 0.215), ('controller', 0.207), ('controllers', 0.202), ('damping', 0.202), ('optimization', 0.191), ('smile', 0.182), ('xk', 0.145), ('lspi', 0.14), ('policy', 0.132), ('policies', 0.097), ('optimizers', 0.091), ('crossbar', 0.083), ('rl', 0.079), ('learned', 0.079), ('facial', 0.078), ('action', 0.076), ('evaluations', 0.075), ('window', 0.074), ('loss', 0.072), ('descent', 0.07), ('reinforcement', 0.07), ('reward', 0.068), ('iteration', 0.067), ('lter', 0.067), ('optimizer', 0.067), ('squares', 0.066), ('state', 0.066), ('popular', 0.063), ('genki', 0.062), ('xopt', 0.062), ('history', 0.061), ('detection', 0.06), ('iterate', 0.059), ('marquardt', 0.055), ('reduction', 0.054), ('heuristic', 0.053), ('classical', 0.045), ('visited', 0.043), ('actions', 0.043), ('desirability', 0.042), ('movellan', 0.042), ('gradient', 0.042), ('default', 0.041), ('basis', 0.041), ('tracking', 0.041), ('feature', 0.04), ('offset', 0.04), ('vision', 0.037), ('factor', 0.037), ('video', 0.037), ('patch', 0.036), ('levenberg', 0.036), ('current', 0.035), ('trend', 0.035), ('heavily', 0.034), ('control', 0.034), ('outlined', 0.034), ('boyan', 0.033), ('modulate', 0.033), ('allocated', 0.031), ('learn', 0.031), ('roc', 0.031), ('improvement', 0.03), ('restarts', 0.029), ('faces', 0.029), ('depicted', 0.029), ('filter', 0.028), ('adjustments', 0.028), ('amenable', 0.028), ('episode', 0.028), ('nonlinear', 0.028), ('height', 0.027), ('bundle', 0.027), ('space', 0.026), ('favors', 0.026), ('progress', 0.026), ('box', 0.025), ('features', 0.025), ('san', 0.025), ('pixel', 0.025), ('image', 0.025), ('problems', 0.025), ('achieve', 0.024), ('horizontal', 0.024), ('tuple', 0.024), ('increased', 0.024), ('returning', 0.024), ('xed', 0.024), ('optimizing', 0.024), ('diego', 0.024), ('boxes', 0.024), ('particularly', 0.023), ('collection', 0.023), ('translates', 0.023), ('incoming', 0.023), ('functions', 0.023), ('interactions', 0.022), ('step', 0.022), ('vertical', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach
Author: Paul L. Ruvolo, Ian Fasel, Javier R. Movellan
Abstract: Many popular optimization algorithms, like the Levenberg-Marquardt algorithm (LMA), use heuristic-based “controllers” that modulate the behavior of the optimizer during the optimization process. For example, in the LMA a damping parameter λ is dynamically modified based on a set of rules that were developed using heuristic arguments. Reinforcement learning (RL) is a machine learning approach to learn optimal controllers from examples and thus is an obvious candidate to improve the heuristic-based controllers implicit in the most popular and heavily used optimization algorithms. Improving the performance of off-the-shelf optimizers is particularly important for time-constrained optimization problems. For example the LMA algorithm has become popular for many real-time computer vision problems, including object tracking from video, where only a small amount of time can be allocated to the optimizer on each incoming video frame. Here we show that a popular modern reinforcement learning technique using a very simple state space can dramatically improve the performance of general purpose optimizers, like the LMA. Surprisingly the controllers learned for a particular domain also work well in very different optimization domains. For example we used RL methods to train a new controller for the damping parameter of the LMA. This controller was trained on a collection of classic, relatively small, non-linear regression problems. The modified LMA performed better than the standard LMA on these problems. This controller also dramatically outperformed the standard LMA on a difficult computer vision problem for which it had not been trained. Thus the controller appeared to have extracted control rules that were not just domain specific but generalized across a range of optimization domains. 1
2 0.14871043 131 nips-2008-MDPs with Non-Deterministic Policies
Author: Mahdi M. Fard, Joelle Pineau
Abstract: Markov Decision Processes (MDPs) have been extensively studied and used in the context of planning and decision-making, and many methods exist to find the optimal policy for problems modelled as MDPs. Although finding the optimal policy is sufficient in many domains, in certain applications such as decision support systems where the policy is executed by a human (rather than a machine), finding all possible near-optimal policies might be useful as it provides more flexibility to the person executing the policy. In this paper we introduce the new concept of non-deterministic MDP policies, and address the question of finding near-optimal non-deterministic policies. We propose two solutions to this problem, one based on a Mixed Integer Program and the other one based on a search algorithm. We include experimental results obtained from applying this framework to optimize treatment choices in the context of a medical decision support system. 1
3 0.13300759 195 nips-2008-Regularized Policy Iteration
Author: Amir M. Farahmand, Mohammad Ghavamzadeh, Shie Mannor, Csaba Szepesvári
Abstract: In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a flexible function approximation scheme we propose the use of non-parametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by adding L2 -regularization to two widely-used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD). We derive efficient implementation for our algorithms when the approximate value-functions belong to a reproducing kernel Hilbert space. We also provide finite-sample performance bounds for our algorithms and show that they are able to achieve optimal rates of convergence under the studied conditions. 1
4 0.11954349 181 nips-2008-Policy Search for Motor Primitives in Robotics
Author: Jens Kober, Jan R. Peters
Abstract: Many motor skills in humanoid robotics can be learned using parametrized motor primitives as done in imitation learning. However, most interesting motor learning problems are high-dimensional reinforcement learning problems often beyond the reach of current methods. In this paper, we extend previous work on policy learning from the immediate reward case to episodic reinforcement learning. We show that this results in a general, common framework also connected to policy gradient methods and yielding a novel algorithm for policy learning that is particularly well-suited for dynamic motor primitives. The resulting algorithm is an EM-inspired algorithm applicable to complex motor learning tasks. We compare this algorithm to several well-known parametrized policy search methods and show that it outperforms them. We apply it in the context of motor learning and show that it can learn a complex Ball-in-a-Cup task using a real Barrett WAMTM robot arm. 1
Author: Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári
Abstract: We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by the LSTD, but without LSTD’s quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods. 1 Off-policy learning methods Off-policy methods have an important role to play in the larger ambitions of modern reinforcement learning. In general, updates to a statistic of a dynamical process are said to be “off-policy” if their distribution does not match the dynamics of the process, particularly if the mismatch is due to the way actions are chosen. The prototypical example in reinforcement learning is the learning of the value function for one policy, the target policy, using data obtained while following another policy, the behavior policy. For example, the popular Q-learning algorithm (Watkins 1989) is an offpolicy temporal-difference algorithm in which the target policy is greedy with respect to estimated action values, and the behavior policy is something more exploratory, such as a corresponding greedy policy. Off-policy methods are also critical to reinforcement-learning-based efforts to model human-level world knowledge and state representations as predictions of option outcomes (e.g., Sutton, Precup & Singh 1999; Sutton, Rafols & Koop 2006). Unfortunately, off-policy methods such as Q-learning are not sound when used with approximations that are linear in the learned parameters—the most popular form of function approximation in reinforcement learning. Counterexamples have been known for many years (e.g., Baird 1995) in which Q-learning’s parameters diverge to infinity for any positive step size. This is a severe problem in so far as function approximation is widely viewed as necessary for large-scale applications of reinforcement learning. The need is so great that practitioners have often simply ignored the problem and continued to use Q-learning with linear function approximation anyway. Although no instances ∗ Csaba Szepesv´ ri is on leave from MTA SZTAKI. a 1 of absolute divergence in applications have been reported in the literature, the potential for instability is disturbing and probably belies real but less obvious problems. The stability problem is not specific to reinforcement learning. Classical dynamic programming methods such as value and policy iteration are also off-policy methods and also diverge on some problems when used with linear function approximation. Reinforcement learning methods are actually an improvement over conventional dynamic programming methods in that at least they can be used stably with linear function approximation in their on-policy form. 
The stability problem is also not due to the interaction of control and prediction, or to stochastic approximation effects; the simplest counterexamples are for deterministic, expected-value-style, synchronous policy evaluation (see Baird 1995; Sutton & Barto 1998). Prior to the current work, the possibility of instability could not be avoided whenever four individually desirable algorithmic features were combined: 1) off-policy updates, 2) temporal-difference learning, 3) linear function approximation, and 4) linear complexity in memory and per-time-step computation. If any one of these four is abandoned, then stable methods can be obtained relatively easily. But each feature brings value and practitioners are loath to give any of them up, as we discuss later in a penultimate related-work section. In this paper we present the first algorithm to achieve all four desirable features and be stable and convergent for all finite Markov decision processes, all target and behavior policies, and all feature representations for the linear approximator. Moreover, our algorithm does not use importance sampling and can be expected to be much better conditioned and of lower variance than importance sampling methods. Our algorithm can be viewed as performing stochastic gradient-descent in a novel objective function whose optimum is the least-squares TD solution. Our algorithm is also incremental and suitable for online use just as are simple temporaldifference learning algorithms such as Q-learning and TD(λ) (Sutton 1988). Our algorithm can be broadly characterized as a gradient-descent version of TD(0), and accordingly we call it GTD(0). 2 Sub-sampling and i.i.d. formulations of temporal-difference learning In this section we formulate the off-policy policy-evaluation problem for one-step temporaldifference learning such that the data consists of independent, identically-distributed (i.i.d.) samples. We start by considering the standard reinforcement learning framework, in which a learning agent interacts with an environment consisting of a finite Markov decision process (MDP). At each of a sequence of discrete time steps, t = 1, 2, . . ., the environment is in a state st ∈ S, the agent chooses an action at ∈ A, and then the environment emits a reward rt ∈ R, and transitions to its next state st+1 ∈ S. The state and action sets are finite. State transitions are stochastic and dependent on the immediately preceding state and action. Rewards are stochastic and dependent on the preceding state and action, and on the next state. The agent process generating the actions is termed the behavior policy. To start, we assume a deterministic target policy π : S → A. The objective is to learn an approximation to its state-value function: ∞ V π (s) = Eπ γ t−1 rt |s1 = s , (1) t=1 where γ ∈ [0, 1) is the discount rate. The learning is to be done without knowledge of the process dynamics and from observations of a single continuous trajectory with no resets. In many problems of interest the state set is too large for it to be practical to approximate the value of each state individually. Here we consider linear function approximation, in which states are mapped to feature vectors with fewer components than the number of states. That is, for each state s ∈ S there is a corresponding feature vector φ(s) ∈ Rn , with n |S|. The approximation to the value function is then required to be linear in the feature vectors and a corresponding parameter vector θ ∈ Rn : V π (s) ≈ θ φ(s). 
(2) Further, we assume that the states st are not visible to the learning agent in any way other than through the feature vectors. Thus this function approximation formulation can include partialobservability formulations such as POMDPs as a special case. The environment and the behavior policy together generate a stream of states, actions and rewards, s1 , a1 , r1 , s2 , a2 , r2 , . . ., which we can break into causally related 4-tuples, (s1 , a1 , r1 , s1 ), 2 (s2 , a2 , r2 , s2 ), . . . , where st = st+1 . For some tuples, the action will match what the target policy would do in that state, and for others it will not. We can discard all of the latter as not relevant to the target policy. For the former, we can discard the action because it can be determined from the state via the target policy. With a slight abuse of notation, let sk denote the kth state in which an on-policy action was taken, and let rk and sk denote the associated reward and next state. The kth on-policy transition, denoted (sk , rk , sk ), is a triple consisting of the starting state of the transition, the reward on the transition, and the ending state of the transition. The corresponding data available to the learning algorithm is the triple (φ(sk ), rk , φ(sk )). The MDP under the behavior policy is assumed to be ergodic, so that it determines a stationary state-occupancy distribution µ(s) = limk→∞ P r{sk = s}. For any state s, the MDP and target policy together determine an N × N state-transition-probability matrix P , where pss = P r{sk = s |sk = s}, and an N × 1 expected-reward vector R, where Rs = E[rk |sk = s]. These two together completely characterize the statistics of on-policy transitions, and all the samples in the sequence of (φ(sk ), rk , φ(sk )) respect these statistics. The problem still has a Markov structure in that there are temporal dependencies between the sample transitions. In our analysis we first consider a formulation without such dependencies, the i.i.d. case, and then prove that our results extend to the original case. In the i.i.d. formulation, the states sk are generated independently and identically distributed according to an arbitrary probability distribution µ. From each sk , a corresponding sk is generated according to the on-policy state-transition matrix, P , and a corresponding rk is generated according to an arbitrary bounded distribution with expected value Rsk . The final i.i.d. data sequence, from which an approximate value function is to be learned, is then the sequence (φ(sk ), rk , φ(sk )), for k = 1, 2, . . . Further, because each sample is i.i.d., we can remove the indices and talk about a single tuple of random variables (φ, r, φ ) drawn from µ. It remains to define the objective of learning. The TD error for the linear setting is δ = r + γθ φ − θ φ. (3) Given this, we define the one-step linear TD solution as any value of θ at which 0 = E[δφ] = −Aθ + b, (4) where A = E φ(φ − γφ ) and b = E[rφ]. This is the parameter value to which the linear TD(0) algorithm (Sutton 1988) converges under on-policy training, as well as the value found by LSTD(0) (Bradtke & Barto 1996) under both on-policy and off-policy training. The TD solution is always a fixed-point of the linear TD(0) algorithm, but under off-policy training it may not be stable; if θ does not exactly satisfy (4), then the TD(0) algorithm may cause it to move away in expected value and eventually diverge to infinity. 
3 The GTD(0) algorithm We next present the idea and gradient-descent derivation leading to the GTD(0) algorithm. As discussed above, the vector E[δφ] can be viewed as an error in the current solution θ. The vector should be zero, so its norm is a measure of how far we are away from the TD solution. A distinctive feature of our gradient-descent analysis of temporal-difference learning is that we use as our objective function the L2 norm of this vector: J(θ) = E[δφ] E[δφ] . (5) This objective function is quadratic and unimodal; it’s minimum value of 0 is achieved when E[δφ] = 0, which can always be achieved. The gradient of this objective function is θ J(θ) = 2( = 2E φ( θ E[δφ])E[δφ] θ δ) E[δφ] = −2E φ(φ − γφ ) E[δφ] . (6) This last equation is key to our analysis. We would like to take a stochastic gradient-descent approach, in which a small change is made on each sample in such a way that the expected update 3 is the direction opposite to the gradient. This is straightforward if the gradient can be written as a single expected value, but here we have a product of two expected values. One cannot sample both of them because the sample product will be biased by their correlation. However, one could store a long-term, quasi-stationary estimate of either of the expectations and then sample the other. The question is, which expectation should be estimated and stored, and which should be sampled? Both ways seem to lead to interesting learning algorithms. First let us consider the algorithm obtained by forming and storing a separate estimate of the first expectation, that is, of the matrix A = E φ(φ − γφ ) . This matrix is straightforward to estimate from experience as a simple arithmetic average of all previously observed sample outer products φ(φ − γφ ) . Note that A is a stationary statistic in any fixed-policy policy-evaluation problem; it does not depend on θ and would not need to be re-estimated if θ were to change. Let Ak be the estimate of A after observing the first k samples, (φ1 , r1 , φ1 ), . . . , (φk , rk , φk ). Then this algorithm is defined by k 1 Ak = φi (φi − γφi ) (7) k i=1 along with the gradient descent rule: θk+1 = θk + αk Ak δk φk , k ≥ 1, (8) where θ1 is arbitrary, δk = rk + γθk φk − θk φk , and αk > 0 is a series of step-size parameters, possibly decreasing over time. We call this algorithm A TD(0) because it is essentially conventional TD(0) prefixed by an estimate of the matrix A . Although we find this algorithm interesting, we do not consider it further here because it requires O(n2 ) memory and computation per time step. The second path to a stochastic-approximation algorithm for estimating the gradient (6) is to form and store an estimate of the second expectation, the vector E[δφ], and to sample the first expectation, E φ(φ − γφ ) . Let uk denote the estimate of E[δφ] after observing the first k − 1 samples, with u1 = 0. The GTD(0) algorithm is defined by uk+1 = uk + βk (δk φk − uk ) (9) and θk+1 = θk + αk (φk − γφk )φk uk , (10) where θ1 is arbitrary, δk is as in (3) using θk , and αk > 0 and βk > 0 are step-size parameters, possibly decreasing over time. Notice that if the product is formed right-to-left, then the entire computation is O(n) per time step. 4 Convergence The purpose of this section is to establish that GTD(0) converges with probability one to the TD solution in the i.i.d. problem formulation under standard assumptions. In particular, we have the following result: Theorem 4.1 (Convergence of GTD(0)). 
Consider the GTD(0) iteration (9,10) with step-size se∞ ∞ 2 quences αk and βk satisfying βk = ηαk , η > 0, αk , βk ∈ (0, 1], k=0 αk = ∞, k=0 αk < ∞. Further assume that (φk , rk , φk ) is an i.i.d. sequence with uniformly bounded second moments. Let A = E φk (φk − γφk ) and b = E[rk φk ] (note that A and b are well-defined because the distribution of (φk , rk , φk ) does not depend on the sequence index k). Assume that A is non-singular. Then the parameter vector θk converges with probability one to the TD solution (4). Proof. We use the ordinary-differential-equation (ODE) approach (Borkar & Meyn 2000). First, we rewrite the algorithm’s two iterations as a single iteration in a combined parameter vector with √ 2n components ρk = (vk , θk ), where vk = uk / η, and a new reward-related vector with 2n components gk+1 = (rk φk , 0 ): √ ρk+1 = ρk + αk η (Gk+1 ρk + gk+1 ) , where Gk+1 = √ − ηI (φk − γφk )φk 4 φk (γφk − φk ) 0 . Let G = E[Gk ] and g = E[gk ]. Note that G and g are well-defined as by the assumption the process {φk , rk , φk }k is i.i.d. In particular, √ − η I −A b G= , g= . 0 A 0 Further, note that (4) follows from Gρ + g = 0, (11) where ρ = (v , θ ). Now we apply Theorem 2.2 of Borkar & Meyn (2000). For this purpose we write ρk+1 = ρk + √ √ αk η(Gρk +g+(Gk+1 −G)ρk +(gk+1 −g)) = ρk +αk (h(ρk )+Mk+1 ), where αk = αk η, h(ρ) = g + Gρ and Mk+1 = (Gk+1 − G)ρk + gk+1 − g. Let Fk = σ(ρ1 , M1 , . . . , ρk−1 , Mk ). Theorem 2.2 requires the verification of the following conditions: (i) The function h is Lipschitz and h∞ (ρ) = limr→∞ h(rρ)/r is well-defined for every ρ ∈ R2n ; (ii-a) The sequence (Mk , Fk ) is a martingale difference sequence, and (ii-b) for some C0 > 0, E Mk+1 2 | Fk ≤ C0 (1 + ρk 2 ) holds for ∞ any initial parameter vector ρ1 ; (iii) The sequence αk satisfies 0 < αk ≤ 1, k=1 αk = ∞, ∞ 2 ˙ k=1 (αk ) < +∞; and (iv) The ODE ρ = h(ρ) has a globally asymptotically stable equilibrium. Clearly, h(ρ) is Lipschitz with coefficient G and h∞ (ρ) = Gρ. By construction, (Mk , Fk ) satisfies E[Mk+1 |Fk ] = 0 and Mk ∈ Fk , i.e., it is a martingale difference sequence. Condition (ii-b) can be shown to hold by a simple application of the triangle inequality and the boundedness of the the second moments of (φk , rk , φk ). Condition (iii) is satisfied by our conditions on the step-size sequences αk , βk . Finally, the last condition (iv) will follow from the elementary theory of linear differential equations if we can show that the real parts of all the eigenvalues of G are negative. First, let us show that G is non-singular. Using the determinant rule for partitioned matrices1 we get det(G) = det(A A) = 0. This indicates that all the eigenvalues of G are non-zero. Now, let λ ∈ C, λ = 0 be an eigenvalue of G with corresponding normalized eigenvector x ∈ C2n ; 2 that is, x = x∗ x = 1, where x∗ is the complex conjugate of x. Hence x∗ Gx = λ. Let √ 2 x = (x1 , x2 ), where x1 , x2 ∈ Cn . Using the definition of G, λ = x∗ Gx = − η x1 + x∗ Ax2 − x∗ A x1 . Because A is real, A∗ = A , and it follows that (x∗ Ax2 )∗ = x∗ A x1 . Thus, 1 2 1 2 √ 2 Re(λ) = Re(x∗ Gx) = − η x1 ≤ 0. We are now done if we show that x1 cannot be zero. If x1 = 0, then from λ = x∗ Gx we get that λ = 0, which contradicts with λ = 0. The next result concerns the convergence of GTD(0) when (φk , rk , φk ) is obtained by the off-policy sub-sampling process described originally in Section 2. 
We make the following assumption: Assumption A1 The behavior policy πb (generator of the actions at ) selects all actions of the target policy π with positive probability in every state, and the target policy is deterministic. This assumption is needed to ensure that the sub-sampled process sk is well-defined and that the obtained sample is of “high quality”. Under this assumption it holds that sk is again a Markov chain by the strong Markov property of Markov processes (as the times selected when actions correspond to those of the behavior policy form Markov times with respect to the filtration defined by the original process st ). The following theorem shows that the conclusion of the previous result continues to hold in this case: Theorem 4.2 (Convergence of GTD(0) with a sub-sampled process.). Assume A1. Let the parameters θk , uk be updated by (9,10). Further assume that (φk , rk , φk ) is such that E φk 2 |sk−1 , 2 E rk |sk−1 , E φk 2 |sk−1 are uniformly bounded. Assume that the Markov chain (sk ) is aperiodic and irreducible, so that limk→∞ P(sk = s |s0 = s) = µ(s ) exists and is unique. Let s be a state randomly drawn from µ, and let s be a state obtained by following π for one time step in the MDP from s. Further, let r(s, s ) be the reward incurred. Let A = E φ(s)(φ(s) − γφ(s )) and b = E[r(s, s )φ(s)]. Assume that A is non-singular. Then the parameter vector θk converges with probability one to the TD solution (4), provided that s1 ∼ µ. Proof. The proof of Theorem 4.1 goes through without any changes once we observe that G = E[Gk+1 |Fk ] and g = E[gk+1 | Fk ]. 1 R According to this rule, if A ∈ Rn×n , B ∈ Rn×m , C ∈ Rm×n , D ∈ Rm×m then for F = [A B; C D] ∈ , det(F ) = det(A) det(D − CA−1 B). (n+m)×(n+m) 5 The condition that (sk ) is aperiodic and irreducible guarantees the existence of the steady state distribution µ. Further, the aperiodicity and irreducibility of (sk ) follows from the same property of the original process (st ). For further discussion of these conditions cf. Section 6.3 of Bertsekas and Tsitsiklis (1996). With considerable more work the result can be extended to the case when s1 follows an arbitrary distribution. This requires an extension of Theorem 2.2 of Borkar and Meyn (2000) to processes of the form ρk+1 + ρk (h(ρk ) + Mk+1 + ek+1 ), where ek+1 is a fast decaying perturbation (see, e.g., the proof of Proposition 4.8 of Bertsekas and Tsitsiklis (1996)). 5 Extensions to action values, stochastic target policies, and other sample weightings The GTD algorithm extends immediately to the case of off-policy learning of action-value functions. For this assume that a behavior policy πb is followed that samples all actions in every state with positive probability. Let the target policy to be evaluated be π. In this case the basis functions are dependent on both the states and actions: φ : S × A → Rn . The learning equations are unchanged, except that φt and φt are redefined as follows: φt = φ(st , at ), (12) φt = (13) π(st+1 , a)φ(st+1 , a). a (We use time indices t denoting physical time.) Here π(s, a) is the probability of selecting action a in state s under the target policy π. Let us call the resulting algorithm “one-step gradient-based Q-evaluation,” or GQE(0). Theorem 5.1 (Convergence of GQE(0)). Assume that st is a state sequence generated by following some stationary policy πb in a finite MDP. 
Let rt be the corresponding sequence of rewards and let φt , φt be given by the respective equations (12) and (13), and assume that E φt 2 |st−1 , 2 E rt |st−1 , E φt 2 |st−1 are uniformly bounded. Let the parameters θt , ut be updated by Equations (9) and (10). Assume that the Markov chain (st ) is aperiodic and irreducible, so that limt→∞ P(st = s |s0 = s) = µ(s ) exists and is unique. Let s be a state randomly drawn from µ, a be an action chosen by πb in s, let s be the next state obtained and let a = π(s ) be the action chosen by the target policy in state s . Further, let r(s, a, s ) be the reward incurred in this transition. Let A = E φ(s, a)(φ(s, a) − γφ(s , a )) and b = E[r(s, a, s )φ(s, a)]. Assume that A is non-singular. Then the parameter vector θt converges with probability one to a TD solution (4), provided that s1 is selected from the steady-state distribution µ. The proof is almost identical to that of Theorem 4.2, and hence it is omitted. Our main convergence results are also readily generalized to stochastic target policies by replacing the sub-sampling process described in Section 2 with a sample-weighting process. That is, instead of including or excluding transitions depending upon whether the action taken matches a deterministic policy, we include all transitions but give each a weight. For example, we might let the weight wt for time step t be equal to the probability π(st , at ) of taking the action actually taken under the target policy. We can consider the i.i.d. samples now to have four components (φk , rk , φk , wk ), with the update rules (9) and (10) replaced by uk+1 = uk + βk (δk φk − uk )wk , (14) θk+1 = θk + αk (φk − γφk )φk uk wk . (15) and Each sample is also weighted by wk in the expected values, such as that defining the TD solution (4). With these changes, Theorems 4.1 and 4.2 go through immediately for stochastic policies. The reweighting is, in effect, an adjustment to the i.i.d. sampling distribution, µ, and thus our results hold because they hold for all µ. The choice wt = π(st , at ) is only one possibility, notable for its equivalence to our original case if the target policy is deterministic. Another natural weighting is wt = π(st , at )/πb (st , at ), where πb is the behavior policy. This weighting may result in the TD solution (4) better matching the target policy’s value function (1). 6 6 Related work There have been several prior attempts to attain the four desirable algorithmic features mentioned at the beginning this paper (off-policy stability, temporal-difference learning, linear function approximation, and O(n) complexity) but none has been completely successful. One idea for retaining all four desirable features is to use importance sampling techniques to reweight off-policy updates so that they are in the same direction as on-policy updates in expected value (Precup, Sutton & Dasgupta 2001; Precup, Sutton & Singh 2000). Convergence can sometimes then be assured by existing results on the convergence of on-policy methods (Tsitsiklis & Van Roy 1997; Tadic 2001). However, the importance sampling weights are cumulative products of (possibly many) target-to-behavior-policy likelihood ratios, and consequently they and the corresponding updates may be of very high variance. The use of “recognizers” to construct the target policy directly from the behavior policy (Precup, Sutton, Paduraru, Koop & Singh 2006) is one strategy for limiting the variance; another is careful choice of the target policies (see Precup, Sutton & Dasgupta 2001). 
However, it remains the case that for all of such methods to date there are always choices of problem, behavior policy, and target policy for which the variance is infinite, and thus for which there is no guarantee of convergence. Residual gradient algorithms (Baird 1995) have also been proposed as a way of obtaining all four desirable features. These methods can be viewed as gradient descent in the expected squared TD error, E δ 2 ; thus they converge stably to the solution that minimizes this objective for arbitrary differentiable function approximators. However, this solution has always been found to be much inferior to the TD solution (exemplified by (4) for the one-step linear case). In the literature (Baird 1995; Sutton & Barto 1998), it is often claimed that residual-gradient methods are guaranteed to find the TD solution in two special cases: 1) systems with deterministic transitions and 2) systems in which two samples can be drawn for each next state (e.g., for which a simulation model is available). Our own analysis indicates that even these two special requirements are insufficient to guarantee convergence to the TD solution.2 Gordon (1995) and others have questioned the need for linear function approximation. He has proposed replacing linear function approximation with a more restricted class of approximators, known as averagers, that never extrapolate outside the range of the observed data and thus cannot diverge. Rightly or wrongly, averagers have been seen as being too constraining and have not been used on large applications involving online learning. Linear methods, on the other hand, have been widely used (e.g., Baxter, Tridgell & Weaver 1998; Sturtevant & White 2006; Schaeffer, Hlynka & Jussila 2001). The need for linear complexity has also been questioned. Second-order methods for linear approximators, such as LSTD (Bradtke & Barto 1996; Boyan 2002) and LSPI (Lagoudakis & Parr 2003; see also Peters, Vijayakumar & Schaal 2005), can be effective on moderately sized problems. If the number of features in the linear approximator is n, then these methods require memory and per-timestep computation that is O(n2 ). Newer incremental methods such as iLSTD (Geramifard, Bowling & Sutton 2006) have reduced the per-time-complexity to O(n), but are still O(n2 ) in memory. Sparsification methods may reduce the complexity further, they do not help in the general case, and may apply to O(n) methods as well to further reduce their complexity. Linear function approximation is most powerful when very large numbers of features are used, perhaps millions of features (e.g., as in Silver, Sutton & M¨ ller 2007). In such cases, O(n2 ) methods are not feasible. u 7 Conclusion GTD(0) is the first off-policy TD algorithm to converge under general conditions with linear function approximation and linear complexity. As such, it breaks new ground in terms of important, 2 For a counterexample, consider that given in Dayan’s (1992) Figure 2, except now consider that state A is actually two states, A and A’, which share the same feature vector. The two states occur with 50-50 probability, and when one occurs the transition is always deterministically to B followed by the outcome 1, whereas when the other occurs the transition is always deterministically to the outcome 0. 
In this case V (A) and V (B) will converge under the residual-gradient algorithm to the wrong answers, 1/3 and 2/3, even though the system is deterministic, and even if multiple samples are drawn from each state (they will all be the same). 7 absolute abilities not previous available in existing algorithms. We have conducted empirical studies with the GTD(0) algorithm and have confirmed that it converges reliably on standard off-policy counterexamples such as Baird’s (1995) “star” problem. On on-policy problems such as the n-state random walk (Sutton 1988; Sutton & Barto 1998), GTD(0) does not seem to learn as efficiently as classic TD(0), although we are still exploring different ways of setting the step-size parameters, and other variations on the algorithm. It is not clear that the GTD(0) algorithm in its current form will be a fully satisfactory solution to the off-policy learning problem, but it is clear that is breaks new ground and achieves important abilities that were previously unattainable. Acknowledgments The authors gratefully acknowledge insights and assistance they have received from David Silver, Eric Wiewiora, Mark Ring, Michael Bowling, and Alborz Geramifard. This research was supported by iCORE, NSERC and the Alberta Ingenuity Fund. References Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann. Baxter, J., Tridgell, A., Weaver, L. (1998). Experiments in parameter learning using temporal differences. International Computer Chess Association Journal, 21, 84–99. Bertsekas, D. P., Tsitsiklis. J. (1996). Neuro-Dynamic Programming. Athena Scientific, 1996. Borkar, V. S. and Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control And Optimization , 38(2):447–469. Boyan, J. (2002). Technical update: Least-squares temporal difference learning. Machine Learning, 49:233– 246. Bradtke, S., Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57. Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8:341–362. Geramifard, A., Bowling, M., Sutton, R. S. (2006). Incremental least-square temporal difference learning. Proceedings of the National Conference on Artificial Intelligence, pp. 356–361. Gordon, G. J. (1995). Stable function approximation in dynamic programming. Proceedings of the Twelfth International Conference on Machine Learning, pp. 261–268. Morgan Kaufmann, San Francisco. Lagoudakis, M., Parr, R. (2003). Least squares policy iteration. Journal of Machine Learning Research, 4:1107-1149. Peters, J., Vijayakumar, S. and Schaal, S. (2005). Natural Actor-Critic. Proceedings of the 16th European Conference on Machine Learning, pp. 280–291. Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. Proceedings of the 18th International Conference on Machine Learning, pp. 417–424. Precup, D., Sutton, R. S., Paduraru, C., Koop, A., Singh, S. (2006). Off-policy Learning with Recognizers. Advances in Neural Information Processing Systems 18. Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann. Schaeffer, J., Hlynka, M., Jussila, V. (2001). 
Temporal difference learning applied to a high-performance game-playing program. Proceedings of the International Joint Conference on Artificial Intelligence, pp. 529–534.

Silver, D., Sutton, R. S., Müller, M. (2007). Reinforcement learning of local shape in the game of Go. Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 1053–1058.

Sturtevant, N. R., White, A. M. (2006). Feature construction for reinforcement learning in hearts. In Proceedings of the 5th International Conference on Computers and Games.

Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9–44.

Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211.

Sutton, R. S., Rafols, E. J., Koop, A. (2006). Temporal abstraction in temporal-difference networks. Advances in Neural Information Processing Systems 18.

Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42:241–267.

Tsitsiklis, J. N., Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690.

Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.
6 0.094338693 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
7 0.088757679 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
8 0.086717978 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
9 0.083832212 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule
10 0.0714885 144 nips-2008-Multi-resolution Exploration in Continuous Spaces
11 0.069726013 121 nips-2008-Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement
12 0.066330217 223 nips-2008-Structure Learning in Human Sequential Decision-Making
13 0.065962873 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
14 0.063792445 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
15 0.062525243 119 nips-2008-Learning a discriminative hidden part model for human action recognition
16 0.057596084 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor
17 0.056836028 210 nips-2008-Signal-to-Noise Ratio Analysis of Policy Gradient Algorithms
18 0.056793887 166 nips-2008-On the asymptotic equivalence between differential Hebbian and temporal difference learning using a local third factor
19 0.05428277 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization
20 0.052025404 62 nips-2008-Differentiable Sparse Coding
topicId topicWeight
[(0, -0.181), (1, 0.165), (2, -0.005), (3, -0.107), (4, 0.088), (5, 0.059), (6, 0.05), (7, -0.095), (8, -0.058), (9, 0.046), (10, 0.042), (11, -0.01), (12, 0.001), (13, 0.017), (14, 0.021), (15, -0.05), (16, 0.004), (17, -0.009), (18, 0.018), (19, 0.012), (20, -0.007), (21, 0.031), (22, -0.007), (23, -0.016), (24, -0.028), (25, -0.008), (26, -0.016), (27, -0.02), (28, -0.123), (29, 0.009), (30, 0.011), (31, 0.009), (32, 0.049), (33, 0.0), (34, -0.023), (35, -0.081), (36, -0.011), (37, 0.064), (38, 0.027), (39, -0.017), (40, -0.018), (41, -0.06), (42, 0.043), (43, -0.051), (44, 0.056), (45, -0.021), (46, -0.023), (47, 0.024), (48, -0.058), (49, 0.054)]
simIndex simValue paperId paperTitle
same-paper 1 0.93726933 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach
Author: Paul L. Ruvolo, Ian Fasel, Javier R. Movellan
Abstract: Many popular optimization algorithms, like the Levenberg-Marquardt algorithm (LMA), use heuristic-based “controllers” that modulate the behavior of the optimizer during the optimization process. For example, in the LMA a damping parameter λ is dynamically modified based on a set of rules that were developed using heuristic arguments. Reinforcement learning (RL) is a machine learning approach to learn optimal controllers from examples and thus is an obvious candidate to improve the heuristic-based controllers implicit in the most popular and heavily used optimization algorithms. Improving the performance of off-the-shelf optimizers is particularly important for time-constrained optimization problems. For example the LMA algorithm has become popular for many real-time computer vision problems, including object tracking from video, where only a small amount of time can be allocated to the optimizer on each incoming video frame. Here we show that a popular modern reinforcement learning technique using a very simple state space can dramatically improve the performance of general purpose optimizers, like the LMA. Surprisingly the controllers learned for a particular domain also work well in very different optimization domains. For example we used RL methods to train a new controller for the damping parameter of the LMA. This controller was trained on a collection of classic, relatively small, non-linear regression problems. The modified LMA performed better than the standard LMA on these problems. This controller also dramatically outperformed the standard LMA on a difficult computer vision problem for which it had not been trained. Thus the controller appeared to have extracted control rules that were not just domain specific but generalized across a range of optimization domains. 1
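For context on what the learned controller replaces: the following is a minimal sketch of the kind of hand-tuned damping rule traditionally used inside Levenberg-Marquardt, assuming user-supplied residual and Jacobian callbacks; the initial λ, the multiply/divide factor of 10, and the iteration cap are conventional illustrative choices, not values taken from the paper.

import numpy as np

def levenberg_marquardt(residual, jacobian, params, lam=1e-3, n_iters=50):
    """Gauss-Newton with a heuristically damped step (classic LMA-style control of lambda)."""
    params = np.asarray(params, dtype=float)
    for _ in range(n_iters):
        r = residual(params)              # residual vector, shape (m,)
        J = jacobian(params)              # Jacobian, shape (m, n)
        g = J.T @ r                       # gradient of 0.5 * ||r||^2
        H = J.T @ J                       # Gauss-Newton Hessian approximation
        step = np.linalg.solve(H + lam * np.eye(H.shape[0]), -g)
        new_params = params + step
        if np.sum(residual(new_params) ** 2) < np.sum(r ** 2):
            params = new_params           # step helped: accept it and trust the quadratic model more
            lam *= 0.1                    # heuristic rule: decrease damping
        else:
            lam *= 10.0                   # step hurt: reject it and increase damping
    return params

An RL-trained controller of the sort described in the abstract would replace the fixed multiply/divide rule with a learned mapping from a small description of the optimization state to a λ adjustment.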
2 0.76685125 131 nips-2008-MDPs with Non-Deterministic Policies
Author: Mahdi M. Fard, Joelle Pineau
Abstract: Markov Decision Processes (MDPs) have been extensively studied and used in the context of planning and decision-making, and many methods exist to find the optimal policy for problems modelled as MDPs. Although finding the optimal policy is sufficient in many domains, in certain applications such as decision support systems where the policy is executed by a human (rather than a machine), finding all possible near-optimal policies might be useful as it provides more flexibility to the person executing the policy. In this paper we introduce the new concept of non-deterministic MDP policies, and address the question of finding near-optimal non-deterministic policies. We propose two solutions to this problem, one based on a Mixed Integer Program and the other one based on a search algorithm. We include experimental results obtained from applying this framework to optimize treatment choices in the context of a medical decision support system. 1
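As an illustration of what a non-deterministic policy can look like in practice (a naive construction, not the paper's mixed-integer-program or search-based solutions), one can keep, in every state, all actions whose estimated value is within a tolerance of the best action; the Q-table and the tolerance eps below are hypothetical.

import numpy as np

def near_optimal_action_sets(Q, eps=0.05):
    """Map each state to the set of actions whose Q-value is within eps of the best.

    Q: array of shape (n_states, n_actions). Returns a list of action-index lists,
    i.e., a non-deterministic policy that leaves the executing human some flexibility.
    """
    best = Q.max(axis=1, keepdims=True)
    mask = Q >= best - eps
    return [list(np.flatnonzero(row)) for row in mask]

# toy usage
Q = np.array([[1.0, 0.98, 0.2],
              [0.5, 0.9, 0.89]])
print(near_optimal_action_sets(Q, eps=0.05))  # [[0, 1], [1, 2]]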
3 0.75253141 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
Author: Gerhard Neumann, Jan R. Peters
Abstract: Recently, fitted Q-iteration (FQI) based methods have become more popular due to their increased sample efficiency, a more stable learning process and the higher quality of the resulting policy. However, these methods remain hard to use for continuous action spaces which frequently occur in real-world tasks, e.g., in robotics and other technical applications. The greedy action selection commonly used for the policy improvement step is particularly problematic as it is expensive for continuous actions, can cause an unstable learning process, introduces an optimization bias and results in highly non-smooth policies unsuitable for real-world systems. In this paper, we show that by using a soft-greedy action selection the policy improvement step used in FQI can be simplified to an inexpensive advantageweighted regression. With this result, we are able to derive a new, computationally efficient FQI algorithm which can even deal with high dimensional action spaces. 1
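A hedged sketch of the advantage-weighted regression step the abstract refers to, for a batch of continuous-action samples; the exponential weighting with temperature tau and the linear policy fit are illustrative assumptions rather than the authors' exact formulation.

import numpy as np

def advantage_weighted_policy(states, actions, q_values, v_values, tau=1.0):
    """Fit a linear action-mean policy a ~ coef @ phi(s) by advantage-weighted least squares.

    states: (N, d) feature matrix; actions: (N, k); q_values, v_values: (N,) estimates.
    Samples with higher advantage receive exponentially more weight in the regression.
    """
    adv = q_values - v_values
    w = np.exp((adv - adv.max()) / tau)             # soft-greedy weights, max-shifted for stability
    WX = states * w[:, None]
    # weighted least squares: solve (X^T diag(w) X) B = X^T diag(w) A
    coef, *_ = np.linalg.lstsq(WX.T @ states, WX.T @ actions, rcond=None)
    return coef.T                                    # (k, d): action mean = coef @ state_features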
4 0.70789862 181 nips-2008-Policy Search for Motor Primitives in Robotics
Author: Jens Kober, Jan R. Peters
Abstract: Many motor skills in humanoid robotics can be learned using parametrized motor primitives as done in imitation learning. However, most interesting motor learning problems are high-dimensional reinforcement learning problems often beyond the reach of current methods. In this paper, we extend previous work on policy learning from the immediate reward case to episodic reinforcement learning. We show that this results in a general, common framework also connected to policy gradient methods and yielding a novel algorithm for policy learning that is particularly well-suited for dynamic motor primitives. The resulting algorithm is an EM-inspired algorithm applicable to complex motor learning tasks. We compare this algorithm to several well-known parametrized policy search methods and show that it outperforms them. We apply it in the context of motor learning and show that it can learn a complex Ball-in-a-Cup task using a real Barrett WAMTM robot arm. 1
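A rough sketch of the generic reward-weighted, EM-style parameter update used by this family of episodic policy-search methods; the Gaussian parameter exploration and the use of shifted returns as weights are illustrative assumptions, not the exact update derived in the paper.

import numpy as np

def reward_weighted_update(theta, rollout_fn, n_rollouts=20, sigma=0.1, rng=None):
    """One EM-style policy-search step: perturb parameters, weight perturbations by return.

    theta: (d,) motor-primitive parameters; rollout_fn(theta) -> scalar episodic return.
    """
    rng = rng or np.random.default_rng(0)
    eps = rng.normal(scale=sigma, size=(n_rollouts, theta.size))    # exploration noise
    returns = np.array([rollout_fn(theta + e) for e in eps])
    w = returns - returns.min() + 1e-12                             # shift so weights are non-negative
    return theta + (w[:, None] * eps).sum(axis=0) / w.sum()         # reward-weighted mean perturbation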
Author: Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári
Abstract: We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by the LSTD, but without LSTD’s quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods. 1 Off-policy learning methods Off-policy methods have an important role to play in the larger ambitions of modern reinforcement learning. In general, updates to a statistic of a dynamical process are said to be “off-policy” if their distribution does not match the dynamics of the process, particularly if the mismatch is due to the way actions are chosen. The prototypical example in reinforcement learning is the learning of the value function for one policy, the target policy, using data obtained while following another policy, the behavior policy. For example, the popular Q-learning algorithm (Watkins 1989) is an offpolicy temporal-difference algorithm in which the target policy is greedy with respect to estimated action values, and the behavior policy is something more exploratory, such as a corresponding greedy policy. Off-policy methods are also critical to reinforcement-learning-based efforts to model human-level world knowledge and state representations as predictions of option outcomes (e.g., Sutton, Precup & Singh 1999; Sutton, Rafols & Koop 2006). Unfortunately, off-policy methods such as Q-learning are not sound when used with approximations that are linear in the learned parameters—the most popular form of function approximation in reinforcement learning. Counterexamples have been known for many years (e.g., Baird 1995) in which Q-learning’s parameters diverge to infinity for any positive step size. This is a severe problem in so far as function approximation is widely viewed as necessary for large-scale applications of reinforcement learning. The need is so great that practitioners have often simply ignored the problem and continued to use Q-learning with linear function approximation anyway. Although no instances ∗ Csaba Szepesv´ ri is on leave from MTA SZTAKI. a 1 of absolute divergence in applications have been reported in the literature, the potential for instability is disturbing and probably belies real but less obvious problems. The stability problem is not specific to reinforcement learning. Classical dynamic programming methods such as value and policy iteration are also off-policy methods and also diverge on some problems when used with linear function approximation. Reinforcement learning methods are actually an improvement over conventional dynamic programming methods in that at least they can be used stably with linear function approximation in their on-policy form. 
The stability problem is also not due to the interaction of control and prediction, or to stochastic approximation effects; the simplest counterexamples are for deterministic, expected-value-style, synchronous policy evaluation (see Baird 1995; Sutton & Barto 1998). Prior to the current work, the possibility of instability could not be avoided whenever four individually desirable algorithmic features were combined: 1) off-policy updates, 2) temporal-difference learning, 3) linear function approximation, and 4) linear complexity in memory and per-time-step computation. If any one of these four is abandoned, then stable methods can be obtained relatively easily. But each feature brings value and practitioners are loath to give any of them up, as we discuss later in a penultimate related-work section. In this paper we present the first algorithm to achieve all four desirable features and be stable and convergent for all finite Markov decision processes, all target and behavior policies, and all feature representations for the linear approximator. Moreover, our algorithm does not use importance sampling and can be expected to be much better conditioned and of lower variance than importance sampling methods. Our algorithm can be viewed as performing stochastic gradient-descent in a novel objective function whose optimum is the least-squares TD solution. Our algorithm is also incremental and suitable for online use just as are simple temporaldifference learning algorithms such as Q-learning and TD(λ) (Sutton 1988). Our algorithm can be broadly characterized as a gradient-descent version of TD(0), and accordingly we call it GTD(0). 2 Sub-sampling and i.i.d. formulations of temporal-difference learning In this section we formulate the off-policy policy-evaluation problem for one-step temporaldifference learning such that the data consists of independent, identically-distributed (i.i.d.) samples. We start by considering the standard reinforcement learning framework, in which a learning agent interacts with an environment consisting of a finite Markov decision process (MDP). At each of a sequence of discrete time steps, t = 1, 2, . . ., the environment is in a state st ∈ S, the agent chooses an action at ∈ A, and then the environment emits a reward rt ∈ R, and transitions to its next state st+1 ∈ S. The state and action sets are finite. State transitions are stochastic and dependent on the immediately preceding state and action. Rewards are stochastic and dependent on the preceding state and action, and on the next state. The agent process generating the actions is termed the behavior policy. To start, we assume a deterministic target policy π : S → A. The objective is to learn an approximation to its state-value function: ∞ V π (s) = Eπ γ t−1 rt |s1 = s , (1) t=1 where γ ∈ [0, 1) is the discount rate. The learning is to be done without knowledge of the process dynamics and from observations of a single continuous trajectory with no resets. In many problems of interest the state set is too large for it to be practical to approximate the value of each state individually. Here we consider linear function approximation, in which states are mapped to feature vectors with fewer components than the number of states. That is, for each state s ∈ S there is a corresponding feature vector φ(s) ∈ Rn , with n |S|. The approximation to the value function is then required to be linear in the feature vectors and a corresponding parameter vector θ ∈ Rn : V π (s) ≈ θ φ(s). 
(2) Further, we assume that the states st are not visible to the learning agent in any way other than through the feature vectors. Thus this function approximation formulation can include partialobservability formulations such as POMDPs as a special case. The environment and the behavior policy together generate a stream of states, actions and rewards, s1 , a1 , r1 , s2 , a2 , r2 , . . ., which we can break into causally related 4-tuples, (s1 , a1 , r1 , s1 ), 2 (s2 , a2 , r2 , s2 ), . . . , where st = st+1 . For some tuples, the action will match what the target policy would do in that state, and for others it will not. We can discard all of the latter as not relevant to the target policy. For the former, we can discard the action because it can be determined from the state via the target policy. With a slight abuse of notation, let sk denote the kth state in which an on-policy action was taken, and let rk and sk denote the associated reward and next state. The kth on-policy transition, denoted (sk , rk , sk ), is a triple consisting of the starting state of the transition, the reward on the transition, and the ending state of the transition. The corresponding data available to the learning algorithm is the triple (φ(sk ), rk , φ(sk )). The MDP under the behavior policy is assumed to be ergodic, so that it determines a stationary state-occupancy distribution µ(s) = limk→∞ P r{sk = s}. For any state s, the MDP and target policy together determine an N × N state-transition-probability matrix P , where pss = P r{sk = s |sk = s}, and an N × 1 expected-reward vector R, where Rs = E[rk |sk = s]. These two together completely characterize the statistics of on-policy transitions, and all the samples in the sequence of (φ(sk ), rk , φ(sk )) respect these statistics. The problem still has a Markov structure in that there are temporal dependencies between the sample transitions. In our analysis we first consider a formulation without such dependencies, the i.i.d. case, and then prove that our results extend to the original case. In the i.i.d. formulation, the states sk are generated independently and identically distributed according to an arbitrary probability distribution µ. From each sk , a corresponding sk is generated according to the on-policy state-transition matrix, P , and a corresponding rk is generated according to an arbitrary bounded distribution with expected value Rsk . The final i.i.d. data sequence, from which an approximate value function is to be learned, is then the sequence (φ(sk ), rk , φ(sk )), for k = 1, 2, . . . Further, because each sample is i.i.d., we can remove the indices and talk about a single tuple of random variables (φ, r, φ ) drawn from µ. It remains to define the objective of learning. The TD error for the linear setting is δ = r + γθ φ − θ φ. (3) Given this, we define the one-step linear TD solution as any value of θ at which 0 = E[δφ] = −Aθ + b, (4) where A = E φ(φ − γφ ) and b = E[rφ]. This is the parameter value to which the linear TD(0) algorithm (Sutton 1988) converges under on-policy training, as well as the value found by LSTD(0) (Bradtke & Barto 1996) under both on-policy and off-policy training. The TD solution is always a fixed-point of the linear TD(0) algorithm, but under off-policy training it may not be stable; if θ does not exactly satisfy (4), then the TD(0) algorithm may cause it to move away in expected value and eventually diverge to infinity. 
3 The GTD(0) algorithm We next present the idea and gradient-descent derivation leading to the GTD(0) algorithm. As discussed above, the vector E[δφ] can be viewed as an error in the current solution θ. The vector should be zero, so its norm is a measure of how far we are away from the TD solution. A distinctive feature of our gradient-descent analysis of temporal-difference learning is that we use as our objective function the L2 norm of this vector: J(θ) = E[δφ] E[δφ] . (5) This objective function is quadratic and unimodal; it’s minimum value of 0 is achieved when E[δφ] = 0, which can always be achieved. The gradient of this objective function is θ J(θ) = 2( = 2E φ( θ E[δφ])E[δφ] θ δ) E[δφ] = −2E φ(φ − γφ ) E[δφ] . (6) This last equation is key to our analysis. We would like to take a stochastic gradient-descent approach, in which a small change is made on each sample in such a way that the expected update 3 is the direction opposite to the gradient. This is straightforward if the gradient can be written as a single expected value, but here we have a product of two expected values. One cannot sample both of them because the sample product will be biased by their correlation. However, one could store a long-term, quasi-stationary estimate of either of the expectations and then sample the other. The question is, which expectation should be estimated and stored, and which should be sampled? Both ways seem to lead to interesting learning algorithms. First let us consider the algorithm obtained by forming and storing a separate estimate of the first expectation, that is, of the matrix A = E φ(φ − γφ ) . This matrix is straightforward to estimate from experience as a simple arithmetic average of all previously observed sample outer products φ(φ − γφ ) . Note that A is a stationary statistic in any fixed-policy policy-evaluation problem; it does not depend on θ and would not need to be re-estimated if θ were to change. Let Ak be the estimate of A after observing the first k samples, (φ1 , r1 , φ1 ), . . . , (φk , rk , φk ). Then this algorithm is defined by k 1 Ak = φi (φi − γφi ) (7) k i=1 along with the gradient descent rule: θk+1 = θk + αk Ak δk φk , k ≥ 1, (8) where θ1 is arbitrary, δk = rk + γθk φk − θk φk , and αk > 0 is a series of step-size parameters, possibly decreasing over time. We call this algorithm A TD(0) because it is essentially conventional TD(0) prefixed by an estimate of the matrix A . Although we find this algorithm interesting, we do not consider it further here because it requires O(n2 ) memory and computation per time step. The second path to a stochastic-approximation algorithm for estimating the gradient (6) is to form and store an estimate of the second expectation, the vector E[δφ], and to sample the first expectation, E φ(φ − γφ ) . Let uk denote the estimate of E[δφ] after observing the first k − 1 samples, with u1 = 0. The GTD(0) algorithm is defined by uk+1 = uk + βk (δk φk − uk ) (9) and θk+1 = θk + αk (φk − γφk )φk uk , (10) where θ1 is arbitrary, δk is as in (3) using θk , and αk > 0 and βk > 0 are step-size parameters, possibly decreasing over time. Notice that if the product is formed right-to-left, then the entire computation is O(n) per time step. 4 Convergence The purpose of this section is to establish that GTD(0) converges with probability one to the TD solution in the i.i.d. problem formulation under standard assumptions. In particular, we have the following result: Theorem 4.1 (Convergence of GTD(0)). 
Consider the GTD(0) iteration (9,10) with step-size se∞ ∞ 2 quences αk and βk satisfying βk = ηαk , η > 0, αk , βk ∈ (0, 1], k=0 αk = ∞, k=0 αk < ∞. Further assume that (φk , rk , φk ) is an i.i.d. sequence with uniformly bounded second moments. Let A = E φk (φk − γφk ) and b = E[rk φk ] (note that A and b are well-defined because the distribution of (φk , rk , φk ) does not depend on the sequence index k). Assume that A is non-singular. Then the parameter vector θk converges with probability one to the TD solution (4). Proof. We use the ordinary-differential-equation (ODE) approach (Borkar & Meyn 2000). First, we rewrite the algorithm’s two iterations as a single iteration in a combined parameter vector with √ 2n components ρk = (vk , θk ), where vk = uk / η, and a new reward-related vector with 2n components gk+1 = (rk φk , 0 ): √ ρk+1 = ρk + αk η (Gk+1 ρk + gk+1 ) , where Gk+1 = √ − ηI (φk − γφk )φk 4 φk (γφk − φk ) 0 . Let G = E[Gk ] and g = E[gk ]. Note that G and g are well-defined as by the assumption the process {φk , rk , φk }k is i.i.d. In particular, √ − η I −A b G= , g= . 0 A 0 Further, note that (4) follows from Gρ + g = 0, (11) where ρ = (v , θ ). Now we apply Theorem 2.2 of Borkar & Meyn (2000). For this purpose we write ρk+1 = ρk + √ √ αk η(Gρk +g+(Gk+1 −G)ρk +(gk+1 −g)) = ρk +αk (h(ρk )+Mk+1 ), where αk = αk η, h(ρ) = g + Gρ and Mk+1 = (Gk+1 − G)ρk + gk+1 − g. Let Fk = σ(ρ1 , M1 , . . . , ρk−1 , Mk ). Theorem 2.2 requires the verification of the following conditions: (i) The function h is Lipschitz and h∞ (ρ) = limr→∞ h(rρ)/r is well-defined for every ρ ∈ R2n ; (ii-a) The sequence (Mk , Fk ) is a martingale difference sequence, and (ii-b) for some C0 > 0, E Mk+1 2 | Fk ≤ C0 (1 + ρk 2 ) holds for ∞ any initial parameter vector ρ1 ; (iii) The sequence αk satisfies 0 < αk ≤ 1, k=1 αk = ∞, ∞ 2 ˙ k=1 (αk ) < +∞; and (iv) The ODE ρ = h(ρ) has a globally asymptotically stable equilibrium. Clearly, h(ρ) is Lipschitz with coefficient G and h∞ (ρ) = Gρ. By construction, (Mk , Fk ) satisfies E[Mk+1 |Fk ] = 0 and Mk ∈ Fk , i.e., it is a martingale difference sequence. Condition (ii-b) can be shown to hold by a simple application of the triangle inequality and the boundedness of the the second moments of (φk , rk , φk ). Condition (iii) is satisfied by our conditions on the step-size sequences αk , βk . Finally, the last condition (iv) will follow from the elementary theory of linear differential equations if we can show that the real parts of all the eigenvalues of G are negative. First, let us show that G is non-singular. Using the determinant rule for partitioned matrices1 we get det(G) = det(A A) = 0. This indicates that all the eigenvalues of G are non-zero. Now, let λ ∈ C, λ = 0 be an eigenvalue of G with corresponding normalized eigenvector x ∈ C2n ; 2 that is, x = x∗ x = 1, where x∗ is the complex conjugate of x. Hence x∗ Gx = λ. Let √ 2 x = (x1 , x2 ), where x1 , x2 ∈ Cn . Using the definition of G, λ = x∗ Gx = − η x1 + x∗ Ax2 − x∗ A x1 . Because A is real, A∗ = A , and it follows that (x∗ Ax2 )∗ = x∗ A x1 . Thus, 1 2 1 2 √ 2 Re(λ) = Re(x∗ Gx) = − η x1 ≤ 0. We are now done if we show that x1 cannot be zero. If x1 = 0, then from λ = x∗ Gx we get that λ = 0, which contradicts with λ = 0. The next result concerns the convergence of GTD(0) when (φk , rk , φk ) is obtained by the off-policy sub-sampling process described originally in Section 2. 
We make the following assumption: Assumption A1 The behavior policy πb (generator of the actions at ) selects all actions of the target policy π with positive probability in every state, and the target policy is deterministic. This assumption is needed to ensure that the sub-sampled process sk is well-defined and that the obtained sample is of “high quality”. Under this assumption it holds that sk is again a Markov chain by the strong Markov property of Markov processes (as the times selected when actions correspond to those of the behavior policy form Markov times with respect to the filtration defined by the original process st ). The following theorem shows that the conclusion of the previous result continues to hold in this case: Theorem 4.2 (Convergence of GTD(0) with a sub-sampled process.). Assume A1. Let the parameters θk , uk be updated by (9,10). Further assume that (φk , rk , φk ) is such that E φk 2 |sk−1 , 2 E rk |sk−1 , E φk 2 |sk−1 are uniformly bounded. Assume that the Markov chain (sk ) is aperiodic and irreducible, so that limk→∞ P(sk = s |s0 = s) = µ(s ) exists and is unique. Let s be a state randomly drawn from µ, and let s be a state obtained by following π for one time step in the MDP from s. Further, let r(s, s ) be the reward incurred. Let A = E φ(s)(φ(s) − γφ(s )) and b = E[r(s, s )φ(s)]. Assume that A is non-singular. Then the parameter vector θk converges with probability one to the TD solution (4), provided that s1 ∼ µ. Proof. The proof of Theorem 4.1 goes through without any changes once we observe that G = E[Gk+1 |Fk ] and g = E[gk+1 | Fk ]. 1 R According to this rule, if A ∈ Rn×n , B ∈ Rn×m , C ∈ Rm×n , D ∈ Rm×m then for F = [A B; C D] ∈ , det(F ) = det(A) det(D − CA−1 B). (n+m)×(n+m) 5 The condition that (sk ) is aperiodic and irreducible guarantees the existence of the steady state distribution µ. Further, the aperiodicity and irreducibility of (sk ) follows from the same property of the original process (st ). For further discussion of these conditions cf. Section 6.3 of Bertsekas and Tsitsiklis (1996). With considerable more work the result can be extended to the case when s1 follows an arbitrary distribution. This requires an extension of Theorem 2.2 of Borkar and Meyn (2000) to processes of the form ρk+1 + ρk (h(ρk ) + Mk+1 + ek+1 ), where ek+1 is a fast decaying perturbation (see, e.g., the proof of Proposition 4.8 of Bertsekas and Tsitsiklis (1996)). 5 Extensions to action values, stochastic target policies, and other sample weightings The GTD algorithm extends immediately to the case of off-policy learning of action-value functions. For this assume that a behavior policy πb is followed that samples all actions in every state with positive probability. Let the target policy to be evaluated be π. In this case the basis functions are dependent on both the states and actions: φ : S × A → Rn . The learning equations are unchanged, except that φt and φt are redefined as follows: φt = φ(st , at ), (12) φt = (13) π(st+1 , a)φ(st+1 , a). a (We use time indices t denoting physical time.) Here π(s, a) is the probability of selecting action a in state s under the target policy π. Let us call the resulting algorithm “one-step gradient-based Q-evaluation,” or GQE(0). Theorem 5.1 (Convergence of GQE(0)). Assume that st is a state sequence generated by following some stationary policy πb in a finite MDP. 
Let rt be the corresponding sequence of rewards and let φt , φt be given by the respective equations (12) and (13), and assume that E φt 2 |st−1 , 2 E rt |st−1 , E φt 2 |st−1 are uniformly bounded. Let the parameters θt , ut be updated by Equations (9) and (10). Assume that the Markov chain (st ) is aperiodic and irreducible, so that limt→∞ P(st = s |s0 = s) = µ(s ) exists and is unique. Let s be a state randomly drawn from µ, a be an action chosen by πb in s, let s be the next state obtained and let a = π(s ) be the action chosen by the target policy in state s . Further, let r(s, a, s ) be the reward incurred in this transition. Let A = E φ(s, a)(φ(s, a) − γφ(s , a )) and b = E[r(s, a, s )φ(s, a)]. Assume that A is non-singular. Then the parameter vector θt converges with probability one to a TD solution (4), provided that s1 is selected from the steady-state distribution µ. The proof is almost identical to that of Theorem 4.2, and hence it is omitted. Our main convergence results are also readily generalized to stochastic target policies by replacing the sub-sampling process described in Section 2 with a sample-weighting process. That is, instead of including or excluding transitions depending upon whether the action taken matches a deterministic policy, we include all transitions but give each a weight. For example, we might let the weight wt for time step t be equal to the probability π(st , at ) of taking the action actually taken under the target policy. We can consider the i.i.d. samples now to have four components (φk , rk , φk , wk ), with the update rules (9) and (10) replaced by uk+1 = uk + βk (δk φk − uk )wk , (14) θk+1 = θk + αk (φk − γφk )φk uk wk . (15) and Each sample is also weighted by wk in the expected values, such as that defining the TD solution (4). With these changes, Theorems 4.1 and 4.2 go through immediately for stochastic policies. The reweighting is, in effect, an adjustment to the i.i.d. sampling distribution, µ, and thus our results hold because they hold for all µ. The choice wt = π(st , at ) is only one possibility, notable for its equivalence to our original case if the target policy is deterministic. Another natural weighting is wt = π(st , at )/πb (st , at ), where πb is the behavior policy. This weighting may result in the TD solution (4) better matching the target policy’s value function (1). 6 6 Related work There have been several prior attempts to attain the four desirable algorithmic features mentioned at the beginning this paper (off-policy stability, temporal-difference learning, linear function approximation, and O(n) complexity) but none has been completely successful. One idea for retaining all four desirable features is to use importance sampling techniques to reweight off-policy updates so that they are in the same direction as on-policy updates in expected value (Precup, Sutton & Dasgupta 2001; Precup, Sutton & Singh 2000). Convergence can sometimes then be assured by existing results on the convergence of on-policy methods (Tsitsiklis & Van Roy 1997; Tadic 2001). However, the importance sampling weights are cumulative products of (possibly many) target-to-behavior-policy likelihood ratios, and consequently they and the corresponding updates may be of very high variance. The use of “recognizers” to construct the target policy directly from the behavior policy (Precup, Sutton, Paduraru, Koop & Singh 2006) is one strategy for limiting the variance; another is careful choice of the target policies (see Precup, Sutton & Dasgupta 2001). 
However, it remains the case that for all such methods to date there are always choices of problem, behavior policy, and target policy for which the variance is infinite, and thus for which there is no guarantee of convergence. Residual-gradient algorithms (Baird 1995) have also been proposed as a way of obtaining all four desirable features. These methods can be viewed as gradient descent in the expected squared TD error, E[δ²]; thus they converge stably to the solution that minimizes this objective for arbitrary differentiable function approximators. However, this solution has always been found to be much inferior to the TD solution (exemplified by (4) for the one-step linear case). In the literature (Baird 1995; Sutton & Barto 1998), it is often claimed that residual-gradient methods are guaranteed to find the TD solution in two special cases: 1) systems with deterministic transitions and 2) systems in which two samples can be drawn for each next state (e.g., for which a simulation model is available). Our own analysis indicates that even these two special requirements are insufficient to guarantee convergence to the TD solution.²

Gordon (1995) and others have questioned the need for linear function approximation. He has proposed replacing linear function approximation with a more restricted class of approximators, known as averagers, that never extrapolate outside the range of the observed data and thus cannot diverge. Rightly or wrongly, averagers have been seen as too constraining and have not been used in large applications involving online learning. Linear methods, on the other hand, have been widely used (e.g., Baxter, Tridgell & Weaver 1998; Sturtevant & White 2006; Schaeffer, Hlynka & Jussila 2001).

The need for linear complexity has also been questioned. Second-order methods for linear approximators, such as LSTD (Bradtke & Barto 1996; Boyan 2002) and LSPI (Lagoudakis & Parr 2003; see also Peters, Vijayakumar & Schaal 2005), can be effective on moderately sized problems. If the number of features in the linear approximator is n, then these methods require memory and per-time-step computation that is O(n²). Newer incremental methods such as iLSTD (Geramifard, Bowling & Sutton 2006) have reduced the per-time-step computation to O(n), but they remain O(n²) in memory. Sparsification methods may reduce the complexity further, but they do not help in the general case, and they can be applied to O(n) methods as well to reduce their complexity even further. Linear function approximation is most powerful when very large numbers of features are used, perhaps millions of features (e.g., as in Silver, Sutton & Müller 2007). In such cases, O(n²) methods are not feasible.

7 Conclusion

GTD(0) is the first off-policy TD algorithm to converge under general conditions with linear function approximation and linear complexity. As such, it breaks new ground in terms of important, absolute abilities not previously available in existing algorithms. We have conducted empirical studies with the GTD(0) algorithm and have confirmed that it converges reliably on standard off-policy counterexamples such as Baird's (1995) "star" problem. On on-policy problems such as the n-state random walk (Sutton 1988; Sutton & Barto 1998), GTD(0) does not seem to learn as efficiently as classic TD(0), although we are still exploring different ways of setting the step-size parameters, and other variations on the algorithm. It is not clear that the GTD(0) algorithm in its current form will be a fully satisfactory solution to the off-policy learning problem, but it is clear that it breaks new ground and achieves important abilities that were previously unattainable.

² For a counterexample, consider that given in Dayan's (1992) Figure 2, except now consider that state A is actually two states, A and A', which share the same feature vector. The two states occur with 50-50 probability, and when one occurs the transition is always deterministically to B followed by the outcome 1, whereas when the other occurs the transition is always deterministically to the outcome 0. In this case V(A) and V(B) will converge under the residual-gradient algorithm to the wrong answers, 1/3 and 2/3, even though the system is deterministic, and even if multiple samples are drawn from each state (they will all be the same).

Acknowledgments

The authors gratefully acknowledge insights and assistance they have received from David Silver, Eric Wiewiora, Mark Ring, Michael Bowling, and Alborz Geramifard. This research was supported by iCORE, NSERC and the Alberta Ingenuity Fund.

References

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann.

Baxter, J., Tridgell, A., Weaver, L. (1998). Experiments in parameter learning using temporal differences. International Computer Chess Association Journal, 21, 84–99.

Bertsekas, D. P., Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.

Borkar, V. S., Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469.

Boyan, J. (2002). Technical update: Least-squares temporal difference learning. Machine Learning, 49:233–246.

Bradtke, S., Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57.

Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8:341–362.

Geramifard, A., Bowling, M., Sutton, R. S. (2006). Incremental least-squares temporal difference learning. Proceedings of the National Conference on Artificial Intelligence, pp. 356–361.

Gordon, G. J. (1995). Stable function approximation in dynamic programming. Proceedings of the Twelfth International Conference on Machine Learning, pp. 261–268. Morgan Kaufmann, San Francisco.

Lagoudakis, M., Parr, R. (2003). Least squares policy iteration. Journal of Machine Learning Research, 4:1107–1149.

Peters, J., Vijayakumar, S., Schaal, S. (2005). Natural Actor-Critic. Proceedings of the 16th European Conference on Machine Learning, pp. 280–291.

Precup, D., Sutton, R. S., Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. Proceedings of the 18th International Conference on Machine Learning, pp. 417–424.

Precup, D., Sutton, R. S., Paduraru, C., Koop, A., Singh, S. (2006). Off-policy learning with recognizers. Advances in Neural Information Processing Systems 18.

Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann.

Schaeffer, J., Hlynka, M., Jussila, V. (2001).
Temporal difference learning applied to a high-performance game-playing program. Proceedings of the International Joint Conference on Artificial Intelligence, pp. 529–534.

Silver, D., Sutton, R. S., Müller, M. (2007). Reinforcement learning of local shape in the game of Go. Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 1053–1058.

Sturtevant, N. R., White, A. M. (2006). Feature construction for reinforcement learning in hearts. In Proceedings of the 5th International Conference on Computers and Games.

Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9–44.

Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211.

Sutton, R. S., Rafols, E. J., Koop, A. (2006). Temporal abstraction in temporal-difference networks. Advances in Neural Information Processing Systems 18.

Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42:241–267.

Tsitsiklis, J. N., Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690.

Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.
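The GTD(0) updates (9) and (10) quoted in the paper text above translate directly into an O(n) per-sample procedure. The following is a minimal sketch, assuming the off-policy data has already been sub-sampled into (φ, r, φ′) triples and that the step sizes α and β are chosen by the caller; it is an illustration, not the authors' reference implementation.

import numpy as np

class GTD0:
    """Gradient temporal-difference learning, GTD(0): equations (9)-(10) above."""

    def __init__(self, n_features, alpha=0.01, beta=0.05, gamma=0.99):
        self.theta = np.zeros(n_features)   # value-function weights
        self.u = np.zeros(n_features)       # running estimate of E[delta * phi]
        self.alpha, self.beta, self.gamma = alpha, beta, gamma

    def update(self, phi, reward, phi_next):
        """One O(n) update from an off-policy sample (phi, r, phi')."""
        delta = reward + self.gamma * self.theta @ phi_next - self.theta @ phi    # TD error (3)
        self.u += self.beta * (delta * phi - self.u)                              # eq. (9)
        self.theta += self.alpha * (phi - self.gamma * phi_next) * (phi @ self.u) # eq. (10)
        return delta

    def value(self, phi):
        return self.theta @ phi

Forming the scalar product of φ with u before scaling the other factor keeps every update linear in the number of features, which is the property the algorithm is built around.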
6 0.68268836 195 nips-2008-Regularized Policy Iteration
7 0.66117555 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
8 0.65647709 144 nips-2008-Multi-resolution Exploration in Continuous Spaces
9 0.62220037 94 nips-2008-Goal-directed decision making in prefrontal cortex: a computational framework
10 0.62052143 39 nips-2008-Bounding Performance Loss in Approximate MDP Homomorphisms
11 0.54883111 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor
12 0.50780994 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule
13 0.5054583 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
14 0.46731591 212 nips-2008-Skill Characterization Based on Betweenness
15 0.46606338 61 nips-2008-Diffeomorphic Dimensionality Reduction
16 0.46538261 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
17 0.45033541 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
18 0.44986919 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
19 0.43653828 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization
20 0.41566709 13 nips-2008-Adapting to a Market Shock: Optimal Sequential Market-Making
topicId topicWeight
[(4, 0.031), (6, 0.066), (7, 0.061), (12, 0.043), (16, 0.26), (28, 0.171), (57, 0.075), (59, 0.02), (63, 0.039), (71, 0.018), (77, 0.067), (83, 0.069)]
simIndex simValue paperId paperTitle
1 0.83928019 36 nips-2008-Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree
Author: Daphna Weinshall, Hynek Hermansky, Alon Zweig, Jie Luo, Holly Jimison, Frank Ohl, Misha Pavel
Abstract: Unexpected stimuli are a challenge to any machine learning algorithm. Here we identify distinct types of unexpected events, focusing on ’incongruent events’ when ’general level’ and ’specific level’ classifiers give conflicting predictions. We define a formal framework for the representation and processing of incongruent events: starting from the notion of label hierarchy, we show how partial order on labels can be deduced from such hierarchies. For each event, we compute its probability in different ways, based on adjacent levels (according to the partial order) in the label hierarchy. An incongruent event is an event where the probability computed based on some more specific level (in accordance with the partial order) is much smaller than the probability computed based on some more general level, leading to conflicting predictions. We derive algorithms to detect incongruent events from different types of hierarchies, corresponding to class membership or part membership. Respectively, we show promising results with real data on two specific problems: Out Of Vocabulary words in speech recognition, and the identification of a new sub-class (e.g., the face of a new individual) in audio-visual facial object recognition.
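A hedged sketch of the core test described in the abstract: an observation is flagged as incongruent when a general-level classifier assigns it high probability while the best specific-level classifier does not. The acceptance threshold and probability ratio below are illustrative assumptions.

def is_incongruent(p_general, p_specific_max, accept=0.5, ratio=10.0):
    """Flag an observation whose general-level probability is high but whose best
    specific-level probability is much lower (e.g., 'a face' but no known individual)."""
    return p_general >= accept and p_general / max(p_specific_max, 1e-12) >= ratio

# toy usage: the general detector is confident, but no specific model explains the input
print(is_incongruent(p_general=0.9, p_specific_max=0.02))   # True  -> likely a novel sub-class
print(is_incongruent(p_general=0.9, p_specific_max=0.6))    # False -> explained by a known class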
same-paper 2 0.79220688 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach
Author: Paul L. Ruvolo, Ian Fasel, Javier R. Movellan
Abstract: Many popular optimization algorithms, like the Levenberg-Marquardt algorithm (LMA), use heuristic-based “controllers” that modulate the behavior of the optimizer during the optimization process. For example, in the LMA a damping parameter λ is dynamically modified based on a set of rules that were developed using heuristic arguments. Reinforcement learning (RL) is a machine learning approach to learn optimal controllers from examples and thus is an obvious candidate to improve the heuristic-based controllers implicit in the most popular and heavily used optimization algorithms. Improving the performance of off-the-shelf optimizers is particularly important for time-constrained optimization problems. For example the LMA algorithm has become popular for many real-time computer vision problems, including object tracking from video, where only a small amount of time can be allocated to the optimizer on each incoming video frame. Here we show that a popular modern reinforcement learning technique using a very simple state space can dramatically improve the performance of general purpose optimizers, like the LMA. Surprisingly the controllers learned for a particular domain also work well in very different optimization domains. For example we used RL methods to train a new controller for the damping parameter of the LMA. This controller was trained on a collection of classic, relatively small, non-linear regression problems. The modified LMA performed better than the standard LMA on these problems. This controller also dramatically outperformed the standard LMA on a difficult computer vision problem for which it had not been trained. Thus the controller appeared to have extracted control rules that were not just domain specific but generalized across a range of optimization domains. 1
3 0.78517264 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
Author: Tong Zhang
Abstract: We study learning formulations with non-convex regularizaton that are natural for sparse linear models. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to sub-optimal sparsity in reality. This paper tries to remedy the above gap between theory and practice. In particular, we investigate a multi-stage convex relaxation scheme for solving problems with non-convex regularization. Theoretically, we analyze the behavior of a resulting two-stage relaxation scheme for the capped-L1 regularization. Our performance bound shows that the procedure is superior to the standard L1 convex relaxation for learning sparse targets. Experiments confirm the effectiveness of this method on some simulation and real data. 1
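A sketch of the two-stage idea for capped-L1 discussed in the abstract, written with a simple proximal-gradient (ISTA) solver so that per-coefficient penalties are easy to express; the step size, the cap threshold alpha_cap, and the iteration counts are illustrative assumptions.

import numpy as np

def weighted_lasso_ista(X, y, penalties, lr=None, n_iters=500):
    """Minimize 0.5*||y - Xw||^2 + sum_j penalties[j]*|w_j| by proximal gradient (ISTA)."""
    n, d = X.shape
    lr = lr or 1.0 / np.linalg.norm(X, 2) ** 2       # 1/L, with L the squared top singular value
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)
        z = w - lr * grad
        w = np.sign(z) * np.maximum(np.abs(z) - lr * penalties, 0.0)   # soft-thresholding
    return w

def two_stage_capped_l1(X, y, lam=0.1, alpha_cap=0.05):
    """Stage 1: ordinary Lasso. Stage 2: re-solve with the penalty removed on coefficients
    that stage 1 already found to be large (|w_j| >= alpha_cap), i.e., the convex relaxation
    of a capped-L1 regularizer at the stage-1 solution."""
    d = X.shape[1]
    w1 = weighted_lasso_ista(X, y, penalties=np.full(d, lam))
    penalties = np.where(np.abs(w1) >= alpha_cap, 0.0, lam)
    return weighted_lasso_ista(X, y, penalties=penalties)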
4 0.6919837 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
Author: Tong Zhang
Abstract: Consider linear prediction models where the target function is a sparse linear combination of a set of basis functions. We are interested in the problem of identifying those basis functions with non-zero coefficients and reconstructing the target function from noisy observations. Two heuristics that are widely used in practice are forward and backward greedy algorithms. First, we show that neither idea is adequate. Second, we propose a novel combination that is based on the forward greedy algorithm but takes backward steps adaptively whenever beneficial. We prove strong theoretical results showing that this procedure is effective in learning sparse representations. Experimental results support our theory. 1
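A compact sketch of the adaptive forward-backward greedy idea: forward steps add the feature that most reduces squared error, and a backward step removes a feature whenever doing so loses only a fraction of the most recent forward gain. The stopping rule, the fraction, and the plain least-squares refits are illustrative assumptions.

import numpy as np

def foba(X, y, max_features=10, backward_fraction=0.5):
    """Adaptive forward-backward greedy feature selection for least squares (a sketch)."""

    def sse(support):
        if not support:
            return float(y @ y)
        w, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        r = y - X[:, support] @ w
        return float(r @ r)

    support, err = [], sse([])
    for _ in range(4 * max_features):               # hard cap so the sketch always terminates
        if len(support) >= max_features:
            break
        # forward step: add the single feature giving the largest error reduction
        candidates = [j for j in range(X.shape[1]) if j not in support]
        if not candidates:
            break
        errs = {j: sse(support + [j]) for j in candidates}
        j_best = min(errs, key=errs.get)
        gain = err - errs[j_best]
        if gain <= 1e-12:
            break
        support.append(j_best)
        err = errs[j_best]
        # adaptive backward step: drop a feature while the loss in fit stays small
        while len(support) > 1:
            drop_errs = {j: sse([k for k in support if k != j]) for j in support}
            j_drop = min(drop_errs, key=drop_errs.get)
            if drop_errs[j_drop] - err < backward_fraction * gain:
                support.remove(j_drop)
                err = drop_errs[j_drop]
            else:
                break
    return sorted(support)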
5 0.66929513 202 nips-2008-Robust Regression and Lasso
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1
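A small numerical check of the robustness interpretation stated in the abstract, for the simplest feature-wise uncertainty set in which each column of X may be perturbed by at most c in Euclidean norm: the worst-case residual norm equals the nominal residual norm plus c times the L1 norm of the weights. The problem sizes and the value of c below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n, d, c = 30, 5, 0.3                           # c bounds each feature column's disturbance norm
X, w = rng.normal(size=(n, d)), rng.normal(size=d)
y = X @ w + 0.1 * rng.normal(size=n)

r = y - X @ w                                  # nominal residual for this fixed w
# worst-case disturbance: push every perturbed column against the residual direction
delta = np.outer(-c * r / np.linalg.norm(r), np.sign(w))      # shape (n, d), column norms = c
worst_case = np.linalg.norm(y - (X + delta) @ w)
closed_form = np.linalg.norm(r) + c * np.abs(w).sum()         # residual norm + L1 penalty

print(np.isclose(worst_case, closed_form))                    # True: the L1 term is the price of robustness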
6 0.66202706 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations
7 0.65692067 222 nips-2008-Stress, noradrenaline, and realistic prediction of mouse behaviour using reinforcement learning
8 0.64959794 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
9 0.64920884 99 nips-2008-High-dimensional support union recovery in multivariate regression
10 0.64919353 195 nips-2008-Regularized Policy Iteration
11 0.64884537 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
12 0.64757651 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
13 0.64621508 194 nips-2008-Regularized Learning with Networks of Features
14 0.64589059 62 nips-2008-Differentiable Sparse Coding
15 0.64538544 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
16 0.6442132 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
17 0.64404118 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
18 0.64311266 231 nips-2008-Temporal Dynamics of Cognitive Control
19 0.6426481 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
20 0.64253008 73 nips-2008-Estimating Robust Query Models with Convex Optimization