nips nips2007 nips2007-102 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton
Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Richard S. Sutton, Mohammad Ghavamzadeh, Mark Lee — Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada. Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. [sent-2, score-0.176]
2 Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. [sent-3, score-1.158]
3 Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. [sent-4, score-0.523]
4 The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. [sent-5, score-0.193]
5 The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. [sent-6, score-0.204]
6 1 Introduction Actor-critic (AC) algorithms are based on the simultaneous online estimation of the parameters of two structures, called the actor and the critic. [sent-8, score-0.285]
7 The actor corresponds to a conventional action-selection policy, mapping states to actions in a probabilistic manner. [sent-9, score-0.3]
8 The critic corresponds to a conventional value function, mapping states to expected cumulative future reward. [sent-10, score-0.339]
9 Thus, the critic addresses a problem of prediction, whereas the actor is concerned with control. [sent-11, score-0.549]
10 These problems are separable, but are solved simultaneously to find an optimal policy, as in policy iteration. [sent-12, score-0.427]
11 Actor-critic methods were among the earliest to be investigated in reinforcement learning (Barto et al. [sent-15, score-0.114]
12 They were largely supplanted in the 1990’s by methods that estimate action-value functions and use them directly to select actions without an explicit policy structure. [sent-17, score-0.489]
13 These problems led to renewed interest in methods with an explicit representation of the policy, which came to be known as policy gradient methods (Marbach, 1998; Sutton et al. [sent-19, score-0.599]
14 Policy gradient methods without bootstrapping can be easily proved convergent, but converge slowly because of the high variance of their gradient estimates. [sent-21, score-0.354]
15 Another approach to speeding up policy gradient algorithms was proposed by Kakade (2002) and then refined and extended by Bagnell and Schneider (2003) and by Peters et al. [sent-23, score-0.654]
16 The idea was to replace the policy gradient with the so-called natural policy gradient. [sent-25, score-1.026]
17 This was motivated by the intuition that a change in the policy parameterization should not influence the result of the policy update. [sent-26, score-0.854]
18 In terms of the policy update rule, the move to the natural gradient amounts to linearly transforming the gradient using the inverse Fisher information matrix of the policy. [sent-27, score-0.868]
19 All the algorithms are for the average reward setting and use function approximation in the state-value function. [sent-29, score-0.222]
20 For all four methods we prove convergence of the parameters of the policy and state-value function to a local maximum of a performance function that corresponds to the average reward plus a measure of the TD error inherent in the function approximation. [sent-30, score-0.737]
21 Due to space limitations, we do not present the convergence analysis of our algorithms here; it can be found, along with some empirical results using our algorithms, in the extended version of this paper (Bhatnagar et al. [sent-31, score-0.157]
22 The state, action, and reward at each time t ∈ {0, 1, 2, . [sent-40, score-0.135]
23 } are denoted st ∈ S, at ∈ A, and rt ∈ R respectively. [sent-43, score-0.647]
24 We assume the reward is random, realvalued, and uniformly bounded. [sent-44, score-0.135]
25 The agent selects an action at each time t using a randomized stationary policy π(a|s) = Pr(at = a|st = s). [sent-46, score-0.518]
26 We assume (B1) The Markov chain induced by any policy is irreducible and aperiodic. [sent-47, score-0.427]
27 The long-term average reward per step under policy π is defined as $J(\pi) = \lim_{T\to\infty}\frac{1}{T}E\big[\sum_{t=0}^{T-1} r_{t+1}\,\big|\,\pi\big] = \sum_{s\in S} d^{\pi}(s)\sum_{a\in A}\pi(a|s)\,r(s,a)$, where $d^{\pi}(s)$ is the stationary distribution of state s under policy π. [sent-48, score-1.085]
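As a concrete illustration of this definition (not taken from the paper), the following sketch computes J(π) for a small finite MDP by solving for the stationary distribution d^π of the chain induced by π, which exists and is unique under assumption (B1); the function name and interface are illustrative.

```python
import numpy as np

def average_reward(P_pi, r_pi):
    """J(pi) = sum_s d^pi(s) sum_a pi(a|s) r(s,a) for a finite, irreducible chain.

    P_pi: (n, n) state-transition matrix under policy pi
    r_pi: (n,) expected one-step reward under pi, i.e. sum_a pi(a|s) r(s, a)
    """
    n = P_pi.shape[0]
    # Stationary distribution d solves d^T P_pi = d^T together with sum(d) = 1.
    A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(d @ r_pi)

# usage: J = average_reward(np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([1.0, 0.0]))
```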
28 Our aim is to find a policy π ∗ that maximizes the average reward, i. [sent-50, score-0.452]
29 In the average reward formulation, a policy π is assessed according to the expected differential reward associated with states s or state–action pairs (s, a). [sent-53, score-0.793]
30 For all states s ∈ S and actions a ∈ A, the differential action-value function and the differential state-value function under policy π are defined as¹ $Q^{\pi}(s,a) = \sum_{t=0}^{\infty} E[r_{t+1} - J(\pi)\,|\,s_0 = s, a_0 = a, \pi]$, $V^{\pi}(s) = \sum_{a\in A}\pi(a|s)\,Q^{\pi}(s,a)$. [sent-54, score-0.578]
31 Since in this setting a policy π is represented by its parameters θ, policy dependent functions such as J(π), dπ (·), V π (·), and Qπ (·, ·) can be written as J(θ), d(·; θ), V (·; θ), and Q(·, ·; θ), respectively. [sent-56, score-0.896]
32 We assume (B2) For any state–action pair (s, a), policy π(a|s; θ) is continuously differentiable in the parameters θ. [sent-57, score-0.451]
33 , 2000; Baxter & Bartlett, 2001) have shown that the gradient of the average reward for parameterized policies that satisfy (B1) and (B2) is given by² $\nabla J(\pi) = \sum_{s\in S} d^{\pi}(s)\sum_{a\in A}\nabla\pi(a|s)\,Q^{\pi}(s,a)$. [sent-59, score-0.354]
34 (footnote 2) Throughout the paper, we use $\nabla$ to denote $\nabla_{\theta}$ – the gradient with respect to the policy parameters θ. [sent-61, score-0.129]
35 $\sum_{s\in S} d^{\pi}(s)\,b(s)\,\nabla\big(\textstyle\sum_{a\in A}\pi(a|s)\big) = 0$ (1), and thus, for any baseline b(s), the gradient of the average reward can be written as $\nabla J(\pi) = \sum_{s\in S} d^{\pi}(s)\sum_{a\in A}\nabla\pi(a|s)\,\big(Q^{\pi}(s,a) \pm b(s)\big)$ (3). [sent-66, score-0.391]
36 a∈A The baseline can be chosen such in a way that the variance of the gradient estimates is minimized (Greensmith et al. [sent-67, score-0.342]
37 The natural gradient, denoted $\tilde{\nabla} J(\pi)$, can be calculated by linearly transforming the regular gradient, using the inverse Fisher information matrix of the policy: $\tilde{\nabla} J(\pi) = G^{-1}(\theta)\,\nabla J(\pi)$. [sent-69, score-0.137]
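To make this linear transformation concrete, here is a minimal sketch (not from the paper) that forms a sample estimate of the Fisher information matrix from score vectors ψ(s,a) = ∇_θ log π(a|s;θ) and applies its inverse to an ordinary gradient estimate; the ridge term and function interface are illustrative assumptions.

```python
import numpy as np

def natural_gradient(grad_J, score_vectors, ridge=1e-6):
    """Estimate of G^{-1}(theta) grad_J, where G is a sample Fisher matrix
    built from score vectors psi = grad_theta log pi(a|s; theta)."""
    psi = np.asarray(score_vectors, dtype=float)   # shape (n_samples, n_params)
    G = psi.T @ psi / psi.shape[0]                 # sample estimate of E[psi psi^T]
    G += ridge * np.eye(G.shape[0])                # small regularizer for invertibility
    return np.linalg.solve(G, np.asarray(grad_J, dtype=float))
```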
38 3 Policy Gradient with Function Approximation Now consider the case in which the action-value function for a fixed policy π, $Q^{\pi}$, is approximated by a learned function approximator. [sent-71, score-0.427]
39 Suppose $E^{\pi}(w)$ denotes the mean squared error $E^{\pi}(w) = \sum_{s\in S} d^{\pi}(s)\sum_{a\in A}\pi(a|s)\,[Q^{\pi}(s,a) - w^{\top}\psi(s,a) - b(s)]^{2}$ (7) of our compatible linear parameterized approximation $w^{\top}\psi(s,a)$ and an arbitrary baseline b(s). [sent-83, score-0.262]
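As a concrete example of the compatible linear approximation w^T ψ(s,a), the sketch below uses a softmax (Gibbs) policy over a finite action set with linear state-action features, taking ψ(s,a) to be the score function ∇_θ log π(a|s;θ) — the usual compatible-features choice, stated here as an assumption since the extracted sentences above do not define ψ; the feature array `phi_sa` is an illustrative placeholder.

```python
import numpy as np

def softmax_probs(theta, phi_sa):
    """Action probabilities of a softmax policy with preferences theta^T phi(s, a).
    phi_sa: (n_actions, n_params) features for one state s."""
    prefs = phi_sa @ theta
    prefs -= prefs.max()                     # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def compatible_features(theta, phi_sa, a):
    """psi(s, a) = grad_theta log pi(a|s; theta) = phi(s, a) - E_pi[phi(s, .)]."""
    return phi_sa[a] - softmax_probs(theta, phi_sa) @ phi_sa

def advantage_approx(w, theta, phi_sa, a):
    """Compatible linear approximation w^T psi(s, a)."""
    return w @ compatible_features(theta, phi_sa, a)
```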
40 Lemma 1 shows that the value of w ∗ does not depend on the given baseline b(s); as a result the mean squared error problems of Eqs. [sent-85, score-0.141]
41 Lemma 2 shows that if the parameter is set to be equal to w ∗ , then the resulting mean squared error E π (w∗ ) (now treated as a function of the baseline b(s)) is further minimized when b(s) = V π (s). [sent-87, score-0.166]
42 In other words, the variance in the action-value-function estimator is minimized if the baseline is chosen to be the state-value function itself. [sent-88, score-0.158]
43 (footnote 3) It is important to note that Lemma 2 is not about the minimum variance baseline for gradient estimation. [sent-92, score-0.245]
44 It is about the minimum variance baseline of the action-value-function estimator. [sent-93, score-0.116]
45 Next, given the optimum weight parameter $w^{*}$, we obtain the minimum variance baseline in the action-value-function estimator corresponding to policy π. [sent-99, score-0.581]
46 Lemma 2 For any given policy π, the minimum variance baseline $b^{*}(s)$ in the action-value-function estimator corresponds to the state-value function $V^{\pi}(s)$. [sent-101, score-0.579]
47 Note that by (B1), the Markov chain corresponding to any policy π is positive recurrent because the number of states is finite. [sent-104, score-0.447]
48 The next lemma shows that δt is a consistent estimate of the advantage function Aπ . [sent-120, score-0.162]
49 Lemma 3 Under a given policy π, we have $E[\delta_t\,|\,s_t, a_t, \pi] = A^{\pi}(s_t, a_t)$. [sent-121, score-0.427]
50 However, calculating $\delta_t$ requires having estimates $\hat{J}$, $\hat{V}$ of the average reward and the value function. [sent-130, score-0.16]
51 While an average reward estimate is simple enough to obtain given the single-stage reward function, the same is not necessarily true for the value function. [sent-131, score-0.328]
52 One may then approximate $V^{\pi}(s)$ with $v^{\top} f(s)$, where v is a parameter vector that can be tuned (for a fixed policy π) using a TD algorithm. [sent-134, score-0.427]
53 Also, let $\delta_t = r_{t+1} - J_{t+1} + v^{\pi\top} f(s_{t+1}) - v^{\pi\top} f(s_t)$, where $\delta_t$ corresponds to a stationary estimate of the TD error with function approximation under policy π. [sent-138, score-0.562]
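A minimal sketch of how such a TD error can be computed online from the linear approximation v^T f(s); the running average-reward update with step-size xi is an assumed form of the estimate of J(π), not quoted from the paper.

```python
import numpy as np

def td_error_average_reward(v, J_hat, f_s, f_s_next, r_next, xi):
    """Return the updated average-reward estimate and the TD error
    delta = r_{t+1} - J_hat + v^T f(s_{t+1}) - v^T f(s_t)."""
    J_hat = (1.0 - xi) * J_hat + xi * r_next      # running estimate of J(pi) (assumed form)
    delta = r_next - J_hat + v @ f_s_next - v @ f_s
    return J_hat, delta
```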
54 Proof of this lemma can be found in the extended version of this paper (Bhatnagar et al. [sent-140, score-0.152]
55 For the case with function approximation that we study, from Lemma 4, the quantity $\sum_{s\in S} d^{\pi}(s)\,[\nabla \bar{V}^{\pi}(s) - \nabla v^{\pi\top} f(s)]$ may be viewed as the error or bias in the estimate of the gradient of average reward that results from the use of function approximation. [sent-143, score-0.386]
56 They update the policy parameters along the direction of the average-reward gradient. [sent-146, score-0.543]
57 While estimates of the regular gradient are used for this purpose in Algorithm 1, natural gradient estimates are used in Algorithms 2–4. [sent-147, score-0.382]
58 While critic updates in our algorithms can be easily extended to the case of TD(λ), λ > 0, we restrict our attention to the case when λ = 0. [sent-148, score-0.355]
59 In addition to assumptions (B1) and (B2), we make the following assumption: (B3) The step-size schedules for the critic $\{\alpha_t\}$ and the actor $\{\beta_t\}$ satisfy $\sum_t \alpha_t = \sum_t \beta_t = \infty$, $\sum_t \alpha_t^{2},\ \sum_t \beta_t^{2} < \infty$, and $\beta_t/\alpha_t \to 0$. [sent-149, score-0.553]
60 Hence the critic has uniformly higher increments than the actor beyond some t0 , and thus it converges faster than the actor. [sent-152, score-0.532]
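One family of schedules satisfying (B3), given purely as an illustrative example: α_t = a/(t+1)^{2/3} and β_t = b/(t+1), for which Σ_t α_t = Σ_t β_t = ∞, Σ_t α_t², Σ_t β_t² < ∞, and β_t/α_t → 0.

```python
def critic_stepsize(t, a=1.0):
    """alpha_t = a / (t+1)^(2/3): sum diverges, sum of squares converges."""
    return a / (t + 1) ** (2.0 / 3.0)

def actor_stepsize(t, b=0.1):
    """beta_t = b / (t+1): slower timescale, so beta_t / alpha_t -> 0."""
    return b / (t + 1)
```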
61 Input: • Randomized parameterized policy π(·|·; θ), • Value function feature vector f (s). [sent-154, score-0.454]
62 For t = 0, 1, 2, . . . do — Execution: • Draw action $a_t \sim \pi(a_t|s_t; \theta_t)$, • Observe next state $s_{t+1} \sim p(s_{t+1}|s_t, a_t)$, • Observe reward $r_{t+1}$. [sent-159, score-0.198]
63 This is the only AC algorithm presented in the paper that is based on the regular gradient estimate. [sent-162, score-0.17]
64 Its per time-step computational cost is linear in the number of policy and value-function parameters. [sent-164, score-0.427]
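Putting the pieces above together, the following is one possible incremental realization of a regular-gradient actor-critic of the kind Algorithm 1 describes — a TD(0) critic on v and an actor step along δ_t ψ(s_t, a_t). The environment interface (`env_step`, `sample_action`), the feature callables `f` and `psi`, and the average-reward step `xi` are assumptions of this sketch, not the paper's pseudocode.

```python
import numpy as np

def regular_gradient_ac(theta, v, s0, env_step, sample_action, f, psi,
                        alpha, beta, xi, n_steps):
    """Hedged sketch of a regular-gradient actor-critic (cf. Algorithm 1).

    env_step(s, a) -> (r_next, s_next); sample_action(s, theta) -> a ~ pi(.|s; theta)
    f(s) -> value-function features; psi(s, a) -> compatible features grad log pi
    alpha(t), beta(t), xi(t) -> critic, actor and average-reward step sizes
    """
    J_hat, s = 0.0, s0
    for t in range(n_steps):
        a = sample_action(s, theta)
        r_next, s_next = env_step(s, a)
        J_hat = (1.0 - xi(t)) * J_hat + xi(t) * r_next        # average-reward estimate
        delta = r_next - J_hat + v @ f(s_next) - v @ f(s)     # TD error
        v = v + alpha(t) * delta * f(s)                       # critic update
        theta = theta + beta(t) * delta * psi(s, a)           # actor update (regular gradient)
        s = s_next
    return theta, v, J_hat
```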
65 Our second algorithm stores a matrix $G^{-1}$ and two parameter vectors θ and v. [sent-178, score-0.105]
66 Its per timestep computational cost is linear in the number of value-function parameters and quadratic in the number of policy parameters. [sent-179, score-0.451]
67 Algorithm 2 (Natural-Gradient AC with Fisher Information Matrix): Critic Update: $v_{t+1} = v_t + \alpha_t \delta_t f(s_t)$, Actor Update: $\theta_{t+1} = \theta_t + \beta_t G_{t+1}^{-1}\,\delta_t\,\psi(s_t, a_t)$, with the estimate of the inverse Fisher information matrix updated according to Eq. [sent-180, score-0.105]
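A sketch of one step of the updates just stated. The paper maintains the inverse Fisher estimate via an equation not reproduced in the extracted text; the Sherman-Morrison recursion below, for the running average G_{t+1} = (1-α_t)G_t + α_t ψ_t ψ_t^T, is one reasonable choice and is an assumption of this sketch.

```python
import numpy as np

def fisher_inv_update(G_inv, psi_t, alpha):
    """Inverse of G_{t+1} = (1 - alpha) G_t + alpha psi psi^T via Sherman-Morrison
    (the running-average form of the Fisher estimate is assumed)."""
    Gp = G_inv @ psi_t
    denom = 1.0 - alpha + alpha * (psi_t @ Gp)
    return (G_inv - alpha * np.outer(Gp, Gp) / denom) / (1.0 - alpha)

def natural_ac_fisher_step(theta, v, G_inv, delta, f_s, psi_t, alpha, beta):
    """One step of the Algorithm 2 updates stated above:
    critic: v <- v + alpha * delta * f(s_t)
    actor:  theta <- theta + beta * G^{-1} * delta * psi(s_t, a_t)"""
    G_inv = fisher_inv_update(G_inv, psi_t, alpha)
    v = v + alpha * delta * f_s
    theta = theta + beta * (G_inv @ (delta * psi_t))
    return theta, v, G_inv
```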
68 As mentioned in Section 3, it is better to think of the compatible approximation $w^{\top}\psi(s,a)$ as an approximation of the advantage function rather than of the action-value function. [sent-187, score-0.156]
69 In our next algorithm we tune the parameters w in such a way as to minimize an estimate of the least-squared error $E^{\pi}(w) = E_{s\sim d^{\pi}, a\sim\pi}\big[(w^{\top}\psi(s,a) - A^{\pi}(s,a))^{2}\big]$. [sent-188, score-0.106]
70 The gradient of $E^{\pi}(w)$ is thus $\nabla_w E^{\pi}(w) = 2\,E_{s\sim d^{\pi}, a\sim\pi}\big[(w^{\top}\psi(s,a) - A^{\pi}(s,a))\,\psi(s,a)\big]$, which can be estimated as $\widehat{\nabla_w E^{\pi}}(w) = 2\big[\psi(s_t,a_t)\psi(s_t,a_t)^{\top} w - \delta_t\,\psi(s_t,a_t)\big]$. [sent-189, score-0.129]
71 Hence, we update advantage parameters w along with value-function parameters v in the critic update of this algorithm. [sent-190, score-0.515]
72 As in Peters et al. (2005), we use the natural gradient estimate $\tilde{\nabla} J(\theta_t) = w_{t+1}$ in the actor update of Alg. [sent-192, score-0.506]
73 Its per time-step computational cost is linear in the number of value-function parameters and quadratic in the number of policy parameters. [sent-195, score-0.451]
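A sketch of how the w-update and actor step just described could be realized (the exact pseudocode of Algorithm 3 is not reproduced in the extracted sentences): w takes a stochastic-gradient step on the estimate ψ_t ψ_t^T w − δ_t ψ_t, with the factor 2 folded into the step size (an assumption of this sketch), and the actor moves along w as the natural-gradient estimate.

```python
import numpy as np

def natural_ac_advantage_step(theta, v, w, delta, f_s, psi_t, alpha, beta):
    """One step of a natural-gradient AC with advantage parameters (cf. Algorithm 3).
    Critic: TD(0) update of v plus a gradient step for w on the estimated
    least-squares error; Actor: theta <- theta + beta * w_next."""
    v_next = v + alpha * delta * f_s
    # (I - alpha * psi psi^T) w + alpha * delta * psi
    w_next = w - alpha * ((psi_t @ w) * psi_t - delta * psi_t)
    theta_next = theta + beta * w_next
    return theta_next, v_next, w_next
```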
74 Although an estimate of $G^{-1}(\theta)$ is not explicitly computed and used in Algorithm 3, the convergence analysis of this algorithm shows that the overall scheme still moves in the direction of the natural gradient of average reward. [sent-197, score-0.33]
75 In Algorithm 4, however, we explicitly estimate $G^{-1}(\theta)$ (as in Algorithm 2), and use it in the critic update for w. [sent-198, score-0.402]
76 The overall scheme is again seen to follow the direction of the natural gradient of average reward. [sent-199, score-0.22]
77 Here, we let $\widetilde{\nabla}_w E^{\pi}(w) = 2\,G_t^{-1}\big[\psi(s_t,a_t)\psi(s_t,a_t)^{\top} w - \delta_t\,\psi(s_t,a_t)\big]$ be the estimate of the natural gradient of the least-squared error $E^{\pi}(w)$. [sent-200, score-0.262]
78 Its per time-step computational cost is linear in the number of value-function parameters and quadratic in the number of policy parameters. [sent-203, score-0.451]
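Analogously, a sketch in the spirit of Algorithm 4, preconditioning the w-update with the estimated inverse Fisher matrix as in the natural-gradient error estimate above (factor 2 again folded into the step size). The Sherman-Morrison recursion for the inverse Fisher estimate and the actor step along w follow the pattern of the earlier sketches and are assumptions, not the paper's exact recursions.

```python
import numpy as np

def fisher_inv_update(G_inv, psi_t, alpha):
    """Inverse of G_{t+1} = (1 - alpha) G_t + alpha psi psi^T (assumed form)."""
    Gp = G_inv @ psi_t
    denom = 1.0 - alpha + alpha * (psi_t @ Gp)
    return (G_inv - alpha * np.outer(Gp, Gp) / denom) / (1.0 - alpha)

def natural_ac_fisher_advantage_step(theta, v, w, G_inv, delta, f_s, psi_t,
                                     alpha, beta):
    """One step of a natural-gradient AC that preconditions the advantage-parameter
    update with the estimated inverse Fisher matrix (cf. Algorithm 4)."""
    G_inv = fisher_inv_update(G_inv, psi_t, alpha)
    v = v + alpha * delta * f_s
    w = w - alpha * (G_inv @ ((psi_t @ w) * psi_t - delta * psi_t))
    theta = theta + beta * w
    return theta, v, w, G_inv
```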
79 However, because the critic will generally converge to an approximation of the desired projection of the value function (defined by the value function features f ) in these algorithms, the corresponding convergence results are necessarily weaker, as indicated by the following theorem. [sent-208, score-0.392]
80 For the parameter iterations in Algorithms 1–4 (see footnote 5), we have $(\hat{J}_t, v_t, \theta_t) \to \{(J(\theta^{*}), v^{\theta^{*}}, \theta^{*})\,|\,\theta^{*} \in Z\}$ as $t \to \infty$ with probability one, where the set Z corresponds to the set of local maxima of a performance function whose gradient is $E[\delta_t^{\pi}\,\psi(s_t, a_t)\,|\,\theta]$ (cf. [sent-209, score-0.148]
81 This theorem indicates that the policy and state-value-function parameters converge to a local maximum of a performance function that corresponds to the average reward plus a measure of the TD error inherent in the function approximation. [sent-213, score-0.661]
82 Unlike our Algorithms 2–4, their algorithm does not use estimates of the natural gradient in its actor’s update. [sent-215, score-0.219]
83 1) Konda’s algorithm uses the Markov process of state–action pairs, and thus its critic update is based on an action-value function. [sent-218, score-0.425]
84 Our Algorithm 1 uses the state process, and therefore its critic update is based on a state-value function. [sent-220, score-0.394]
85 2) While Algorithm 1 uses a TD error in both critic and actor recursions, Konda’s algorithm uses a TD error only in its critic update. [sent-222, score-0.912]
86 The actor recursion in Konda’s algorithm uses an action-value estimate instead. [sent-223, score-0.305]
87 Because the TD error is a consistent estimate of the advantage function (Lemma 3), the actor recursion in Alg. [sent-224, score-0.364]
88 3) The convergence analysis of Konda’s algorithm is based on the martingale approach and aims at bounding error terms and directly showing convergence; convergence to a local optimum is shown when a TD(1) critic is used. [sent-226, score-0.514]
89 For the case where λ < 1, they show that given an $\epsilon > 0$, there exists λ close enough to one such that when a TD(λ) critic is used, one gets $\liminf_t \lVert \nabla J(\theta_t)\rVert < \epsilon$ with probability one. (footnote 5) The proof of this theorem requires another assumption, viz. [sent-227, score-0.355]
90 Unlike Konda and Tsitsiklis, we primarily use the ordinary differential equation (ODE) based approach for our convergence analysis. [sent-234, score-0.11]
91 Comparison with Peters et al. (2005): Our Algorithms 2–4 extend their algorithm by being fully incremental, and we provide convergence proofs. [sent-237, score-0.144]
92 It is not clear how to satisfactorily incorporate least-squares TD methods in a context in which the policy is changing, and our proof techniques do not immediately extend to this case. [sent-239, score-0.475]
93 7 Conclusions and Future Work We have introduced and analyzed four AC algorithms utilizing both linear function approximation and bootstrapping, a combination which seems essential to large-scale applications of reinforcement learning. [sent-240, score-0.15]
94 All of the algorithms are based on existing ideas such as TD-learning, natural policy gradients, and two-timescale stochastic approximation, but combined in new ways. [sent-241, score-0.499]
95 The main contribution of this paper is proving convergence of the algorithms to a local maximum in the space of policy and value-function parameters. [sent-242, score-0.535]
96 The way we use natural gradients is distinctive in that it is totally incremental: the policy is changed on every time step, yet the gradient computation is never reset as it is in the algorithm of Peters et al. [sent-245, score-0.698]
97 It never explicitly stores an estimate of the inverse Fisher information matrix and, as a result, it requires less computation. [sent-249, score-0.145]
98 1) It is important to characterize the quality of the converged solutions, either by bounding the performance loss due to bootstrapping and approximation error, or through a thorough empirical study. [sent-254, score-0.116]
99 Variance reduction techniques for gradient estimates in reinforcement learning. [sent-288, score-0.229]
100 Policy gradient methods for reinforcement learning with function approximation. [sent-340, score-0.2]
wordName wordTfidf (topN-words)
[('st', 0.551), ('policy', 0.427), ('critic', 0.3), ('actor', 0.232), ('td', 0.206), ('konda', 0.192), ('ac', 0.152), ('jt', 0.136), ('reward', 0.135), ('gradient', 0.129), ('sutton', 0.119), ('gt', 0.112), ('bhatnagar', 0.11), ('peters', 0.101), ('fisher', 0.096), ('rt', 0.096), ('baseline', 0.084), ('lemma', 0.083), ('tsitsiklis', 0.077), ('reinforcement', 0.071), ('update', 0.069), ('bootstrapping', 0.064), ('stores', 0.064), ('compatible', 0.061), ('convergence', 0.059), ('marbach', 0.055), ('baxter', 0.055), ('differential', 0.051), ('wt', 0.051), ('barto', 0.049), ('vijayakumar', 0.048), ('schaal', 0.048), ('incremental', 0.047), ('et', 0.043), ('natural', 0.043), ('kakade', 0.039), ('gradients', 0.038), ('action', 0.038), ('ghavamzadeh', 0.037), ('greensmith', 0.037), ('supt', 0.037), ('bartlett', 0.036), ('approximation', 0.033), ('estimate', 0.033), ('variance', 0.032), ('qw', 0.032), ('humanoid', 0.032), ('dissertation', 0.032), ('doctoral', 0.032), ('equating', 0.032), ('schneider', 0.032), ('temporal', 0.032), ('error', 0.031), ('claim', 0.029), ('advantage', 0.029), ('actions', 0.029), ('algorithms', 0.029), ('estimates', 0.029), ('proof', 0.028), ('eligibility', 0.027), ('parameterized', 0.027), ('lim', 0.027), ('extended', 0.026), ('bagnell', 0.026), ('martingale', 0.026), ('squared', 0.026), ('average', 0.025), ('inverse', 0.025), ('minimized', 0.025), ('state', 0.025), ('parameters', 0.024), ('updated', 0.024), ('direction', 0.023), ('regular', 0.023), ('matrix', 0.023), ('transforming', 0.023), ('symmetric', 0.022), ('recursion', 0.022), ('alberta', 0.022), ('traces', 0.021), ('equals', 0.021), ('optimum', 0.021), ('satisfy', 0.021), ('proving', 0.02), ('extend', 0.02), ('states', 0.02), ('corresponds', 0.019), ('stationary', 0.019), ('converged', 0.019), ('proceedings', 0.019), ('written', 0.018), ('algorithm', 0.018), ('policies', 0.017), ('randomized', 0.017), ('whereas', 0.017), ('estimator', 0.017), ('consistent', 0.017), ('agent', 0.017), ('rewards', 0.017), ('four', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 102 nips-2007-Incremental Natural Actor-Critic Algorithms
Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton
Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. 1
2 0.39864016 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini
Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1
3 0.27069959 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
Author: Umar Syed, Robert E. Schapire
Abstract: We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. We follow on the work of Abbeel and Ng [1] who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert’s, given enough examples. However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert’s. Moreover, our algorithm is computationally faster, is easier to implement, and can be applied even in the absence of an expert. The method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire [2] for playing repeated matrix games. In addition to our formal presentation and analysis of the new algorithm, we sketch how the method can be applied when the transition function itself is unknown, and we provide an experimental demonstration of the algorithm on a toy video-game environment. 1
4 0.23432246 86 nips-2007-Exponential Family Predictive Representations of State
Author: David Wingate, Satinder S. Baveja
Abstract: In order to represent state in controlled, partially observable, stochastic dynamical systems, some sort of sufficient statistic for history is necessary. Predictive representations of state (PSRs) capture state as statistics of the future. We introduce a new model of such systems called the “Exponential family PSR,” which defines as state the time-varying parameters of an exponential family distribution which models n sequential observations in the future. This choice of state representation explicitly connects PSRs to state-of-the-art probabilistic modeling, which allows us to take advantage of current efforts in high-dimensional density estimation, and in particular, graphical models and maximum entropy models. We present a parameter learning algorithm based on maximum likelihood, and we show how a variety of current approximate inference methods apply. We evaluate the quality of our model with reinforcement learning by directly evaluating the control performance of the model. 1
5 0.21673214 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1
6 0.21089767 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
7 0.20884776 191 nips-2007-Temporal Difference Updating without a Learning Rate
8 0.19362479 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
9 0.1833733 162 nips-2007-Random Sampling of States in Dynamic Programming
10 0.16016218 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
11 0.15933846 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models
12 0.14517356 185 nips-2007-Stable Dual Dynamic Programming
13 0.14404741 127 nips-2007-Measuring Neural Synchrony by Message Passing
14 0.13154726 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning
15 0.12759317 208 nips-2007-TrueSkill Through Time: Revisiting the History of Chess
16 0.12337144 197 nips-2007-The Infinite Markov Model
17 0.12090827 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion
18 0.11448964 40 nips-2007-Bundle Methods for Machine Learning
19 0.099477738 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm
20 0.097143218 215 nips-2007-What makes some POMDP problems easy to approximate?
topicId topicWeight
[(0, -0.219), (1, -0.444), (2, 0.096), (3, -0.112), (4, -0.242), (5, 0.102), (6, 0.023), (7, -0.085), (8, 0.059), (9, 0.097), (10, 0.002), (11, -0.077), (12, -0.045), (13, 0.023), (14, -0.009), (15, 0.058), (16, 0.019), (17, 0.084), (18, -0.178), (19, 0.271), (20, -0.127), (21, 0.109), (22, 0.166), (23, -0.055), (24, 0.13), (25, 0.109), (26, -0.039), (27, -0.038), (28, 0.051), (29, -0.005), (30, 0.028), (31, -0.054), (32, -0.041), (33, -0.069), (34, 0.025), (35, -0.029), (36, 0.016), (37, -0.038), (38, -0.001), (39, -0.03), (40, -0.074), (41, 0.036), (42, 0.077), (43, 0.03), (44, 0.048), (45, 0.039), (46, -0.008), (47, 0.033), (48, 0.082), (49, 0.054)]
simIndex simValue paperId paperTitle
same-paper 1 0.97944963 102 nips-2007-Incremental Natural Actor-Critic Algorithms
Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton
Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. 1
2 0.86174494 191 nips-2007-Temporal Difference Updating without a Learning Rate
Author: Marcus Hutter, Shane Legg
Abstract: We derive an equation for temporal difference learning from statistical principles. Specifically, we start with the variational principle and then bootstrap to produce an updating rule for discounted state value estimates. The resulting equation is similar to the standard equation for temporal difference learning with eligibility traces, so called TD(λ), however it lacks the parameter α that specifies the learning rate. In the place of this free parameter there is now an equation for the learning rate that is specific to each state transition. We experimentally test this new learning rule against TD(λ) and find that it offers superior performance in various settings. Finally, we make some preliminary investigations into how to extend our new temporal difference algorithm to reinforcement learning. To do this we combine our update equation with both Watkins’ Q(λ) and Sarsa(λ) and find that it again offers superior performance without a learning rate parameter. 1
3 0.78120452 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini
Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1
4 0.62346375 86 nips-2007-Exponential Family Predictive Representations of State
Author: David Wingate, Satinder S. Baveja
Abstract: In order to represent state in controlled, partially observable, stochastic dynamical systems, some sort of sufficient statistic for history is necessary. Predictive representations of state (PSRs) capture state as statistics of the future. We introduce a new model of such systems called the “Exponential family PSR,” which defines as state the time-varying parameters of an exponential family distribution which models n sequential observations in the future. This choice of state representation explicitly connects PSRs to state-of-the-art probabilistic modeling, which allows us to take advantage of current efforts in high-dimensional density estimation, and in particular, graphical models and maximum entropy models. We present a parameter learning algorithm based on maximum likelihood, and we show how a variety of current approximate inference methods apply. We evaluate the quality of our model with reinforcement learning by directly evaluating the control performance of the model. 1
5 0.60903102 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1
6 0.58657223 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
7 0.52472246 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning
8 0.50974983 127 nips-2007-Measuring Neural Synchrony by Message Passing
9 0.49727118 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
10 0.47748753 185 nips-2007-Stable Dual Dynamic Programming
11 0.45568123 208 nips-2007-TrueSkill Through Time: Revisiting the History of Chess
12 0.44941309 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
13 0.42922398 162 nips-2007-Random Sampling of States in Dynamic Programming
14 0.42922026 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion
15 0.42077968 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models
16 0.39865229 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
17 0.36985701 52 nips-2007-Competition Adds Complexity
18 0.35514501 197 nips-2007-The Infinite Markov Model
19 0.30679652 40 nips-2007-Bundle Methods for Machine Learning
20 0.28601316 57 nips-2007-Congruence between model and human attention reveals unique signatures of critical visual events
topicId topicWeight
[(5, 0.03), (12, 0.225), (13, 0.133), (16, 0.048), (21, 0.045), (31, 0.016), (34, 0.015), (35, 0.026), (47, 0.129), (49, 0.021), (83, 0.117), (85, 0.033), (90, 0.064)]
simIndex simValue paperId paperTitle
same-paper 1 0.77103001 102 nips-2007-Incremental Natural Actor-Critic Algorithms
Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton
Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. 1
2 0.67878586 22 nips-2007-Agreement-Based Learning
Author: Percy Liang, Dan Klein, Michael I. Jordan
Abstract: The learning of probabilistic models with many hidden variables and nondecomposable dependencies is an important and challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We propose an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks. 1
3 0.67811626 191 nips-2007-Temporal Difference Updating without a Learning Rate
Author: Marcus Hutter, Shane Legg
Abstract: We derive an equation for temporal difference learning from statistical principles. Specifically, we start with the variational principle and then bootstrap to produce an updating rule for discounted state value estimates. The resulting equation is similar to the standard equation for temporal difference learning with eligibility traces, so called TD(λ), however it lacks the parameter α that specifies the learning rate. In the place of this free parameter there is now an equation for the learning rate that is specific to each state transition. We experimentally test this new learning rule against TD(λ) and find that it offers superior performance in various settings. Finally, we make some preliminary investigations into how to extend our new temporal difference algorithm to reinforcement learning. To do this we combine our update equation with both Watkins’ Q(λ) and Sarsa(λ) and find that it again offers superior performance without a learning rate parameter. 1
4 0.67770892 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
Author: Fred Richardson, William M. Campbell
Abstract: Many tasks in speech processing involve classification of long term characteristics of a speech segment such as language, speaker, dialect, or topic. A natural technique for determining these characteristics is to first convert the input speech into a sequence of tokens such as words, phones, etc. From these tokens, we can then look for distinctive sequences, keywords, that characterize the speech. In many applications, a set of distinctive keywords may not be known a priori. In this case, an automatic method of building up keywords from short context units such as phones is desirable. We propose a method for the construction of keywords based upon Support Vector Machines. We cast the problem of keyword selection as a feature selection problem for n-grams of phones. We propose an alternating filter-wrapper method that builds successively longer keywords. Application of this method to language recognition and topic recognition tasks shows that the technique produces interesting and significant qualitative and quantitative results.
5 0.67199475 62 nips-2007-Convex Learning with Invariances
Author: Choon H. Teo, Amir Globerson, Sam T. Roweis, Alex J. Smola
Abstract: Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary losses. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly. 1
6 0.66955829 84 nips-2007-Expectation Maximization and Posterior Constraints
7 0.65609616 86 nips-2007-Exponential Family Predictive Representations of State
8 0.65403777 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
9 0.65295762 63 nips-2007-Convex Relaxations of Latent Variable Training
10 0.64585602 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
11 0.64306784 14 nips-2007-A configurable analog VLSI neural network with spiking neurons and self-regulating plastic synapses
12 0.64278716 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
13 0.64222813 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion
14 0.64215261 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
15 0.64154321 24 nips-2007-An Analysis of Inference with the Universum
16 0.63640922 134 nips-2007-Multi-Task Learning via Conic Programming
17 0.63579518 117 nips-2007-Learning to classify complex patterns using a VLSI network of spiking neurons
18 0.6355983 100 nips-2007-Hippocampal Contributions to Control: The Third Way
19 0.63454837 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
20 0.63389528 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC