nips nips2011 nips2011-16 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Paul Wagner
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1
Reference: text
sentIndex sentText sentNum sentScore
1 A reinterpretation of the policy oscillation phenomenon in approximate policy iteration Paul Wagner Department of Information and Computer Science Aalto University School of Science PO Box 15400, FI-00076 Aalto, Finland pwagner@cis. [sent-1, score-2.09]
2 fi Abstract A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. [sent-3, score-1.187]
3 The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. [sent-4, score-1.19]
4 We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. [sent-5, score-0.204]
5 In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. [sent-8, score-0.714]
6 1 Introduction We consider the reinforcement learning problem in which one attempts to find a good policy for controlling a stochastic nonlinear dynamical system. [sent-11, score-0.837]
7 The majority of these methods can be categorized into greedy value function methods (critic-only) and value-based policy gradient methods (actor-critic) (e. [sent-17, score-0.95]
8 The former approach, although fast, is susceptible to potentially severe policy oscillations in the presence of approximations. [sent-20, score-0.874]
9 This phenomenon is known as the policy oscillation (or policy chattering) phenomenon [7, 8]. [sent-21, score-2.082]
10 Bertsekas has recently called attention to the currently not well understood policy oscillation phenomenon [7]. [sent-26, score-1.24]
11 First, the policy oscillation phenomenon is intimately connected to some aspects of the learning dynamics at the very heart of approximate dynamic ... (footnote: An extended version of this paper is available at http://users. [sent-31, score-1.34]
12 The policy oscillation phenomenon is strongly associated in the literature with the popular Tetris benchmark problem. [sent-38, score-1.24]
13 Impressively fast initial improvement followed by severe degradation was reported in [12] using a greedy approximate policy iteration method. [sent-42, score-1.103]
14 This degradation has been taken in the literature as a manifestation of the policy oscillation phenomenon [12, 8]. [sent-43, score-1.298]
15 Policy gradient and greedy approximate value iteration methods have shown much more stable behavior in the Tetris problem [13, 14], although it has seemed that this stability tends to come at the price of speed (see esp. [sent-44, score-0.409]
16 The typical performance levels obtained with approximate dynamic programming methods have been around 5,000 points [12, 8, 13, 16], while an improvement to around 20,000 points has been obtained in [14] by considerably lowering the discount factor. [sent-47, score-0.251]
17 It has been hypothesized in [7] that this grossly suboptimal performance of even the best-performing approximate dynamic programming methods might also have some subtle connection to the oscillation phenomenon. [sent-49, score-0.58]
18 After providing background in Section 2, we discuss the policy oscillation phenomenon in Section 3 along with three examples, one of which is novel and generalizes the others. [sent-52, score-1.24]
19 We develop a novel view to the policy oscillation phenomenon in Sections 4 and 5. [sent-53, score-1.24]
20 We validate the view also empirically in Section 6 and proceed to looking for the suggested connection between the oscillation phenomenon and the convergence issues in the Tetris problem. [sent-54, score-0.57]
21 We report empirical evidence that indeed suggests a shared explanation to the policy degradation observed in [12, 8] and the early stagnation of all the rest of the attempted approximate dynamic programming methods. [sent-55, score-0.975]
22 However, it seems that this explanation is not primarily related to the oscillation phenomenon but to numerical instability. [sent-56, score-0.552]
23 A (soft-)greedy policy π ∗ (a|s, Q) is a (stochastic) mapping from states to actions and is based on the value function Q. [sent-60, score-0.759]
24 A parameterized policy π(a|s, θ) is a stochastic mapping from states to actions and is based on the parameter vector θ. [sent-61, score-0.781]
25 In policy iteration, the current policy is fully evaluated, after which a policy improvement step is taken based on this evaluation. [sent-65, score-2.191]
26 In optimistic policy iteration, policy improvement is based on an incomplete evaluation. [sent-66, score-1.483]
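To make the distinction concrete, here is a minimal Python schematic (our own illustrative sketch; the evaluate/improve callables are placeholders, not the paper's code):

```python
def policy_iteration(policy, evaluate, improve, iterations):
    """Full policy iteration: complete evaluation of the current policy, then improvement."""
    for _ in range(iterations):
        q = evaluate(policy)        # full evaluation of the current policy
        policy = improve(q)         # greedy (or soft-greedy) improvement step
    return policy

def optimistic_policy_iteration(policy, partial_evaluate, improve, iterations, q=None):
    """Optimistic variant: improvement is based on an incomplete evaluation."""
    for _ in range(iterations):
        q = partial_evaluate(policy, q)   # e.g. a few TD sweeps warm-started from the old estimate
        policy = improve(q)
    return policy
```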
27 , [2, 3]), the current policy on iteration k is usually implicit and is greedy (and thus deterministic) with respect to the value function Qk−1 of the previous policy: π∗(a|s, Qk−1) = 1 if a = arg maxb Qk−1(s, b), and 0 otherwise. (1) [sent-70, score-1.012]
28 Soft-greedy iteration is obtained by slightly softening π∗ in some way so that π∗(a|s, Qk−1) > 0, ∀a, s, the Gibbs soft-greedy policy class with a temperature τ (Boltzmann exploration) being a common choice: π∗(a|s, Qk−1) ∝ exp(Qk−1(s, a)/τ). (2) [sent-72, score-0.811]
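As an illustration of (1) and (2), the sketch below (ours, with hypothetical helper names) selects actions greedily and via Boltzmann exploration from a tabular action-value function; as τ → 0 the soft-greedy distribution concentrates on the greedy action, recovering (1).

```python
import numpy as np

def greedy_action(q_values):
    """Greedy policy (1): all probability mass on argmax_b Q(s, b)."""
    return int(np.argmax(q_values))

def boltzmann_probs(q_values, tau=1.0):
    """Gibbs soft-greedy policy (2): pi(a|s) proportional to exp(Q(s, a) / tau)."""
    z = np.asarray(q_values, dtype=float) / tau
    z -= z.max()                     # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

q = [0.1, 0.5, 0.3]
print(greedy_action(q))              # -> 1
print(boltzmann_probs(q, tau=1.0))   # smooth distribution over the three actions
print(boltzmann_probs(q, tau=0.01))  # nearly all mass on action 1
```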
29 A common choice for approximating Q is to obtain a least-squares fit using a linear-in-parameters approximator Q̃ with the feature basis φ∗: Q̃k(s, a, wk) = wk⊤φ∗(s, a) ≈ Qk(s, a). (3) [sent-74, score-0.195]
30 For the soft-greedy case, one option is to use an approximator that will obtain an approximation of an advantage function (see [9] and footnote 1): Ãk(s, a, wk) = wk⊤(φ∗(s, a) − Σb π∗(b|s, Ãk−1) φ∗(s, b)) ≈ Ak(s, a). (4) [sent-75, score-0.195]
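As a sketch of how the compatible advantage features in (4) could be formed and fitted, the snippet below is our own illustration (array shapes and the least-squares targets are assumptions, not the authors' implementation):

```python
import numpy as np

def compatible_features(phi_sa, pi_s):
    """phi_sa: (num_actions, d) rows phi*(s, a); pi_s: (num_actions,) action probabilities.
    Returns phi*(s, a) - sum_b pi(b|s) phi*(s, b) for every action a, as in (4)."""
    baseline = pi_s @ phi_sa          # expected feature vector under the policy
    return phi_sa - baseline

def fit_advantage_weights(psi, advantage_targets):
    """Least-squares fit of w so that w^T psi approximates sampled advantages."""
    w, *_ = np.linalg.lstsq(psi, advantage_targets, rcond=None)
    return w
```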
31 For greedy approximate policy iteration in the general case, policy convergence is guaranteed only up to bounded sustained oscillation [2]. [sent-77, score-2.226]
32 Optimistic variants can permit asymptotic convergence in parameters, although the corresponding policy can manifest sustained oscillation even then [8, 2, 7]. [sent-78, score-1.303]
33 For the case of greedy approximate value iteration, a line of research has provided solid (although restrictive) conditions for the approximator class for having asymptotic parameter convergence (reviewed in, e. [sent-79, score-0.283]
34 , [3]), whereas the question of policy convergence in these cases has been left quite open. [sent-81, score-0.746]
35 In the rest of the paper, our focus will be on non-optimistic approximate policy iteration. [sent-82, score-0.747]
36 , [9, 6, 4, 5]), the current policy on iteration k is explicitly represented using some differentiable stochastic policy class π(θ), the Gibbs policy with some basis φ being a common choice: π(a|s, θ) ∝ exp(θ⊤φ(s, a)). (5) [sent-85, score-2.274]
37 In actor-critic (value-based policy gradient) methods that implement a policy gradient based approximate policy iteration scheme, the so-called 'compatibility condition' is fulfilled if the value function is approximated using (4) with φ∗ = φ and π(θk) in place of π∗(Ãk−1) (e. [sent-87, score-2.409]
38 In this case, the value function parameter vector w becomes the natural gradient estimate η for the policy π(a|s, θ), leading to the natural actor-critic algorithm [13, 4]: η = w. (6) [sent-90, score-0.908]
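Combining the compatible value function (4) with the identity η = w in (6), one improvement step of a natural actor-critic reduces to a least-squares fit followed by a gradient step on θ. A simplified sketch (ours; it assumes sampled advantage targets for the current policy are already available, whereas real implementations such as [4] estimate them from trajectories):

```python
import numpy as np

def gibbs_policy(theta, phi_s):
    """pi(a|s, theta) proportional to exp(theta^T phi(s, a)), as in (5). phi_s: (num_actions, d)."""
    z = phi_s @ theta
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def natural_actor_critic_step(theta, psi, advantage_targets, alpha):
    """psi: (N, d) compatible features for sampled (s, a) pairs.
    The least-squares weights w are the natural gradient estimate (6);
    the policy is then updated as theta <- theta + alpha * w."""
    w, *_ = np.linalg.lstsq(psi, advantage_targets, rcond=None)
    return theta + alpha * w, w
```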
39 3 The policy oscillation phenomenon It is well known that greedy policy iteration can be non-convergent under approximations. [sent-95, score-2.173]
40 The widely used projected equation approach can manifest convergence behavior that is complex and not well understood, including bounded but potentially severe sustained policy oscillations [7, 8, 18]. [sent-96, score-0.965]
41 It is important to remember that sustained policy oscillation can take place even under (asymptotic) value function convergence (e. [sent-100, score-1.279]
42 Continuously soft-greedy action selection (which is essentially a step toward the policy ... (footnote 1: The approach in [4] is needed to permit temporal difference evaluation in this case.) [sent-104, score-0.883]
43 A notable approach is introduced in [7] wherein it is also shown that the suboptimality bound for a converging policy sequence is much better. [sent-106, score-0.749]
44 Interestingly, for the special case of Monte Carlo estimation of action values, it is also possible to establish convergence by solely modifying the exploration scheme, which is known as consistent exploration [23] or MCESP [24]. [sent-107, score-0.204]
45 The setting likely to be the simplest possible in which oscillation occurs even with Monte Carlo policy evaluation is depicted in Figure 1. [sent-117, score-1.146]
46 The actions al and ar are available in the decision states s1 and s2 . [sent-119, score-0.211]
47 The only reward is obtained with the decision sequence (s1 , al ; s2 , ar ). [sent-121, score-0.227]
48 Greedy value function methods that operate without state estimation will oscillate between the policies π(y1 ) = al and π(y1 ) = ar , excluding the exceptions mentioned above. [sent-122, score-0.259]
49 The action values are approximated with Q̃(s1, al) = w1,l, Q̃(s2, al) = 0.5w2,l, [sent-131, score-0.203]
50 Q̃(s3, al) = w2,l, Q̃(s1, ar) = w1,r, Q̃(s2, ar) = 0. [sent-133, score-0.254]
51 The only reward is obtained with the decision sequence (s1 , al ; s2 , ar ; s3 , al ). [sent-136, score-0.293]
52 A detailed description of the oscillation phenomenon can be found in [8, §6. [sent-141, score-0.532]
53 4 Approximations and attractive stochastic policies In this section, we briefly and informally examine how policy oscillation arises in the examples in Section 3. [sent-145, score-1.261]
54 In all cases, oscillation is caused by the presence of an attractive stochastic policy, these attractors being induced by approximations. [sent-146, score-0.561]
55 1), the policy class is incapable of representing differing action distributions for the same observation with differing histories. [sent-148, score-0.82]
56 This makes the optimal sequence (y1 , al ; y1 , ar ) inexpressible for deterministic policies, whereas a stochastic policy can still emit it every now and then by chance. [sent-149, score-0.948]
57 3, the same situation is arrived at due to the insufficient capacity of the approximator: the specified value function approximator cannot express such value estimates that would lead to an implicit greedy policy that attains the optimal sequence (s1, al; s2, ar; s3, al). [sent-151, score-1.189]
58 Generally speaking, in these cases, oscillation follows from a mismatch between the main policy class and the exploration policy class: stochastic exploration can occasionally reach the reward, but the deterministic main policy is incapable of exploiting this opportunity. [sent-152, score-2.764]
59 , it gains variance reduction in policy evaluation by making the Markov assumption. [sent-157, score-0.748]
60 Generally speaking, oscillation results in this case from perceived but non-existent improvement opportunities that vanish once an attempt is made to exploit them. [sent-163, score-0.466]
61 In summary, stochastic policies can become attractive due to deterministically unreachable or completely non-existing improvement opportunities that appear in the value function. [sent-165, score-0.248]
62 5 Policy oscillation as sustained overshooting In this section, we focus more carefully on how attractive stochastic policies lead to sustained policy oscillation when viewed within the policy gradient framework. [sent-167, score-2.712]
63 We begin by looking at a natural actor-critic algorithm that uses the Gibbs policy class (5). [sent-168, score-0.748]
64 We iterate by fully estimating Ak in (4) for the current policy π(θk), as shown in [4], and then a gradient update is performed using (6): θk+1 = θk + αηk. (7) [sent-169, score-0.846]
65 Now let us consider some policy π(θk) from such a policy sequence generated by (7) and denote the corresponding value function estimate by Ãk and the natural gradient estimate by ηk. [sent-170, score-1.576]
66 It is shown in [13] that taking a very long step in the direction of the natural gradient ηk will approach in the limit a greedy update (1) for the value function Ãk: lim π(a|s, θk + αηk) = π∗(a|s, Ãk), α → ∞, ‖θk‖ ↛ ∞, η ≠ 0, ∀s, a. (8) [sent-171, score-0.333]
67 The resulting policy will have the form π(a|s, θk + αηk) ∝ exp(θk⊤φ(s, a) + αηk⊤φ(s, a)). [sent-172, score-0.708]
68 Thus, this type of a greedy update is a special case of a natural gradient update in which the step-size approaches infinity. [sent-174, score-0.301]
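The limiting behaviour is easy to check numerically; the toy snippet below (purely illustrative, with made-up numbers) shows the Gibbs policy over θk + αηk collapsing onto the action favoured by ηk as α grows, i.e. the very-long-step natural gradient update emulating the greedy update (1).

```python
import numpy as np

phi = np.eye(3)                      # one indicator feature per action in this toy check
theta = np.array([0.2, -0.1, 0.4])   # current (finite) policy parameters
eta = np.array([0.0, 1.0, 0.3])      # natural gradient estimate, favouring action 1

def gibbs(values):
    z = values - values.max()
    p = np.exp(z)
    return p / p.sum()

for alpha in (1.0, 10.0, 1000.0):
    print(alpha, gibbs(phi @ (theta + alpha * eta)))
# As alpha grows, the distribution concentrates on argmax_a eta^T phi(s, a).
```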
69 Thus, natural gradient based policy iteration using such a very large but constant step-size does not approach greedy value function based policy iteration after the first such iteration. [sent-176, score-1.904]
70 Little is needed, however, to make the equality apply in the case of full policy iteration. [sent-177, score-0.708]
71 Let π(θk ) denote the kth policy obtained from (7) using the step-sizes α[0,k−1] and natural gradients η[0,k−1] . [sent-180, score-0.77]
72 Let π∗(wk) denote the kth policy obtained from (1) with infinitely small added softness and using a value function (4), with φ∗ = φ and Ã(w0) being evaluated for π(θ0). [sent-181, score-0.755]
73 In the following, a more practical alternative is discussed that both avoids the related numerical issues and that allows gradual interpolation back toward conventional policy gradient iteration. [sent-194, score-0.87]
74 The constraint affects the policy π(θk+1) only when ‖θk + αηk‖ > c, in which case the magnitude of the parameter vector is scaled down with a factor τc so that it becomes equal to c. [sent-209, score-0.733]
75 This has a diminishing effect on the resulting policy as c → ∞ because the Gibbs distribution becomes increasingly insensitive to scaling of the parameter vector when its magnitude approaches infinity: lim π(τc θ) = π(θ), ∀τc, θ such that ‖τc θ‖ → ∞, ‖θ‖ → ∞. [sent-210, score-0.762]
76 If the soft-greedy method uses (4) for policy evaluation, then exact equivalence in the limit is obtained when c → ∞ while maintaining α/c → ∞. [sent-212, score-0.728]
77 Lowering α interpolates toward a conventional natural gradient method. [sent-213, score-0.202]
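A minimal sketch of the constraining step described above, assuming an update θk+1 = θk + αηk followed by rescaling whenever the parameter norm exceeds c (the function name and interface are ours, not the paper's):

```python
import numpy as np

def constrained_nac_update(theta, eta, alpha, c):
    """Natural gradient step followed by the norm constraint: scale theta down to norm c if needed.
    Very large alpha with finite c emulates (soft-)greedy updates; lowering alpha
    interpolates back toward a conventional natural gradient step."""
    theta_new = theta + alpha * eta
    norm = np.linalg.norm(theta_new)
    if norm > c:
        theta_new = theta_new * (c / norm)   # the scaling factor tau_c of the text
    return theta_new
```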
78 Greedy policy iteration searches in the space of deterministic policies. [sent-215, score-0.844]
79 As noted, the sequence of greedy policies that is generated by such a process can be approximated arbitrarily closely with the Gibbs policy class (2) with τ → 0. [sent-216, score-0.927]
80 For this class, the parameters of all deterministic policies lie at infinity in different directions in the parameter space, whereas stochastic policies are obtained with finite parameter values (except for vanishingly narrow directions along diagonals). [sent-217, score-0.228]
81 Based on Theorems 1 and 2, we observe that the policy sequence that results from these jumps can be approximated arbitrarily closely with a natural actor-critic method using very large step-sizes. [sent-219, score-0.771]
82 Although we ignore the latter requirement, we note that the former requirement is satisfied when only deterministic attractors exist in the Gibbs policy space. [sent-223, score-0.881]
83 However, when these conditions do not hold and there is an attractor in the policy space that corresponds to a stochastic policy, there is a finite attractor in the parameter space that resides inside the ∞-radius sphere. [sent-228, score-0.845]
84 In essence, the sustained policy oscillation that results from using very large step-sizes or greedy updates in the latter two cases (2. [sent-246, score-1.36]
85 2b) is caused by sustained overshooting over the finite attractor in the policy parameter space. [sent-248, score-0.915]
86 This is because whatever initial improvement speed can be achieved with the latter due to greedy updates, the same speed can be also achieved with the former using the same basis together with very long steps and constraining. [sent-250, score-0.286]
87 This effectively corresponds to an attempt to exploit whatever remains of the special structure of a Markovian problem, making the use of a very large α in constrained policy improvement analogous to using a small λ in policy evaluation. [sent-251, score-1.483]
88 6 Empirical results In this section, we apply several variants of the natural actor-critic algorithm and some greedy policy iteration algorithms to the Tetris problem using the standard features from [12]. [sent-253, score-0.973]
89 For policy improvement, we use the original natural actor-critic (NAC) from [4], a constrained one (CNAC) that uses (11) and a very large α, and a soft-greedy policy iteration algorithm (SGPI) that uses (2). [sent-254, score-1.559]
90 For policy evaluation, we use LSPE [28] and an SVD-based batch solver (pinv). [sent-255, score-0.708]
91 5I and the policy was updated after every 100th episode. [sent-257, score-0.708]
92 We used a simple initial policy (θmaxh = θholes = −3) that scores around 20 points. [sent-259, score-0.708]
93 1 shows that with soft-greedy policy iteration (SGPI), it is in fact possible to avoid policy degradation by using a suitable amount of softening. [sent-281, score-1.577]
94 The algorithm can indeed emulate greedy updates (SGPI) and the associated policy degradation. [sent-284, score-0.852]
95 Currently, we expect that the policy oscillation or chattering phenomenon is not the main cause of either policy degradation or stagnation in this problem. [sent-303, score-2.091]
96 Instead, it seems that, for both greedy and gradient approaches, the explanation is related to numerical instabilities that possibly stem both from the involved estimators and from insufficient exploration. [sent-304, score-0.237]
97 Approximate policy iteration: A survey and some new methods. [sent-346, score-0.708]
98 Temporal differences-based policy iteration and applications in neurodynamic programming. [sent-377, score-0.811]
99 Q-learning and enhanced policy iteration in discounted dynamic programming. [sent-468, score-0.872]
100 Least squares policy evaluation algorithms with linear function approximation. [sent-475, score-0.748]
wordName wordTfidf (topN-words)
[('policy', 0.708), ('oscillation', 0.398), ('tetris', 0.17), ('cnac', 0.136), ('nac', 0.134), ('phenomenon', 0.134), ('greedy', 0.122), ('sgpi', 0.119), ('sustained', 0.11), ('iteration', 0.103), ('gradient', 0.095), ('ar', 0.094), ('reinforcement', 0.082), ('policies', 0.074), ('qk', 0.07), ('wk', 0.068), ('al', 0.066), ('ak', 0.064), ('dynamic', 0.061), ('attractors', 0.06), ('exploration', 0.059), ('approximator', 0.059), ('degradation', 0.058), ('oscillations', 0.055), ('programming', 0.055), ('bertsekas', 0.052), ('chattering', 0.051), ('nity', 0.051), ('action', 0.048), ('former', 0.047), ('stochastic', 0.047), ('improvement', 0.046), ('attractor', 0.045), ('reward', 0.042), ('observability', 0.041), ('suboptimality', 0.041), ('carlo', 0.04), ('natural', 0.04), ('evaluation', 0.04), ('monte', 0.039), ('approximate', 0.039), ('convergence', 0.038), ('toward', 0.038), ('susceptible', 0.037), ('markovian', 0.036), ('gibbs', 0.035), ('lspe', 0.034), ('stagnation', 0.034), ('thiery', 0.034), ('landscape', 0.034), ('attractive', 0.034), ('deterministic', 0.033), ('requirement', 0.033), ('optima', 0.032), ('speaking', 0.032), ('aalto', 0.03), ('busoniu', 0.03), ('farias', 0.03), ('icga', 0.03), ('maxb', 0.03), ('overshooting', 0.03), ('lim', 0.029), ('conventional', 0.029), ('severe', 0.027), ('lowering', 0.027), ('grossly', 0.027), ('manifest', 0.027), ('temporal', 0.027), ('actions', 0.026), ('finland', 0.026), ('value', 0.025), ('speed', 0.025), ('decision', 0.025), ('keeps', 0.025), ('magnitude', 0.025), ('dominating', 0.024), ('incapable', 0.024), ('implicit', 0.024), ('discount', 0.023), ('casting', 0.023), ('massachusetts', 0.023), ('approximated', 0.023), ('update', 0.022), ('caused', 0.022), ('kth', 0.022), ('updates', 0.022), ('opportunities', 0.022), ('permit', 0.022), ('understanding', 0.021), ('established', 0.021), ('whatever', 0.021), ('optimistic', 0.021), ('score', 0.021), ('fully', 0.021), ('athena', 0.02), ('differing', 0.02), ('mismatch', 0.02), ('st', 0.02), ('equivalence', 0.02), ('explanation', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000012 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Author: Paul Wagner
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1
2 0.36744022 215 nips-2011-Policy Gradient Coagent Networks
Author: Philip S. Thomas
Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1
3 0.33889592 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
Author: Amir-massoud Farahmand
Abstract: Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an ˆ agent following the greedy policy π with respect to an action-value function Q, the ˆ ˆ ˆ performance loss E V ∗ (X) − V π (X) is upper bounded by O( Q − Q∗ 1+ζ ), ∞ in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. For ζ > 0, our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms. 1
4 0.27795753 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
Author: Jaedeug Choi, Kee-eung Kim
Abstract: The difficulty in inverse reinforcement learning (IRL) arises in choosing the best reward function since there are typically an infinite number of reward functions that yield the given behaviour data as optimal. Using a Bayesian framework, we address this challenge by using the maximum a posteriori (MAP) estimation for the reward function, and show that most of the previous IRL algorithms can be modeled into our framework. We also present a gradient method for the MAP estimation based on the (sub)differentiability of the posterior distribution. We show the effectiveness of our approach by comparing the performance of the proposed method to those of the previous algorithms. 1
5 0.26604065 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
Author: Zhan Lim, Lee Sun, Daniel J. Hsu
Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1
6 0.20465659 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
7 0.20337938 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
8 0.19433522 283 nips-2011-The Fixed Points of Off-Policy TD
9 0.19224146 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments
10 0.16505949 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
11 0.14272735 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
12 0.1204536 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
13 0.10773539 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
14 0.10116088 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
15 0.10024027 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
16 0.09700314 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
17 0.095986255 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation
18 0.092991635 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
19 0.08675988 56 nips-2011-Committing Bandits
20 0.082179412 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
topicId topicWeight
[(0, 0.175), (1, -0.243), (2, 0.127), (3, 0.297), (4, -0.387), (5, 0.11), (6, -0.004), (7, 0.047), (8, -0.079), (9, -0.015), (10, 0.038), (11, 0.041), (12, -0.085), (13, -0.081), (14, -0.15), (15, -0.093), (16, 0.086), (17, 0.122), (18, 0.039), (19, 0.035), (20, -0.073), (21, -0.071), (22, 0.061), (23, 0.014), (24, -0.109), (25, -0.062), (26, 0.028), (27, 0.026), (28, 0.042), (29, 0.036), (30, 0.078), (31, 0.049), (32, 0.083), (33, 0.024), (34, -0.005), (35, 0.017), (36, -0.037), (37, 0.066), (38, 0.0), (39, 0.031), (40, -0.048), (41, 0.019), (42, -0.022), (43, 0.003), (44, -0.017), (45, 0.067), (46, -0.068), (47, 0.003), (48, -0.01), (49, 0.092)]
simIndex simValue paperId paperTitle
same-paper 1 0.98285937 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Author: Paul Wagner
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1
2 0.89482611 215 nips-2011-Policy Gradient Coagent Networks
Author: Philip S. Thomas
Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1
3 0.84663063 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
Author: Tingting Zhao, Hirotaka Hachiya, Gang Niu, Masashi Sugiyama
Abstract: Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is more preferable than REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments. 1
4 0.79465467 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
Author: Oliver B. Kroemer, Jan R. Peters
Abstract: In this paper, we consider the problem of policy evaluation for continuousstate systems. We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. The true form of the value function for this model can be determined, and can be computed using Galerkin’s method. Furthermore, we also present a unified view of several well-known policy evaluation methods. In particular, we show that the same Galerkin method can be used to derive Least-Squares Temporal Difference learning, Kernelized Temporal Difference learning, and a discrete-state Dynamic Programming solution, as well as our proposed method. In a numerical evaluation of these algorithms, the proposed approach performed better than the other methods. 1
5 0.79021937 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments
Author: Javad Azimi, Alan Fern, Xiaoli Z. Fern
Abstract: Budgeted optimization involves optimizing an unknown function that is costly to evaluate by requesting a limited number of function evaluations at intelligently selected inputs. Typical problem formulations assume that experiments are selected one at a time with a limited total number of experiments, which fail to capture important aspects of many real-world problems. This paper defines a novel problem formulation with the following important extensions: 1) allowing for concurrent experiments; 2) allowing for stochastic experiment durations; and 3) placing constraints on both the total number of experiments and the total experimental time. We develop both offline and online algorithms for selecting concurrent experiments in this new setting and provide experimental results on a number of optimization benchmarks. The results show that our algorithms produce highly effective schedules compared to natural baselines. 1
6 0.78958368 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
7 0.76377422 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
8 0.74138778 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
9 0.69982362 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
10 0.63924462 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
11 0.58742809 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation
12 0.51993865 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
13 0.46193838 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
14 0.41077542 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
15 0.40225232 283 nips-2011-The Fixed Points of Off-Policy TD
16 0.38222358 256 nips-2011-Solving Decision Problems with Limited Information
17 0.34341979 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
18 0.33394337 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
19 0.33257326 268 nips-2011-Speedy Q-Learning
20 0.31711373 155 nips-2011-Learning to Agglomerate Superpixel Hierarchies
topicId topicWeight
[(0, 0.02), (4, 0.035), (11, 0.019), (20, 0.039), (23, 0.186), (26, 0.026), (31, 0.112), (33, 0.014), (43, 0.045), (45, 0.122), (57, 0.04), (65, 0.014), (74, 0.078), (83, 0.04), (99, 0.11)]
simIndex simValue paperId paperTitle
1 0.92285681 196 nips-2011-On Strategy Stitching in Large Extensive Form Multiplayer Games
Author: Richard G. Gibson, Duane Szafron
Abstract: Computing a good strategy in a large extensive form game often demands an extraordinary amount of computer memory, necessitating the use of abstraction to reduce the game size. Typically, strategies from abstract games perform better in the real game as the granularity of abstraction is increased. This paper investigates two techniques for stitching a base strategy in a coarse abstraction of the full game tree, to expert strategies in fine abstractions of smaller subtrees. We provide a general framework for creating static experts, an approach that generalizes some previous strategy stitching efforts. In addition, we show that static experts can create strong agents for both 2-player and 3-player Leduc and Limit Texas Hold’em poker, and that a specific class of static experts can be preferred among a number of alternatives. Furthermore, we describe a poker agent that used static experts and won the 3-player events of the 2010 Annual Computer Poker Competition.
2 0.90383041 62 nips-2011-Continuous-Time Regression Models for Longitudinal Networks
Author: Duy Q. Vu, David Hunter, Padhraic Smyth, Arthur U. Asuncion
Abstract: The development of statistical models for continuous-time longitudinal network data is of increasing interest in machine learning and social science. Leveraging ideas from survival and event history analysis, we introduce a continuous-time regression modeling framework for network event data that can incorporate both time-dependent network statistics and time-varying regression coefficients. We also develop an efficient inference scheme that allows our approach to scale to large networks. On synthetic and real-world data, empirical results demonstrate that the proposed inference approach can accurately estimate the coefficients of the regression model, which is useful for interpreting the evolution of the network; furthermore, the learned model has systematically better predictive performance compared to standard baseline methods.
same-paper 3 0.83320898 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Author: Paul Wagner
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1
4 0.73996818 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
5 0.73398054 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
Author: Alex K. Susemihl, Ron Meir, Manfred Opper
Abstract: Bayesian filtering of stochastic stimuli has received a great deal of attention recently. It has been applied to describe the way in which biological systems dynamically represent and make decisions about the environment. There have been no exact results for the error in the biologically plausible setting of inference on point process, however. We present an exact analysis of the evolution of the meansquared error in a state estimation task using Gaussian-tuned point processes as sensors. This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. This is done for Markovian and a class of non-Markovian Gaussian processes. We find that there is an optimal tuning width for which the error is minimized. This leads to a characterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing. 1
6 0.73345149 66 nips-2011-Crowdclustering
7 0.72936612 263 nips-2011-Sparse Manifold Clustering and Embedding
8 0.7248376 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
9 0.72401655 102 nips-2011-Generalised Coupled Tensor Factorisation
10 0.72204947 180 nips-2011-Multiple Instance Filtering
11 0.71955174 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
12 0.71905756 271 nips-2011-Statistical Tests for Optimization Efficiency
13 0.71873122 258 nips-2011-Sparse Bayesian Multi-Task Learning
14 0.71756589 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
15 0.71612763 283 nips-2011-The Fixed Points of Off-Policy TD
16 0.71601284 303 nips-2011-Video Annotation and Tracking with Active Learning
17 0.71503258 32 nips-2011-An Empirical Evaluation of Thompson Sampling
18 0.71261424 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
19 0.71188831 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
20 0.71139306 270 nips-2011-Statistical Performance of Convex Tensor Decomposition