nips nips2002 nips2002-159 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ralf Schoknecht
Abstract: There are several reinforcement learning algorithms that yield approximate solutions for the problem of policy evaluation when the value function is represented with a linear function approximator. In this paper we show that each of the solutions is optimal with respect to a specific objective function. Moreover, we characterise the different solutions as images of the optimal exact value function under different projection operations. The results presented here will be useful for comparing the algorithms in terms of the error they achieve relative to the error of the optimal approximate solution. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract There are several reinforcement learning algorithms that yield approximate solutions for the problem of policy evaluation when the value function is represented with a linear function approximator. [sent-4, score-0.772]
2 In this paper we show that each of the solutions is optimal with respect to a specific objective function. [sent-5, score-0.443]
3 Moreover, we characterise the different solutions as images of the optimal exact value function under different projection operations. [sent-6, score-0.551]
4 The results presented here will be useful for comparing the algorithms in terms of the error they achieve relative to the error of the optimal approximate solution. [sent-7, score-0.386]
5 1 Introduction In large domains the determination of an optimal value function via a tabular representation is no longer feasible with respect to time and memory considerations. [sent-8, score-0.259]
6 Therefore, reinforcement learning (RL) algorithms are combined with linear function approximation schemes. [sent-9, score-0.196]
7 However, the different RL algorithms, which all achieve the same optimal solution in the tabular case, converge to different solutions when combined with function approximation. [sent-10, score-0.542]
8 One reason is that a characterisation of the different solutions in terms of the objective functions they optimise is partly missing. [sent-14, score-0.381]
9 In this paper we state objective functions for the TD(0) algorithm [9], the LSTD algorithm [4, 3] and the residual gradient algorithm [1] applied to the problem of policy evaluation, i.e. [sent-15, score-0.567]
10 the determination of the value function for a fixed policy. [sent-17, score-0.16]
11 Moreover, we characterise the different solutions as images of the optimal exact value function under different projection operations. [sent-18, score-0.551]
12 We think that an analysis of the different optimisation criteria and the projection operations will be useful for determining the errors that the different algorithms achieve relative to the error of the theoretically optimal approximate solution. [sent-19, score-0.502]
13 This will yield a criterion for selecting an optimal RL algorithm. [sent-20, score-0.18]
14 For the TD(0) algorithm such error bounds with respect to a specific norm are already known [2, 10], but for the other algorithms there are no comparable results. [sent-21, score-0.3]
15 V^μ_i denotes the value of state s_i, P^μ_{ij} = p(s_i, s_j, μ(s_i)), R^μ_i = E{r(s_i, μ(s_i))} and γ is the discount factor. [sent-23, score-0.044]
16 As the policy μ is fixed we will omit it in the following to make notation easier. [sent-24, score-0.316]
17 The fixed point V* of equation (1) can be determined iteratively with an operator T: R^N → R^N by T V^n = V^{n+1} = γ P V^n + R. [sent-25, score-0.145]
18 (2) This iteration converges to a unique fixed point [2], which is given by V* = (I − γP)^{-1} R, (3) where (I − γP) is invertible for every stochastic matrix P. [sent-26, score-0.381]
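As a concrete illustration of equations (2) and (3), the following sketch (not from the paper; the small Markov reward process and all variable names are made up for the example) computes V* both by iterating the operator T and by solving the linear system directly.

import numpy as np

rng = np.random.default_rng(0)
N, gamma = 5, 0.9                       # number of states and discount factor
P = rng.random((N, N))
P /= P.sum(axis=1, keepdims=True)       # row-stochastic transition matrix under the fixed policy
R = rng.random(N)                       # expected immediate rewards R_i

# Direct solution of equation (3): V* = (I - gamma P)^{-1} R
V_star = np.linalg.solve(np.eye(N) - gamma * P, R)

# Iterative solution via the operator T of equation (2): V^{n+1} = gamma P V^n + R
V = np.zeros(N)
for _ in range(1000):
    V = gamma * P @ V + R

assert np.allclose(V, V_star, atol=1e-6)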
19 3 Approximate Policy Evaluation If the state space S gets too large, the exact solution of equation (1) becomes very costly with respect to both memory and computation time. [sent-27, score-0.273]
20 Φ = (φ_1, ..., φ_F) ∈ R^{N×F} is the matrix with the basis functions as columns. [sent-35, score-0.055]
21 3.1 The Optimal Approximate Solution If the transition probability matrix P were known, then the optimal exact solution V* = (I − γP)^{-1} R could be computed directly. [sent-38, score-0.396]
22 The optimal approximation to this solution is obtained by minimising ‖Φw − V*‖ with respect to w. [sent-39, score-0.4]
23 Generally, a symmetric positive definite matrix D can be used to define a norm according to ‖V‖²_D = V^T D V. [sent-41, score-0.154]
24 The optimal solution that can be achieved with the linear function approximator Φw is then the orthogonal projection of V* onto [Φ], i.e. [sent-43, score-0.381]
25 Then the orthogonal projection onto [Φ] according to the norm ‖·‖_D is defined as Π_D = Φ(Φ^T D Φ)^{-1} Φ^T D. [sent-47, score-0.268]
26 We denote the optimal approximate solution by V^{SL}_D = Π_D V*. [sent-48, score-0.302]
27 Here, SL stands for supervised learning because w^{SL}_D = argmin_{w ∈ R^F} (1/2) ‖Φw − V*‖²_D (4) minimises the weighted quadratic error (1/2) (Φ w^{SL}_D − V*)^T D (Φ w^{SL}_D − V*) = (1/2) ‖V^{SL}_D − V*‖²_D (5) for a given D and V*, which is the objective of a supervised learning method. [sent-50, score-0.251]
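The projection view can be made concrete with a short numerical sketch (my own illustration; the features Phi and the uniform weighting distribution are assumptions for the example, not taken from the paper). It builds the projection Π_D and checks that the SL solution of equations (4) and (5) equals Π_D V*.

import numpy as np

rng = np.random.default_rng(1)
N, F, gamma = 5, 2, 0.9
P = rng.random((N, N)); P /= P.sum(axis=1, keepdims=True)
R = rng.random(N)
V_star = np.linalg.solve(np.eye(N) - gamma * P, R)

Phi = rng.random((N, F))                # feature matrix with the basis functions as columns
rho = np.full(N, 1.0 / N)               # weighting distribution (assumed uniform here)
D = np.diag(rho)                        # symmetric positive definite weighting matrix

# Orthogonal projection onto [Phi] w.r.t. ||.||_D: Pi_D = Phi (Phi^T D Phi)^{-1} Phi^T D
Pi_D = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)

# SL solution: w minimises (1/2) ||Phi w - V*||_D^2, so V_D^SL = Phi w = Pi_D V*
w_sl = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ V_star)
assert np.allclose(Phi @ w_sl, Pi_D @ V_star)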
28 Note that V* equals the expected discounted accumulated reward along a sampled trajectory under the fixed policy μ, i.e. [sent-51, score-0.431]
29 These are exactly the samples obtained by the TD(1) algorithm [9]. [sent-54, score-0.039]
30 Thus, the TD(1) solution is equivalent to the optimal approximate solution. [sent-55, score-0.302]
31 Let p be a probability distribution on the state space S. [sent-58, score-0.044]
32 Furthermore, let x_i be sampled according to p, z_i be sampled according to P(x_i, ·) and r_i be sampled according to r(x_i). [sent-59, score-0.353]
33 We use E_p[·] to denote the expectation with respect to the distribution p. [sent-61, score-0.038]
34 In particular, the stochastic TD(0) algorithm converges if and only if the deterministic algorithm (9) converges. [sent-66, score-0.271]
35 Furthermore, if both algorithms converge they converge to the same fixed point. [sent-67, score-0.286]
36 An iteration of the form (9) converges if all eigenvalues of the matrix I + αA^{TD}_p lie within the unit circle [5]. [sent-68, score-0.385]
37 For a matrix A^{TD}_p that has only eigenvalues with negative real part and a learning rate α_t that decays according to (8), there is a t* such that the eigenvalues of I + α_t A^{TD}_p lie inside the unit circle for all t > t*. [sent-69, score-0.442]
38 Hence, for a decaying learning rate the deterministic TD(0) algorithm converges if all eigenvalues of A^{TD}_p have a negative real part. [sent-70, score-0.3]
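This eigenvalue condition is easy to check numerically. The sketch below assumes the common form A^{TD}_p = Phi^T D_p (gamma P - I) Phi for the deterministic TD(0) iteration matrix, which the excerpt does not spell out; the function name and constants are illustrative only.

import numpy as np

def td0_convergence_check(Phi, P, R, rho, gamma=0.9, alpha=0.1, steps=5000):
    """Deterministic TD(0) iteration w <- w + alpha (A w + b) plus the eigenvalue test."""
    D = np.diag(rho)
    A = Phi.T @ D @ (gamma * P - np.eye(len(rho))) @ Phi    # assumed iteration matrix A^TD_p
    b = Phi.T @ D @ R
    # Negative real parts of all eigenvalues => convergence for a suitably decaying learning rate.
    stable = bool(np.all(np.real(np.linalg.eigvals(A)) < 0))
    w = np.zeros(Phi.shape[1])
    for _ in range(steps):                                  # constant-alpha iteration for illustration
        w = w + alpha * (A @ w + b)
        if not np.all(np.isfinite(w)):                      # divergence, as in the counterexample of [1]
            break
    return stable, w

For a constant learning rate one would instead check that the eigenvalues of I + alpha A^{TD}_p lie inside the unit circle, as stated above.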
39 Since this requirement is not always fulfilled, the TD algorithm possibly diverges, as shown in [1]. [sent-71, score-0.111]
40 This divergence is due to the positive eigenvalues of A^{TD}_p [8]. [sent-72, score-0.125]
41 However, under special assumptions convergence of the TD(0) algorithm can be shown [2]. [sent-73, score-0.039]
42 Let the feature matrix Φ ∈ R^{N×F} have full rank, where F ≤ N, i.e. [sent-74, score-0.055]
43 This results in no loss of generality because the linearly dependent columns of Φ [...] with dimension larger than zero that minimises ‖V^{RG}_p − V*‖_{D^{RG}_p}, because ‖·‖_{D^{RG}_p} is now only a semi-norm. [sent-77, score-0.069]
44 p p p But for all minimising Vj}G the Bellman error is the same, i. [sent-78, score-0.191]
45 with respect to the Bellman error all the solutions V^{RG}_p are equivalent [8] (Proposition 1). [sent-80, score-0.26]
46 3.5 Synopsis of the Different Solutions In Table 1 we give a brief overview of the solutions that the different RL algorithms yield. [sent-82, score-0.25]
47 An SL solution can be computed for arbitrary weighting matrices D_p induced by a sampling distribution p. [sent-83, score-0.158]
48 For the three RL algorithms (TD, LSTD, RG) solvability conditions can be either formulated in terms of the eigenvalues of the iteration matrix A or in terms of the sampling distribution p. [sent-84, score-0.406]
49 The iterative TD(0) algorithm has the most restrictive conditions for solvability, both for the eigenvalues of the iteration matrix A, whose real parts must be smaller than zero, and for the sampling distribution p, which must equal the steady-state distribution π. [sent-85, score-0.417]
50 The LSTD method only requires invertibility of the corresponding iteration matrix. This is satisfied if Φ has full column rank and if the visitation distribution p samples every state s infinitely often, i.e. [sent-86, score-0.228]
51 In contrast, the residual gradient algorithm converges independently of p and the concrete A^{RG}_p because all these matrices have eigenvalues with nonpositive real parts. [sent-89, score-0.391]
52 All solutions can be characterised as minimising a quadratic optimisation criterion ‖Φw − V*‖_D with corresponding matrix D. [sent-91, score-0.659]
53 The SL solution optimises the weighted quadratic error (5), RG optimises the weighted Bellman error (19) and both TD and LSTD optimise the quadratic function (14) with weighting matrices D^{TD}_p and D^{LSTD}_p respectively. [sent-92, score-0.636]
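Since equations (14) and (19) are not reproduced in this excerpt, the following sketch uses the standard closed forms of the SL, TD/LSTD and RG fixed points as an assumed reconstruction; it is meant only to make the comparison in Table 1 concrete, not to restate the paper's exact formulas.

import numpy as np

def policy_eval_solutions(Phi, P, R, rho, gamma=0.9):
    """Closed-form SL, TD/LSTD and RG solutions for a fixed policy (standard forms, assumed)."""
    N = len(R)
    D = np.diag(rho)
    V_star = np.linalg.solve(np.eye(N) - gamma * P, R)      # optimal exact solution
    M = (np.eye(N) - gamma * P) @ Phi                       # image of the features under (I - gamma P)

    w_sl   = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ V_star)  # weighted projection of V*
    w_lstd = np.linalg.solve(Phi.T @ D @ M,   Phi.T @ D @ R)       # TD(0)/LSTD fixed point
    w_rg   = np.linalg.solve(M.T @ D @ M,     M.T @ D @ R)         # minimiser of the weighted Bellman error

    return {"SL": Phi @ w_sl, "TD/LSTD": Phi @ w_lstd, "RG": Phi @ w_rg}, V_star

For full-rank Phi with F < N the three value functions generally differ, and their weighted distances to V* are exactly the quantities that the synopsis compares.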
54 For regular D_p, i.e. p(s) ≠ 0 for all s ∈ S, the solutions V can be characterised as images of the optimal solution V* under different orthogonal projections (optimal, RG) and projections that minimise a semi-norm (TD, LSTD). [sent-95, score-0.637]
55 For singular D_p, see the remarks on ambiguous solutions in section 3. [sent-96, score-0.181]
56 Let us finally discuss the case of a quasi-tabular representation of the value function that is obtained for regular Φ, and let all states be visited infinitely often, i.e. [sent-98, score-0.1]
57 Thus, the optimal solution V* is exactly representable because V* ∈ [Φ]. [sent-102, score-0.25]
58 Moreover, every projection operator Π : R^N → [Φ] reduces to the identity. [sent-103, score-0.138]
59 Therefore, all the projection operators for the different algorithms are equivalent to the identity. [sent-104, score-0.171]
60 Hence, with a quasi-tabular representation all the algorithms converge to the optimal solution V*. [sent-105, score-0.335]
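A quick check of the quasi-tabular case, again an illustrative sketch using the same assumed closed forms as above: with Phi equal to the identity every projection is the identity and every solution coincides with V*.

import numpy as np

rng = np.random.default_rng(2)
N, gamma = 4, 0.9
P = rng.random((N, N)); P /= P.sum(axis=1, keepdims=True)
R = rng.random(N)
V_star = np.linalg.solve(np.eye(N) - gamma * P, R)

Phi = np.eye(N)                          # quasi-tabular: one indicator feature per state
D = np.diag(np.full(N, 1.0 / N))         # every state visited, so D is regular
M = (np.eye(N) - gamma * P) @ Phi

V_sl   = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ V_star)
V_lstd = Phi @ np.linalg.solve(Phi.T @ D @ M, Phi.T @ D @ R)
V_rg   = Phi @ np.linalg.solve(M.T @ D @ M, M.T @ D @ R)

for V in (V_sl, V_lstd, V_rg):           # all projections reduce to the identity
    assert np.allclose(V, V_star)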
61 4 Conclusions We have presented an analysis of the solutions that are achieved by different reinforcement learning algorithms combined with linear function approximation. [sent-106, score-0.377]
62 The solutions of all the examined algorithms, TD(0), LSTD and the residual gradient algorithm, can be characterised as minimising different corresponding quadratic objective functions. [sent-107, score-0.713]
63 As a consequence, each of the value functions that one of the above algorithms converges to can be interpreted as the image of the optimal exact value function under a corresponding orthogonal projection. [sent-108, score-0.424]
64 In this general framework we have given the first characterisation of the approximate TD(0) solution in terms of the minimisation of a quadratic objective function. [sent-109, score-0.446]
65 This approach allows us to view the TD(0) solution as the exact solution of a projected learning problem. [sent-110, score-0.293]
66 Moreover, we have shown that the residual gradient solution and the optimal approximate solution only differ in the weighting of the error between the exact and the approximate solution. [sent-111, score-0.818]
67 In future research we intend to use the results presented here for determining the errors of the different solutions relative to the optimal approximate solution with respect to a given norm. [sent-112, score-0.521]
68 This will yield a criterion for selecting reinforcement learning algorithms that achieve optimal solution quality. [sent-113, score-0.513]
69 Learning to predict by the methods of temporal differences. [sent-162, score-0.036]
wordName wordTfidf (topN-words)
[('td', 0.482), ('lstd', 0.225), ('policy', 0.207), ('ij', 0.203), ('solutions', 0.181), ('cp', 0.166), ('rl', 0.16), ('bellman', 0.151), ('minimising', 0.15), ('lid', 0.13), ('tdij', 0.13), ('reinforcement', 0.127), ('eigenvalues', 0.125), ('optimal', 0.11), ('fixed', 0.109), ('zi', 0.105), ('characterised', 0.103), ('projection', 0.102), ('solution', 0.102), ('si', 0.101), ('residual', 0.093), ('rf', 0.09), ('approximate', 0.09), ('exact', 0.089), ('converges', 0.089), ('iidrg', 0.086), ('invertibility', 0.086), ('optimises', 0.086), ('solvability', 0.086), ('ri', 0.083), ('dp', 0.082), ('rg', 0.082), ('quadratic', 0.08), ('xi', 0.077), ('schoknecht', 0.075), ('characterisation', 0.075), ('iteration', 0.071), ('sixteenth', 0.069), ('characterise', 0.069), ('minimises', 0.069), ('algorithms', 0.069), ('orthogonal', 0.067), ('optimise', 0.064), ('infinitely', 0.064), ('iid', 0.064), ('reward', 0.064), ('evaluation', 0.063), ('objective', 0.061), ('norm', 0.06), ('tabular', 0.06), ('sl', 0.06), ('stochastic', 0.057), ('wn', 0.057), ('lj', 0.057), ('weighting', 0.056), ('matrix', 0.055), ('optimisation', 0.055), ('ii', 0.054), ('converge', 0.054), ('specific', 0.053), ('decays', 0.053), ('sampled', 0.051), ('determination', 0.051), ('vj', 0.049), ('deterministic', 0.047), ('ep', 0.046), ('moreover', 0.046), ('circle', 0.045), ('gradient', 0.045), ('state', 0.044), ('error', 0.041), ('iterative', 0.041), ('transition', 0.04), ('rn', 0.04), ('st', 0.04), ('according', 0.039), ('algorithm', 0.039), ('respect', 0.038), ('tsitsiklis', 0.038), ('minimisation', 0.038), ('vit', 0.038), ('ilkd', 0.038), ('ems', 0.038), ('syst', 0.038), ('aip', 0.038), ('lagoudakis', 0.038), ('abi', 0.038), ('fulfilled', 0.038), ('representable', 0.038), ('projections', 0.037), ('temporal', 0.036), ('operator', 0.036), ('regular', 0.036), ('achieve', 0.035), ('yield', 0.035), ('criterion', 0.035), ('bertsekas', 0.034), ('bradtke', 0.034), ('visitation', 0.034), ('diverges', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
Author: Ralf Schoknecht
Abstract: There are several reinforcement learning algorithms that yield approximate solutions for the problem of policy evaluation when the value function is represented with a linear function approximator. In this paper we show that each of the solutions is optimal with respect to a specific objective function. Moreover, we characterise the different solutions as images of the optimal exact value function under different projection operations. The results presented here will be useful for comparing the algorithms in terms of the error they achieve relative to the error of the optimal approximate solution. 1
2 0.58004189 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation
Author: Ralf Schoknecht, Artur Merke
Abstract: Convergence for iterative reinforcement learning algorithms like TD(0) depends on the sampling strategy for the transitions. However, in practical applications it is convenient to take transition data from arbitrary sources without losing convergence. In this paper we investigate the problem of repeated synchronous updates based on a fixed set of transitions. Our main theorem yields sufficient conditions of convergence for combinations of reinforcement learning algorithms and linear function approximation. This allows to analyse if a certain reinforcement learning algorithm and a certain function approximator are compatible. For the combination of the residual gradient algorithm with grid-based linear interpolation we show that there exists a universal constant learning rate such that the iteration converges independently of the concrete transition data. 1
3 0.25603822 3 nips-2002-A Convergent Form of Approximate Policy Iteration
Author: Theodore J. Perkins, Doina Precup
Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
4 0.24988046 199 nips-2002-Timing and Partial Observability in the Dopamine System
Author: Nathaniel D. Daw, Aaron C. Courville, David S. Touretzky
Abstract: According to a series of influential models, dopamine (DA) neurons signal reward prediction error using a temporal-difference (TD) algorithm. We address a problem not convincingly solved in these accounts: how to maintain a representation of cues that predict delayed consequences. Our new model uses a TD rule grounded in partially observable semi-Markov processes, a formalism that captures two largely neglected features of DA experiments: hidden state and temporal variability. Previous models predicted rewards using a tapped delay line representation of sensory inputs; we replace this with a more active process of inference about the underlying state of the world. The DA system can then learn to map these inferred states to reward predictions using TD. The new model can explain previously vexing data on the responses of DA neurons in the face of temporal variability. By combining statistical model-based learning with a physiologically grounded TD theory, it also brings into contact with physiology some insights about behavior that had previously been confined to more abstract psychological models.
5 0.13012727 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming
Author: Benjamin V. Roy, Daniela D. Farias
Abstract: This paper extends our earlier analysis on approximate linear programming as an approach to approximating the cost-to-go function in a discounted-cost dynamic program [6]. In this paper, we consider the average-cost criterion and a version of approximate linear programming that generates approximations to the optimal average cost and differential cost function. We demonstrate that a naive version of approximate linear programming prioritizes approximation of the optimal average cost and that this may not be well-aligned with the objective of deriving a policy with low average cost. For that, the algorithm should aim at producing a good approximation of the differential cost function. We propose a two-phase variant of approximate linear programming that allows for external control of the relative accuracy of the approximation of the differential cost function over different portions of the state space via state-relevance weights. Performance bounds suggest that the new algorithm is compatible with the objective of optimizing performance and provide guidance on appropriate choices for state-relevance weights.
6 0.12590322 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach
7 0.12584136 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum
8 0.10137776 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
9 0.10051455 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
10 0.095857725 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
11 0.08729165 20 nips-2002-Adaptive Caching by Refetching
12 0.083113819 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions
13 0.077416591 115 nips-2002-Informed Projections
14 0.076181568 134 nips-2002-Learning to Take Concurrent Actions
15 0.075987458 205 nips-2002-Value-Directed Compression of POMDPs
16 0.074513875 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games
17 0.071980663 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
18 0.07187824 165 nips-2002-Ranking with Large Margin Principle: Two Approaches
19 0.067021742 111 nips-2002-Independent Components Analysis through Product Density Estimation
20 0.063577987 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives
topicId topicWeight
[(0, -0.206), (1, -0.051), (2, -0.37), (3, -0.135), (4, -0.026), (5, -0.161), (6, 0.062), (7, -0.049), (8, 0.025), (9, 0.008), (10, 0.029), (11, -0.359), (12, -0.148), (13, 0.342), (14, 0.136), (15, -0.064), (16, -0.067), (17, 0.218), (18, -0.156), (19, -0.113), (20, 0.062), (21, 0.136), (22, 0.179), (23, 0.024), (24, 0.117), (25, 0.04), (26, 0.056), (27, -0.011), (28, -0.044), (29, 0.015), (30, -0.071), (31, -0.033), (32, -0.042), (33, 0.018), (34, -0.068), (35, 0.08), (36, 0.058), (37, -0.036), (38, 0.027), (39, 0.055), (40, 0.011), (41, -0.052), (42, -0.026), (43, -0.061), (44, 0.012), (45, 0.024), (46, -0.017), (47, 0.021), (48, -0.024), (49, 0.07)]
simIndex simValue paperId paperTitle
same-paper 1 0.96998233 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
Author: Ralf Schoknecht
Abstract: There are several reinforcement learning algorithms that yield approximate solutions for the problem of policy evaluation when the value function is represented with a linear function approximator. In this paper we show that each of the solutions is optimal with respect to a specific objective function. Moreover, we characterise the different solutions as images of the optimal exact value function under different projection operations. The results presented here will be useful for comparing the algorithms in terms of the error they achieve relative to the error of the optimal approximate solution. 1
2 0.96768504 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation
Author: Ralf Schoknecht, Artur Merke
Abstract: Convergence for iterative reinforcement learning algorithms like TD(0) depends on the sampling strategy for the transitions. However, in practical applications it is convenient to take transition data from arbitrary sources without losing convergence. In this paper we investigate the problem of repeated synchronous updates based on a fixed set of transitions. Our main theorem yields sufficient conditions of convergence for combinations of reinforcement learning algorithms and linear function approximation. This allows to analyse if a certain reinforcement learning algorithm and a certain function approximator are compatible. For the combination of the residual gradient algorithm with grid-based linear interpolation we show that there exists a universal constant learning rate such that the iteration converges independently of the concrete transition data. 1
3 0.61224961 199 nips-2002-Timing and Partial Observability in the Dopamine System
Author: Nathaniel D. Daw, Aaron C. Courville, David S. Touretzky
Abstract: According to a series of influential models, dopamine (DA) neurons signal reward prediction error using a temporal-difference (TD) algorithm. We address a problem not convincingly solved in these accounts: how to maintain a representation of cues that predict delayed consequences. Our new model uses a TD rule grounded in partially observable semi-Markov processes, a formalism that captures two largely neglected features of DA experiments: hidden state and temporal variability. Previous models predicted rewards using a tapped delay line representation of sensory inputs; we replace this with a more active process of inference about the underlying state of the world. The DA system can then learn to map these inferred states to reward predictions using TD. The new model can explain previously vexing data on the responses of DA neurons in the face of temporal variability. By combining statistical model-based learning with a physiologically grounded TD theory, it also brings into contact with physiology some insights about behavior that had previously been confined to more abstract psychological models.
4 0.55075771 3 nips-2002-A Convergent Form of Approximate Policy Iteration
Author: Theodore J. Perkins, Doina Precup
Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
5 0.43624917 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming
Author: Benjamin V. Roy, Daniela D. Farias
Abstract: This paper extends our earlier analysis on approximate linear programming as an approach to approximating the cost-to-go function in a discounted-cost dynamic program [6]. In this paper, we consider the average-cost criterion and a version of approximate linear programming that generates approximations to the optimal average cost and differential cost function. We demonstrate that a naive version of approximate linear programming prioritizes approximation of the optimal average cost and that this may not be well-aligned with the objective of deriving a policy with low average cost. For that, the algorithm should aim at producing a good approximation of the differential cost function. We propose a two-phase variant of approximate linear programming that allows for external control of the relative accuracy of the approximation of the differential cost function over different portions of the state space via state-relevance weights. Performance bounds suggest that the new algorithm is compatible with the objective of optimizing performance and provide guidance on appropriate choices for state-relevance weights.
6 0.35534951 179 nips-2002-Scaling of Probability-Based Optimization Algorithms
7 0.35220486 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum
8 0.31906396 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
9 0.31152093 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
10 0.2900213 205 nips-2002-Value-Directed Compression of POMDPs
11 0.27724478 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach
12 0.27231655 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions
13 0.25260118 20 nips-2002-Adaptive Caching by Refetching
14 0.23935162 115 nips-2002-Informed Projections
15 0.23627436 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing
16 0.2276625 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
17 0.22315483 111 nips-2002-Independent Components Analysis through Product Density Estimation
18 0.22308087 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games
19 0.21751343 128 nips-2002-Learning a Forward Model of a Reflex
20 0.21136712 178 nips-2002-Robust Novelty Detection with Single-Class MPM
topicId topicWeight
[(8, 0.215), (11, 0.013), (23, 0.032), (42, 0.128), (54, 0.124), (55, 0.054), (58, 0.112), (74, 0.07), (92, 0.072), (98, 0.083)]
simIndex simValue paperId paperTitle
same-paper 1 0.81297123 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
Author: Ralf Schoknecht
Abstract: There are several reinforcement learning algorithms that yield approximate solutions for the problem of policy evaluation when the value function is represented with a linear function approximator. In this paper we show that each of the solutions is optimal with respect to a specific objective function. Moreover, we characterise the different solutions as images of the optimal exact value function under different projection operations. The results presented here will be useful for comparing the algorithms in terms of the error they achieve relative to the error of the optimal approximate solution. 1
Author: Matthias O. Franz, Javaan S. Chahl
Abstract: The tangential neurons in the fly brain are sensitive to the typical optic flow patterns generated during self-motion. In this study, we examine whether a simplified linear model of these neurons can be used to estimate self-motion from the optic flow. We present a theory for the construction of an estimator consisting of a linear combination of optic flow vectors that incorporates prior knowledge both about the distance distribution of the environment, and about the noise and self-motion statistics of the sensor. The estimator is tested on a gantry carrying an omnidirectional vision sensor. The experiments show that the proposed approach leads to accurate and robust estimates of rotation rates, whereas translation estimates turn out to be less reliable. 1
3 0.7364791 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation
Author: Ralf Schoknecht, Artur Merke
Abstract: Convergence for iterative reinforcement learning algorithms like TD(0) depends on the sampling strategy for the transitions. However, in practical applications it is convenient to take transition data from arbitrary sources without losing convergence. In this paper we investigate the problem of repeated synchronous updates based on a fixed set of transitions. Our main theorem yields sufficient conditions of convergence for combinations of reinforcement learning algorithms and linear function approximation. This allows to analyse if a certain reinforcement learning algorithm and a certain function approximator are compatible. For the combination of the residual gradient algorithm with grid-based linear interpolation we show that there exists a universal constant learning rate such that the iteration converges independently of the concrete transition data. 1
4 0.69757545 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
Author: Max Welling, Simon Osindero, Geoffrey E. Hinton
Abstract: We propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some filter outputs. We encourage the system to find sparse features by using a Student-t distribution to model each filter output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the “iterated Wiener filter” for the purpose of denoising images.
5 0.68096 161 nips-2002-PAC-Bayes & Margins
Author: John Langford, John Shawe-Taylor
Abstract: unkown-abstract
6 0.66332382 186 nips-2002-Spike Timing-Dependent Plasticity in the Address Domain
7 0.6449132 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
8 0.64463967 138 nips-2002-Manifold Parzen Windows
9 0.64315832 3 nips-2002-A Convergent Form of Approximate Policy Iteration
10 0.64246964 46 nips-2002-Boosting Density Estimation
11 0.63951278 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
12 0.63902444 169 nips-2002-Real-Time Particle Filters
13 0.63020742 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
14 0.62976527 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
15 0.62799227 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation
16 0.62698448 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
17 0.62639743 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
18 0.62579846 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
19 0.62480605 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
20 0.62473011 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems