nips nips2002 nips2002-159 knowledge-graph by maker-knowledge-mining

159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation


Source: pdf

Author: Ralf Schoknecht

Abstract: There are several reinforcement learning algorithms that yield approximate solutions for the problem of policy evaluation when the value function is represented with a linear function approximator. In this paper we show that each of the solutions is optimal with respect to a specific objective function. Moreover, we characterise the different solutions as images of the optimal exact value function under different projection operations. The results presented here will be useful for comparing the algorithms in terms of the error they achieve relative to the error of the optimal approximate solution. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract There are several reinforcement learning algorithms that yield approximate solutions for the problem of policy evaluation when the value function is represented with a linear function approximator. [sent-4, score-0.772]

2 In this paper we show that each of the solutions is optimal with respect to a specific objective function. [sent-5, score-0.443]

3 Moreover, we characterise the different solutions as images of the optimal exact value function under different projection operations. [sent-6, score-0.551]

4 The results presented here will be useful for comparing the algorithms in terms of the error they achieve relative to the error of the optimal approximate solution. [sent-7, score-0.386]

5 1 Introduction In large domains the determination of an optimal value function via a tabular representation is no longer feasible with respect to time and memory considerations. [sent-8, score-0.259]

6 Therefore, reinforcement learning (RL) algorithms are combined with linear function approximation schemes. [sent-9, score-0.196]

7 However, the different RL algorithms, which all achieve the same optimal solution in the tabular case, converge to different solutions when combined with function approximation. [sent-10, score-0.542]

8 One reason is that a characterisation of the different solutions in terms of the objective functions they optimise is partly missing. [sent-14, score-0.381]

9 In this paper we state objective functions for the TD(0) algorithm [9], the LSTD algorithm [4, 3] and the residual gradient algorithm [1] applied to the problem of policy evaluation, i.e. [sent-15, score-0.567]

10 the determination of the value function for a fixed policy. [sent-17, score-0.16]

11 Moreover, we characterise the different solutions as images of the optimal exact value function under different projection operations. [sent-18, score-0.551]

12 We think that an analysis of the different optimisation criteria and the projection operations will be useful for determining the errors that the different algorithms achieve relative to the error of the theoretically optimal approximate solution. [sent-19, score-0.502]

13 This will yield a criterion for selecting an optimal RL algorithm. [sent-20, score-0.18]

14 For the TD(0) algorithm such error bounds with respect to a specific norm are already known [2, 10], but for the other algorithms there are no comparable results. [sent-21, score-0.3]

15 V_i denotes the value of state s_i, p^μ_{ij} = p(s_i, s_j, μ(s_i)), R^μ_i = E{r(s_i, μ(s_i))} and γ is the discount factor. [sent-23, score-0.044]

16 As the policy μ is fixed we will omit it in the following to make notation easier. [sent-24, score-0.316]

17 The fixed point V* of equation (1) can be determined iteratively with an operator T: R^N → R^N by T V^n = V^{n+1} = γ P V^n + R. [sent-25, score-0.145]

18 (2) This iteration converges to a unique fixed point [2], which is given by V* = (I − γP)^{-1} R, (3) where (I − γP) is invertible for every stochastic matrix P. [sent-26, score-0.381]
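
As an illustration of equations (2) and (3), the following minimal sketch (not from the paper; the small random Markov chain and all variable names are made up for illustration) computes V* both by the direct solve and by iterating the operator T:

```python
import numpy as np

# Illustrative sketch: exact policy evaluation on a small random Markov chain,
# both via the direct solve V* = (I - gamma*P)^{-1} R of equation (3) and via
# the fixed-point iteration V^{n+1} = gamma*P*V^n + R of equation (2).
rng = np.random.default_rng(0)
N, gamma = 5, 0.9

P = rng.random((N, N))
P /= P.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
R = rng.random(N)                   # expected immediate rewards

V_star = np.linalg.solve(np.eye(N) - gamma * P, R)

V = np.zeros(N)
for _ in range(500):
    V = gamma * P @ V + R

print(np.max(np.abs(V - V_star)))   # close to zero: both agree
```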

19 3 Approximate Policy Evaluation If the state space S gets too large the exact solution of equation (1) becomes very costly with respect to both memory and computation time. [sent-27, score-0.273]

20 Φ = (φ_1, ..., φ_F) ∈ R^{N×F} is the matrix with the basis functions as columns. [sent-35, score-0.055]

21 3.1 The Optimal Approximate Solution If the transition probability matrix P were known, then the optimal exact solution V* = (I − γP)^{-1} R could be computed directly. [sent-38, score-0.396]

22 The optimal approximation to this solution is obtained by minimising ‖Φw − V*‖ with respect to w. [sent-39, score-0.4]

23 Generally, a symmetric positive definite matrix D can be used to define a norm according to ‖x‖_D = (xᵀ D x)^{1/2}. [sent-41, score-0.154]

24 The optimal solution that can be achieved with the linear function approximator Φw is then the orthogonal projection of V* onto [Φ], i.e. [sent-43, score-0.381]

25 Then the orthogonal projection onto [Φ] according to the norm ‖·‖_D is defined as Π_D = Φ(Φᵀ D Φ)^{-1} Φᵀ D. [sent-47, score-0.268]
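
The projection formula can be checked numerically; the sketch below (an assumed toy example in Python/NumPy, with names chosen for illustration) builds Π_D for a random feature matrix and weighting D and verifies idempotence and D-self-adjointness:

```python
import numpy as np

# Illustrative sketch: the D-weighted orthogonal projection
# Pi_D = Phi (Phi^T D Phi)^{-1} Phi^T D and the optimal approximate
# solution Pi_D V* for a random toy problem.
rng = np.random.default_rng(1)
N, F, gamma = 6, 3, 0.9

Phi = rng.random((N, F))                       # feature matrix (full column rank w.h.p.)
D = np.diag(rng.random(N) + 0.1)               # positive definite diagonal weighting
P = rng.random((N, N)); P /= P.sum(axis=1, keepdims=True)
R = rng.random(N)
V_star = np.linalg.solve(np.eye(N) - gamma * P, R)

Pi_D = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)
V_SL = Pi_D @ V_star                           # optimal approximate ("SL") solution

print(np.allclose(Pi_D @ Pi_D, Pi_D))          # idempotent
print(np.allclose(D @ Pi_D, (D @ Pi_D).T))     # self-adjoint w.r.t. <x, y>_D = x^T D y
```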

26 We denote the optimal approximate solution by V_D^SL = Π_D V*. [sent-48, score-0.302]

27 Here, SL stands for supervised learning because w^SL = argmin_{w ∈ R^F} (1/2)‖Φw − V*‖_D² (4) minimises the weighted quadratic error (1/2)(Φw^SL − V*)ᵀ D (Φw^SL − V*) = (1/2)‖V_D^SL − V*‖_D² (5) for a given D and V*, which is the objective of a supervised learning method. [sent-50, score-0.251]

28 Note that V* equals the expected discounted accumulated reward along a sampled trajectory under the fixed policy μ, i.e. [sent-51, score-0.431]

29 These are exactly the samples obtained by the TD(1) algorithm [9]. [sent-54, score-0.039]

30 Thus, the TD(1) solution is equivalent to the optimal approximate solution. [sent-55, score-0.302]
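
A short sketch of the discounted return that such TD(1)/Monte Carlo samples estimate (an illustrative helper, not from the paper):

```python
# Illustrative helper: the discounted accumulated reward along one sampled
# trajectory; averaging such returns per state gives the targets that the
# supervised/TD(1) approach fits with Phi*w.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.9**2 * 2 = 2.62
```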

31 Let p be a probability distribution on the state space S. [sent-58, score-0.044]

32 Furthermore, let x_i be sampled according to p, z_i be sampled according to P(x_i, ·) and r_i be sampled according to r(x_i). [sent-59, score-0.353]

33 We write E_p[·] to denote the expectation with respect to the distribution p. [sent-61, score-0.038]

34 In particular, the stochastic TD(0) algorithm converges if and only if the deterministic algorithm (9) converges. [sent-66, score-0.271]

35 Furthermore, if both algorithms converge they converge to the same fixed point. [sent-67, score-0.286]

36 An iteration of the form (9) converges if all eigenvalues of the matrix I + α A_p^TD lie within the unit circle [5]. [sent-68, score-0.385]

37 For a matrix A_p^TD that has only eigenvalues with negative real part and a learning rate α_t that decays according to (8), there is a t* such that the eigenvalues of I + α_t A_p^TD lie inside the unit circle for all t > t*. [sent-69, score-0.442]

38 Hence, for a decaying learning rate the deterministic TD(0) algorithm converges if all eigenvalues of A_p^TD have a negative real part. [sent-70, score-0.3]
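
The eigenvalue condition can be checked directly. The sketch below assumes the usual deterministic TD(0) iteration matrix A_p^TD = Φᵀ D_p (γP − I) Φ; this matrix form is inferred from the surrounding text, not quoted from the paper:

```python
import numpy as np

# Sketch of the stability check for deterministic TD(0); the matrix form
# A_TD = Phi^T D_p (gamma*P - I) Phi is an assumption consistent with the text.
rng = np.random.default_rng(2)
N, F, gamma, alpha = 6, 3, 0.9, 0.1

Phi = rng.random((N, F))
P = rng.random((N, N)); P /= P.sum(axis=1, keepdims=True)
p = rng.random(N); p /= p.sum()                # sampling distribution
D_p = np.diag(p)

A_TD = Phi.T @ D_p @ (gamma * P - np.eye(N)) @ Phi
eigs = np.linalg.eigvals(A_TD)
print("all eigenvalues have negative real part:", bool(np.all(eigs.real < 0)))
print("spectral radius of I + alpha*A_TD:",
      np.max(np.abs(np.linalg.eigvals(np.eye(F) + alpha * A_TD))))
```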

39 Since this requirement is not always fulfilled, the TD algorithm may diverge, as shown in [1]. [sent-71, score-0.111]

40 This divergence is due to the positive eigenvalues of A_p^TD [8]. [sent-72, score-0.125]

41 However, under special assumptions convergence of the TD(0) algorithm can be shown [2]. [sent-73, score-0.039]

42 Let the feature matrix Φ ∈ R^{N×F} have full rank, where F ≤ N, i.e. [sent-74, score-0.055]

43 This results in no loss of generality because the linearly dependent columns of Φ can be removed. [...] of [Φ] with dimension larger than zero that minimises ‖V_p^RG − V*‖_{D_p^RG}, because ‖·‖_{D_p^RG} is now only a semi-norm. [sent-77, score-0.069]

44 But for all minimising V_p^RG the Bellman error is the same, i.e. [sent-78, score-0.191]

45 with respect to the Bellman error all the solutions V_p^RG are equivalent [8] (Proposition 1). [sent-80, score-0.26]

46 5 Synopsis of the Different Solutions In Table 1 we give a brief overview of the solutions that the different RL algorithms yield. [sent-82, score-0.25]

47 An SL solution can be computed for arbitrary weighting matrices D_p induced by a sampling distribution p. [sent-83, score-0.158]

48 For the three RL algorithms (TD, LSTD, RG) solvability conditions can be formulated either in terms of the eigenvalues of the iteration matrix A or in terms of the sampling distribution p. [sent-84, score-0.406]

49 The iterative TD(0) algorithm has the most restrictive conditions for solvability, both for the eigenvalues of the iteration matrix A, whose real parts must be smaller than zero, and for the sampling distribution p, which must equal the steady-state distribution π. [sent-85, score-0.417]

50 The LSTD method only requires invertibility of the matrix A_p^LSTD. This is satisfied if Φ has full column rank and if the visitation distribution p samples every state s infinitely often, i.e. [sent-86, score-0.228]

51 In contrast, the residual gradient algorithm converges independently of p and the concrete A_p^RG because all these matrices have eigenvalues with nonpositive real parts. [sent-89, score-0.391]
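
To make the contrast concrete, here is a sketch of the known-model, deterministic versions of the two solutions; the matrix forms are assumptions consistent with the text, not quoted from the paper:

```python
import numpy as np

# Sketch (known-model versions; matrix forms assumed): LSTD solves a linear
# system derived from the fixed-point condition, while the residual gradient
# (RG) solution is the D_p-weighted least-squares minimiser of the Bellman
# residual (I - gamma*P) Phi w - R.
rng = np.random.default_rng(3)
N, F, gamma = 6, 3, 0.9

Phi = rng.random((N, F))
P = rng.random((N, N)); P /= P.sum(axis=1, keepdims=True)
R = rng.random(N)
p = rng.random(N); p /= p.sum()
D_p = np.diag(p)

# LSTD: Phi^T D_p (I - gamma*P) Phi w = Phi^T D_p R
A_lstd = Phi.T @ D_p @ (np.eye(N) - gamma * P) @ Phi
w_lstd = np.linalg.solve(A_lstd, Phi.T @ D_p @ R)

# Residual gradient: weighted normal equations on the Bellman residual
M = (np.eye(N) - gamma * P) @ Phi
w_rg = np.linalg.solve(M.T @ D_p @ M, M.T @ D_p @ R)

print("V_LSTD:", Phi @ w_lstd)
print("V_RG:  ", Phi @ w_rg)
```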

52 All solutions can be characterised as minimising a quadratic optimisation criterion ‖Φw − V*‖_D with a corresponding matrix D. [sent-91, score-0.659]

53 The SL solution optimises the weighted quadratic error (5), RG optimises the weighted Bellman error (19), and both TD and LSTD optimise the quadratic function (14) with weighting matrices D_p^TD and D_p^LSTD respectively. [sent-92, score-0.636]

54 For regular D_p, i.e. p(s) ≠ 0 for all s ∈ S, the solutions V can be characterised as images of the optimal solution V* under different orthogonal projections (optimal, RG) and projections that minimise a semi-norm (TD, LSTD). [sent-95, score-0.637]

55 For singular D_p see the remarks on ambiguous solutions in section 3. [sent-96, score-0.181]

56 Let us finally discuss the case of a quasi-tabular representation of the value function that is obtained for a regular Φ, and let all states be visited infinitely often, i.e. [sent-98, score-0.1]

57 Thus, the optimal solution V* is exactly representable because V* ∈ [Φ]. [sent-102, score-0.25]

58 Moreover, every projection operator Π : R^N → [Φ] reduces to the identity. [sent-103, score-0.138]

59 Therefore, all the projection operators for the different algorithms are equivalent to the identity. [sent-104, score-0.171]

60 Hence, with a quasi-tabular representation all the algorithms converge to the optimal solution V*. [sent-105, score-0.335]
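
A quick numerical sanity check of this quasi-tabular case (an illustrative sketch using the matrix forms assumed in the earlier sketches; Φ is made diagonally dominant so it is invertible):

```python
import numpy as np

# Quick check of the quasi-tabular case: with an invertible (square) Phi the
# LSTD and RG solutions both recover the optimal solution V*.
rng = np.random.default_rng(4)
N, gamma = 5, 0.9

Phi = np.eye(N) + 0.1 * rng.random((N, N))     # diagonally dominant, hence invertible
P = rng.random((N, N)); P /= P.sum(axis=1, keepdims=True)
R = rng.random(N)
p = rng.random(N); p /= p.sum()
D_p = np.diag(p)

V_star = np.linalg.solve(np.eye(N) - gamma * P, R)

A = np.eye(N) - gamma * P
w_lstd = np.linalg.solve(Phi.T @ D_p @ A @ Phi, Phi.T @ D_p @ R)
M = A @ Phi
w_rg = np.linalg.solve(M.T @ D_p @ M, M.T @ D_p @ R)

print(np.allclose(Phi @ w_lstd, V_star), np.allclose(Phi @ w_rg, V_star))
```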

61 4 Conclusions We have presented an analysis of the solutions that are achieved by different reinforcement learning algorithms combined with linear function approximation. [sent-106, score-0.377]

62 The solutions of all the examined algorithms, TD(0), LSTD and the residual gradient algorithm, can be characterised as minimising different corresponding quadratic objective functions. [sent-107, score-0.713]

63 As a consequence, each of the value functions that one of the above algorithms converges to can be interpreted as the image of the optimal exact value function under a corresponding orthogonal projection. [sent-108, score-0.424]

64 In this general framework we have given the first characterisation of the approximate TD(0) solution in terms of the minimisation of a quadratic objective function. [sent-109, score-0.446]

65 This approach allows one to view the TD(0) solution as the exact solution of a projected learning problem. [sent-110, score-0.293]

66 Moreover, we have shown that the residual gradient solution and the optimal approximate solution only differ in the weighting of the error between the exact and the approximate solution. [sent-111, score-0.818]

67 In future research we intend to use the results presented here for determining the errors of the different solutions relative to the optimal approximate solution with respect to a given norm. [sent-112, score-0.521]

68 This will yield a criterion for selecting reinforcement learning algorithms that achieve optimal solution quality. [sent-113, score-0.513]

69 Learning to predict by the methods of temporal differences. [sent-162, score-0.036]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('td', 0.482), ('lstd', 0.225), ('policy', 0.207), ('ij', 0.203), ('solutions', 0.181), ('cp', 0.166), ('rl', 0.16), ('bellman', 0.151), ('minimising', 0.15), ('lid', 0.13), ('tdij', 0.13), ('reinforcement', 0.127), ('eigenvalues', 0.125), ('optimal', 0.11), ('fixed', 0.109), ('zi', 0.105), ('characterised', 0.103), ('projection', 0.102), ('solution', 0.102), ('si', 0.101), ('residual', 0.093), ('rf', 0.09), ('approximate', 0.09), ('exact', 0.089), ('converges', 0.089), ('iidrg', 0.086), ('invertibility', 0.086), ('optimises', 0.086), ('solvability', 0.086), ('ri', 0.083), ('dp', 0.082), ('rg', 0.082), ('quadratic', 0.08), ('xi', 0.077), ('schoknecht', 0.075), ('characterisation', 0.075), ('iteration', 0.071), ('sixteenth', 0.069), ('characterise', 0.069), ('minimises', 0.069), ('algorithms', 0.069), ('orthogonal', 0.067), ('optimise', 0.064), ('infinitely', 0.064), ('iid', 0.064), ('reward', 0.064), ('evaluation', 0.063), ('objective', 0.061), ('norm', 0.06), ('tabular', 0.06), ('sl', 0.06), ('stochastic', 0.057), ('wn', 0.057), ('lj', 0.057), ('weighting', 0.056), ('matrix', 0.055), ('optimisation', 0.055), ('ii', 0.054), ('converge', 0.054), ('specific', 0.053), ('decays', 0.053), ('sampled', 0.051), ('determination', 0.051), ('vj', 0.049), ('deterministic', 0.047), ('ep', 0.046), ('moreover', 0.046), ('circle', 0.045), ('gradient', 0.045), ('state', 0.044), ('error', 0.041), ('iterative', 0.041), ('transition', 0.04), ('rn', 0.04), ('st', 0.04), ('according', 0.039), ('algorithm', 0.039), ('respect', 0.038), ('tsitsiklis', 0.038), ('minimisation', 0.038), ('vit', 0.038), ('ilkd', 0.038), ('ems', 0.038), ('syst', 0.038), ('aip', 0.038), ('lagoudakis', 0.038), ('abi', 0.038), ('fulfilled', 0.038), ('representable', 0.038), ('projections', 0.037), ('temporal', 0.036), ('operator', 0.036), ('regular', 0.036), ('achieve', 0.035), ('yield', 0.035), ('criterion', 0.035), ('bertsekas', 0.034), ('bradtke', 0.034), ('visitation', 0.034), ('diverges', 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation

Author: Ralf Schoknecht

Abstract: There are several reinforcement learning algorithms that yield approximate solutions for the problem of policy evaluation when the value function is represented with a linear function approximator. In this paper we show that each of the solutions is optimal with respect to a specific objective function. Moreover, we characterise the different solutions as images of the optimal exact value function under different projection operations. The results presented here will be useful for comparing the algorithms in terms of the error they achieve relative to the error of the optimal approximate solution. 1

2 0.58004189 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation

Author: Ralf Schoknecht, Artur Merke

Abstract: Convergence for iterative reinforcement learning algorithms like TD(0) depends on the sampling strategy for the transitions. However, in practical applications it is convenient to take transition data from arbitrary sources without losing convergence. In this paper we investigate the problem of repeated synchronous updates based on a fixed set of transitions. Our main theorem yields sufficient conditions of convergence for combinations of reinforcement learning algorithms and linear function approximation. This allows one to analyse if a certain reinforcement learning algorithm and a certain function approximator are compatible. For the combination of the residual gradient algorithm with grid-based linear interpolation we show that there exists a universal constant learning rate such that the iteration converges independently of the concrete transition data. 1

3 0.25603822 3 nips-2002-A Convergent Form of Approximate Policy Iteration

Author: Theodore J. Perkins, Doina Precup

Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.

4 0.24988046 199 nips-2002-Timing and Partial Observability in the Dopamine System

Author: Nathaniel D. Daw, Aaron C. Courville, David S. Touretzky

Abstract: According to a series of influential models, dopamine (DA) neurons signal reward prediction error using a temporal-difference (TD) algorithm. We address a problem not convincingly solved in these accounts: how to maintain a representation of cues that predict delayed consequences. Our new model uses a TD rule grounded in partially observable semi-Markov processes, a formalism that captures two largely neglected features of DA experiments: hidden state and temporal variability. Previous models predicted rewards using a tapped delay line representation of sensory inputs; we replace this with a more active process of inference about the underlying state of the world. The DA system can then learn to map these inferred states to reward predictions using TD. The new model can explain previously vexing data on the responses of DA neurons in the face of temporal variability. By combining statistical model-based learning with a physiologically grounded TD theory, it also brings into contact with physiology some insights about behavior that had previously been confined to more abstract psychological models.

5 0.13012727 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming

Author: Benjamin V. Roy, Daniela D. Farias

Abstract: This paper extends our earlier analysis on approximate linear programming as an approach to approximating the cost-to-go function in a discounted-cost dynamic program [6]. In this paper, we consider the average-cost criterion and a version of approximate linear programming that generates approximations to the optimal average cost and differential cost function. We demonstrate that a naive version of approximate linear programming prioritizes approximation of the optimal average cost and that this may not be well-aligned with the objective of deriving a policy with low average cost. For that, the algorithm should aim at producing a good approximation of the differential cost function. We propose a twophase variant of approximate linear programming that allows for external control of the relative accuracy of the approximation of the differential cost function over different portions of the state space via state-relevance weights. Performance bounds suggest that the new algorithm is compatible with the objective of optimizing performance and provide guidance on appropriate choices for state-relevance weights.

6 0.12590322 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach

7 0.12584136 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum

8 0.10137776 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics

9 0.10051455 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

10 0.095857725 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs

11 0.08729165 20 nips-2002-Adaptive Caching by Refetching

12 0.083113819 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions

13 0.077416591 115 nips-2002-Informed Projections

14 0.076181568 134 nips-2002-Learning to Take Concurrent Actions

15 0.075987458 205 nips-2002-Value-Directed Compression of POMDPs

16 0.074513875 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games

17 0.071980663 77 nips-2002-Effective Dimension and Generalization of Kernel Learning

18 0.07187824 165 nips-2002-Ranking with Large Margin Principle: Two Approaches

19 0.067021742 111 nips-2002-Independent Components Analysis through Product Density Estimation

20 0.063577987 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.206), (1, -0.051), (2, -0.37), (3, -0.135), (4, -0.026), (5, -0.161), (6, 0.062), (7, -0.049), (8, 0.025), (9, 0.008), (10, 0.029), (11, -0.359), (12, -0.148), (13, 0.342), (14, 0.136), (15, -0.064), (16, -0.067), (17, 0.218), (18, -0.156), (19, -0.113), (20, 0.062), (21, 0.136), (22, 0.179), (23, 0.024), (24, 0.117), (25, 0.04), (26, 0.056), (27, -0.011), (28, -0.044), (29, 0.015), (30, -0.071), (31, -0.033), (32, -0.042), (33, 0.018), (34, -0.068), (35, 0.08), (36, 0.058), (37, -0.036), (38, 0.027), (39, 0.055), (40, 0.011), (41, -0.052), (42, -0.026), (43, -0.061), (44, 0.012), (45, 0.024), (46, -0.017), (47, 0.021), (48, -0.024), (49, 0.07)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96998233 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation

Author: Ralf Schoknecht

Abstract: There are several reinforcement learning algorithms that yield approximate solutions for the problem of policy evaluation when the value function is represented with a linear function approximator. In this paper we show that each of the solutions is optimal with respect to a specific objective function. Moreover, we characterise the different solutions as images of the optimal exact value function under different projection operations. The results presented here will be useful for comparing the algorithms in terms of the error they achieve relative to the error of the optimal approximate solution. 1

2 0.96768504 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation

Author: Ralf Schoknecht, Artur Merke

Abstract: Convergence for iterative reinforcement learning algorithms like TD(0) depends on the sampling strategy for the transitions. However, in practical applications it is convenient to take transition data from arbitrary sources without losing convergence. In this paper we investigate the problem of repeated synchronous updates based on a fixed set of transitions. Our main theorem yields sufficient conditions of convergence for combinations of reinforcement learning algorithms and linear function approximation. This allows one to analyse if a certain reinforcement learning algorithm and a certain function approximator are compatible. For the combination of the residual gradient algorithm with grid-based linear interpolation we show that there exists a universal constant learning rate such that the iteration converges independently of the concrete transition data. 1

3 0.61224961 199 nips-2002-Timing and Partial Observability in the Dopamine System

Author: Nathaniel D. Daw, Aaron C. Courville, David S. Touretzky

Abstract: According to a series of influential models, dopamine (DA) neurons signal reward prediction error using a temporal-difference (TD) algorithm. We address a problem not convincingly solved in these accounts: how to maintain a representation of cues that predict delayed consequences. Our new model uses a TD rule grounded in partially observable semi-Markov processes, a formalism that captures two largely neglected features of DA experiments: hidden state and temporal variability. Previous models predicted rewards using a tapped delay line representation of sensory inputs; we replace this with a more active process of inference about the underlying state of the world. The DA system can then learn to map these inferred states to reward predictions using TD. The new model can explain previously vexing data on the responses of DA neurons in the face of temporal variability. By combining statistical model-based learning with a physiologically grounded TD theory, it also brings into contact with physiology some insights about behavior that had previously been confined to more abstract psychological models.

4 0.55075771 3 nips-2002-A Convergent Form of Approximate Policy Iteration

Author: Theodore J. Perkins, Doina Precup

Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.

5 0.43624917 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming

Author: Benjamin V. Roy, Daniela D. Farias

Abstract: This paper extends our earlier analysis on approximate linear programming as an approach to approximating the cost-to-go function in a discounted-cost dynamic program [6]. In this paper, we consider the average-cost criterion and a version of approximate linear programming that generates approximations to the optimal average cost and differential cost function. We demonstrate that a naive version of approximate linear programming prioritizes approximation of the optimal average cost and that this may not be well-aligned with the objective of deriving a policy with low average cost. For that, the algorithm should aim at producing a good approximation of the differential cost function. We propose a twophase variant of approximate linear programming that allows for external control of the relative accuracy of the approximation of the differential cost function over different portions of the state space via state-relevance weights. Performance bounds suggest that the new algorithm is compatible with the objective of optimizing performance and provide guidance on appropriate choices for state-relevance weights.

6 0.35534951 179 nips-2002-Scaling of Probability-Based Optimization Algorithms

7 0.35220486 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum

8 0.31906396 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

9 0.31152093 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics

10 0.2900213 205 nips-2002-Value-Directed Compression of POMDPs

11 0.27724478 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach

12 0.27231655 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions

13 0.25260118 20 nips-2002-Adaptive Caching by Refetching

14 0.23935162 115 nips-2002-Informed Projections

15 0.23627436 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing

16 0.2276625 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs

17 0.22315483 111 nips-2002-Independent Components Analysis through Product Density Estimation

18 0.22308087 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games

19 0.21751343 128 nips-2002-Learning a Forward Model of a Reflex

20 0.21136712 178 nips-2002-Robust Novelty Detection with Single-Class MPM


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.215), (11, 0.013), (23, 0.032), (42, 0.128), (54, 0.124), (55, 0.054), (58, 0.112), (74, 0.07), (92, 0.072), (98, 0.083)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81297123 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation

Author: Ralf Schoknecht

Abstract: There are several reinforcement learning algorithms that yield approximate solutions for the problem of policy evaluation when the value function is represented with a linear function approximator. In this paper we show that each of the solutions is optimal with respect to a specific objective function. Moreover, we characterise the different solutions as images of the optimal exact value function under different projection operations. The results presented here will be useful for comparing the algorithms in terms of the error they achieve relative to the error of the optimal approximate solution. 1

2 0.78931206 136 nips-2002-Linear Combinations of Optic Flow Vectors for Estimating Self-Motion - a Real-World Test of a Neural Model

Author: Matthias O. Franz, Javaan S. Chahl

Abstract: The tangential neurons in the fly brain are sensitive to the typical optic flow patterns generated during self-motion. In this study, we examine whether a simplified linear model of these neurons can be used to estimate self-motion from the optic flow. We present a theory for the construction of an estimator consisting of a linear combination of optic flow vectors that incorporates prior knowledge both about the distance distribution of the environment, and about the noise and self-motion statistics of the sensor. The estimator is tested on a gantry carrying an omnidirectional vision sensor. The experiments show that the proposed approach leads to accurate and robust estimates of rotation rates, whereas translation estimates turn out to be less reliable. 1

3 0.7364791 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation

Author: Ralf Schoknecht, Artur Merke

Abstract: Convergence for iterative reinforcement learning algorithms like TD(0) depends on the sampling strategy for the transitions. However, in practical applications it is convenient to take transition data from arbitrary sources without losing convergence. In this paper we investigate the problem of repeated synchronous updates based on a fixed set of transitions. Our main theorem yields sufficient conditions of convergence for combinations of reinforcement learning algorithms and linear function approximation. This allows one to analyse if a certain reinforcement learning algorithm and a certain function approximator are compatible. For the combination of the residual gradient algorithm with grid-based linear interpolation we show that there exists a universal constant learning rate such that the iteration converges independently of the concrete transition data. 1

4 0.69757545 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

Author: Max Welling, Simon Osindero, Geoffrey E. Hinton

Abstract: We propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some filter outputs. We encourage the system to find sparse features by using a Studentt distribution to model each filter output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the “iterated Wiener filter” for the purpose of denoising images.

5 0.68096 161 nips-2002-PAC-Bayes & Margins

Author: John Langford, John Shawe-Taylor

Abstract: unkown-abstract

6 0.66332382 186 nips-2002-Spike Timing-Dependent Plasticity in the Address Domain

7 0.6449132 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

8 0.64463967 138 nips-2002-Manifold Parzen Windows

9 0.64315832 3 nips-2002-A Convergent Form of Approximate Policy Iteration

10 0.64246964 46 nips-2002-Boosting Density Estimation

11 0.63951278 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

12 0.63902444 169 nips-2002-Real-Time Particle Filters

13 0.63020742 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs

14 0.62976527 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

15 0.62799227 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation

16 0.62698448 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

17 0.62639743 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction

18 0.62579846 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

19 0.62480605 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

20 0.62473011 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems