nips nips2005 nips2005-199 knowledge-graph by maker-knowledge-mining

199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions


Source: pdf

Author: Sridhar Mahadevan, Mauro Maggioni

Abstract: We investigate the problem of automatically constructing efficient representations or basis functions for approximating value functions based on analyzing the structure and topology of the state space. In particular, two novel approaches to value function approximation are explored based on automatically constructing basis functions on state spaces that can be represented as graphs or manifolds: one approach uses the eigenfunctions of the Laplacian, in effect performing a global Fourier analysis on the graph; the second approach is based on diffusion wavelets, which generalize classical wavelets to graphs using multiscale dilations induced by powers of a diffusion operator or random walk on the graph. Together, these approaches form the foundation of a new generation of methods for solving large Markov decision processes, in which the underlying representation and policies are simultaneously learned.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We investigate the problem of automatically constructing efficient representations or basis functions for approximating value functions based on analyzing the structure and topology of the state space. [sent-5, score-0.539]

2 Together, these approaches form the foundation of a new generation of methods for solving large Markov decision processes, in which the underlying representation and policies are simultaneously learned. [sent-7, score-0.067]

3 This approach also yields new control learning algorithms called representation policy iteration (RPI) where both the underlying representations (basis functions) and policies are simultaneously learned. [sent-11, score-0.281]

4 Laplacian eigenfunctions also provide ways of automatically decomposing state spaces since they reflect bottlenecks and other global geometric invariants. [sent-12, score-0.568]

5 In this paper, we extend the earlier Laplacian approach in a new direction using the recently proposed diffusion wavelet transform (DWT), which is a compact multi-level representation of Markov diffusion processes on manifolds and graphs [4, 2]. [sent-13, score-1.188]

6 2 Technical Background This paper uses the framework of spectral graph theory [3] to build basis representations for smooth (value) functions on graphs induced by Markov decision processes. [sent-15, score-0.535]

7 Given any graph G, an obvious but poor choice of representation is the “table-lookup” orthonormal encoding, where φ(i) = [0 … 1 … 0], with the single 1 in the i-th position. [sent-16, score-0.232]

8 This representation does not reflect the topology of the specific graph under consideration. [sent-23, score-0.139]

9 Polynomials are another popular choice of orthonormal basis functions [5], where φ(s) = [1 s s^2 … s^k]. [sent-24, score-0.389]

10 This encoding has two disadvantages: it is numerically unstable for large graphs, and is dependent on the ordering of vertices. [sent-28, score-0.054]

11 In this paper, we outline a new approach to the problem of building basis functions on graphs using Laplacian eigenfunctions and diffusion wavelets. [sent-29, score-1.126]

12 A state value function is a mapping S → R, or equivalently a vector in R^{|S|}. [sent-31, score-0.088]

13 Given a policy π : S → A mapping states to actions, its corresponding value function V π specifies the expected long-term discounted sum of rewards received by the agent in any given state s when actions are chosen using the policy. [sent-32, score-0.353]
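For reference, the definition just stated can be written compactly as follows (a standard Bellman-style statement of V^π; the discount factor γ, reward function r, and transition probabilities P are the usual MDP symbols, not notation taken from this summary):

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0}=s\right]
\;=\; r\bigl(s,\pi(s)\bigr) + \gamma \sum_{s'} P\bigl(s' \mid s,\pi(s)\bigr)\, V^{\pi}(s').
```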

14 In ergodic MDPs, the set of transient states is empty. [sent-34, score-0.045]

15 The construction of basis functions below assumes that the Markov chain induced by a policy is a reversible random walk on the state space. [sent-35, score-0.791]

16 While some policies may not induce such Markov chains, the set of basis functions learned from a reversible random walk can still be useful in approximating value functions for (reversible or non-reversible) policies. [sent-36, score-0.671]

17 Reversible random walks greatly simplify spectral analysis since such random walks are similar to a symmetric operator on the state space. [sent-38, score-0.092]

18 1 Smooth Functions on Graphs and Value Function Representation We assume the state space can be modeled as a finite undirected weighted graph (G, E, W ), but the approach generalizes to Riemannian manifolds. [sent-40, score-0.182]

19 D will denote the diagonal matrix defined by Dxx = d(x), and W the matrix defined by Wxy = w(x, y) = w(y, x). [sent-42, score-0.054]

20 The L2 norm of a function on G is ||f||_2^2 = Σ_{x∈G} |f(x)|^2 d(x). [sent-43, score-0.026]

21 The smoothness of a function on a graph can be measured by the Sobolev norm ||f||_{H^2}^2 = ||f||_2^2 + ||∇f||_2^2 = Σ_x |f(x)|^2 d(x) + Σ_{x∼y} |f(x) − f(y)|^2 w(x, y). [sent-45, score-0.065]

22 We will assume that the value functions we consider have small H2 norms, except at a few points, where the gradient may be large. [sent-48, score-0.129]

23 Classical techniques, such as value iteration and policy iteration [10], represent value functions using an orthonormal basis (e_1, …, e_{|S|}). [sent-50, score-0.699]

24 For a fixed precision ε, a value function V^π can be approximated as ||V^π − Σ_{i∈S(ε)} α_i^π e_i|| ≤ ε with α_i^π = ⟨V^π, e_i⟩, since the e_i's are orthonormal, and the approximation is measured in some norm, such as L2 or H^2. [sent-54, score-0.248]

25 The goal is to obtain representations in which the index set S(ε) in the summation is as small as possible, for a given approximation error ε. [sent-55, score-0.041]

26 This hope is well founded at least when V π is smooth or piecewise smooth, since in this case it should be compressible in some well chosen basis {ei }. [sent-56, score-0.27]

27 This Laplacian is related to the notion of smoothness as above, since ⟨f, Lf⟩ = Σ_x f(x) Lf(x) = Σ_{x∼y} w(x, y)(f(x) − f(y))^2 = ||∇f||_2^2, which should be compared with (1). [sent-59, score-0.039]

28 The Laplacian has a set of eigenvalues {λ_i}_{i≥0} and a corresponding orthonormal basis of eigenfunctions {ξ_i}_{i≥0}, solutions to the eigenvalue problem Lξ_i = λ_i ξ_i. [sent-67, score-0.654]

29 The eigenfunctions of the Laplacian can be viewed as an orthonormal basis of global Fourier smooth functions that can be used for approximating any value function on a graph. [sent-68, score-0.888]
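To make this global Fourier view concrete, here is a minimal Python sketch (not taken from the paper; the grid size, the choice of the normalized Laplacian, and the random target function are illustrative assumptions) that builds a small grid graph, computes Laplacian eigenfunctions, and projects a function onto the k smoothest of them:

```python
import numpy as np

def grid_adjacency(rows, cols):
    """Unit-weight adjacency matrix of a rows x cols grid graph."""
    n = rows * cols
    W = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:              # right neighbor
                W[i, i + 1] = W[i + 1, i] = 1.0
            if r + 1 < rows:              # down neighbor
                W[i, i + cols] = W[i + cols, i] = 1.0
    return W

W = grid_adjacency(10, 10)
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)               # ascending: smoothest eigenfunctions first

# Project an arbitrary (made-up) function onto the k smoothest eigenfunctions.
V = np.random.randn(len(d))
k = 10
Phi = eigvecs[:, :k]
V_hat = Phi @ (Phi.T @ V)                          # least-squares approximation
print("residual norm:", np.linalg.norm(V - V_hat))
```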

30 These basis functions capture large-scale features of the state space, and are particularly sensitive to “bottlenecks”, a phenomenon widely studied in Riemannian geometry and spectral graph theory [3]. [sent-69, score-0.488]

31 In fact, the variational characterization of eigenvectors shows that ξ_i is the smoothest normalized function orthogonal to ξ_0, …, ξ_{i−1}. [sent-71, score-0.032]

32 Hence the projection of a function f on S onto the top k eigenvectors of the Laplacian is the smoothest approximation to f , in the sense of the norm in H2 . [sent-75, score-0.067]

33 A potential drawback of Laplacian approximation is that it detects only global smoothness, and may poorly approximate a function which is not globally smooth but only piecewise smooth, or with different smoothness in different regions. [sent-76, score-0.203]

34 These drawbacks are addressed in the context of analysis with diffusion wavelets, and in fact partly motivated their construction. [sent-77, score-0.413]

35 Compute sparse factorization Hj ∼ Qj Rj , with Qj orthogonal. [sent-80, score-0.06]

36 Compute sparse factorization I − Φ_{j+1} Φ_{j+1}^* ∼ Q_j R_j, with Q_j orthogonal. [sent-84, score-0.06]

37 Space constraints permit only a brief description of the construction of diffusion wavelet trees. [sent-89, score-0.673]

38 The input to the algorithm is a “precision” parameter ε > 0, and a weighted graph (G, E, W). [sent-91, score-0.1]

39 The construction is based on using the natural random walk P = D−1 W on a graph and its powers to “dilate”, or “diffuse” functions on the graph, and then defining an associated coarse-graining of the graph. [sent-93, score-0.384]

40 We symmetrize P by conjugation and take powers to obtain H^t = D^{1/2} P^t D^{−1/2} = (D^{−1/2} W D^{−1/2})^t = (I − L)^t = Σ_{i≥0} (1 − λ_i)^t ξ_i(·) ξ_i(·)   (2), where {λ_i} and {ξ_i} are the eigenvalues and eigenfunctions of the Laplacian as above. [sent-94, score-0.44]

41 Hence the eigenfunctions of H^t are again ξ_i, and the i-th eigenvalue is (1 − λ_i)^t. [sent-95, score-0.362]

42 We assume that H^1 is a sparse matrix, and that the spectrum of H^1 has rapid decay. [sent-96, score-0.028]

43 The scaling functions Φ_j span a subspace V_j, with the property that V_{j+1} ⊆ V_j, and the span of Ψ_j, denoted W_j, is the orthogonal complement of V_{j+1} in V_j. [sent-98, score-0.151]

44 In matrix form, this is a sparse factorization H0 ∼ Q0 R0 , with Q0 orthonormal. [sent-101, score-0.087]

45 The columns of Q_0 are an orthonormal basis of scaling functions Φ_1 for the range of H_0, written as a linear combination of the initial basis Φ_0. [sent-104, score-0.61]

46 We can now write H_0^2 on the basis Φ_1: H_1 := [H^2]_{Φ_1}^{Φ_1} = Q_0^* H_0 H_0 Q_0 = R_0 R_0^*, where we used H_0^* = H_0. [sent-105, score-0.199]

47 This is a compressed representation of H_0^2 acting on the range of H_0, and it is a |G^{(1)}| × |G^{(1)}| matrix. [sent-106, score-0.039]

48 We proceed by induction: at scale j we have an orthonormal basis Φ_j for the range of H^{2^j − 1} up to precision jε, represented as a linear combination of elements in Φ_{j−1}. [sent-107, score-0.335]

49 This basis contains |G^{(j)}| functions, where |G^{(j)}| is comparable with the number of eigenvalues λ_i of H_0 such that λ_i^{2^j − 1} ≥ ε. [sent-108, score-0.199]

50 We have the operator H_0^{2^j} represented on Φ_j by a |G^{(j)}| × |G^{(j)}| matrix H_j, up to precision jε. [sent-109, score-0.07]

51 We compute a sparse decomposition H_j ∼ Q_j R_j, obtain the next basis Φ_{j+1} = Q_j = H_j R_j^{−1}, and represent H^{2^{j+1}} on this basis by the matrix H_{j+1} := [H^{2^{j+1}}]_{Φ_{j+1}}^{Φ_{j+1}} = Q_j^* H_j H_j Q_j = R_j R_j^*. [sent-110, score-0.453]
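A hedged sketch of this compression step is given below (illustrative only, not the authors' implementation: dense NumPy matrices, column-pivoted QR as the rank-revealing factorization, and a randomly generated weight matrix are all assumptions made here):

```python
import numpy as np
from scipy.linalg import qr

def diffusion_operator(W):
    """Symmetrized random-walk operator H = D^{-1/2} W D^{-1/2} = I - L."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ W @ D_inv_sqrt

def compress_level(H_j, eps):
    # Column-pivoted QR as a simple rank-revealing factorization H_j ~ Q_j R_j.
    Q, R, piv = qr(H_j, pivoting=True)
    keep = np.abs(np.diag(R)) > eps        # columns retained at this precision
    k = int(keep.sum())
    Q_j, R_j = Q[:, :k], R[:k, :]
    # For symmetric H_j, Q^T H_j^2 Q = R R^T exactly before truncation, so
    # H_next approximately represents H^(2^(j+1)) on the coarser basis Q_j.
    H_next = R_j @ R_j.T
    return Q_j, H_next

# Example on a random symmetric weight matrix (an assumption, not paper data).
rng = np.random.default_rng(0)
A = (rng.random((60, 60)) < 0.1).astype(float)
W = np.maximum(A, A.T) + np.eye(60)        # symmetric weights with self-loops
H = diffusion_operator(W)

bases = []                                  # Q_j factors, one per level
for j in range(4):
    Q_j, H = compress_level(H, eps=1e-4)
    bases.append(Q_j)
    print(f"level {j}: {Q_j.shape[1]} scaling functions")
```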

52 Wavelet bases for the spaces W_j can be built analogously by factorizing I_{V_j} − Q_{j+1} Q_{j+1}^*, which is the orthogonal projection on the complement of V_{j+1} in V_j. [sent-111, score-0.111]

53 The spaces can be further split to obtain wavelet packets [2]. [sent-112, score-0.302]

54 A Fast Diffusion Wavelet Transform allows expanding any function in the wavelet, or wavelet packet, basis in O(n) computations (where n is the number of vertices), and efficiently searching for the most suitable basis set. [sent-113, score-0.425]

55 Diffusion wavelets and wavelet packets are a very efficient tool for representation and approximation of functions on manifolds and graphs [4, 2], generalizing to these general spaces the nice properties of wavelets that have been so successfully applied to similar tasks in Euclidean spaces. [sent-114, score-1.218]

56 Diffusion wavelets allow computing H^{2^k} f for any fixed f, in order O(kn). [sent-115, score-0.309]

57 This is nontrivial because, while the matrix H is sparse, large powers of it are not, and computing H · H ⋯ H f by direct matrix multiplication would be far more expensive. [sent-116, score-0.105]

58 This is remarkable since in general the matrix (I − H^1)^{−1} is full, and merely writing down its entries would take time O(n^2). [sent-126, score-0.027]

59 It is the multiscale compression scheme that allows one to represent (I − H^1)^{−1} efficiently in compressed form, taking advantage of the smoothness of the entries of the matrix. [sent-127, score-0.091]

60 We use this approach to develop a faster policy evaluation step for solving MDPs, described in [6]. Experiments: Figure 2 contrasts Laplacian eigenfunctions and diffusion wavelet basis functions in a three room grid world environment. [sent-129, score-1.642]

61 Laplacian eigenfunctions were produced by solving Lf = λf , where L is the combinatorial Laplacian, whereas diffusion wavelet basis functions were produced using the algorithm described in Figure 1. [sent-130, score-1.391]

62 The input to both methods is an undirected graph, where edges connect states reachable through a single (reversible) action. [sent-131, score-0.071]

63 Such graphs can be easily learned from a sample of transitions, such as that generated by RL agents while exploring the environment in early phases of policy learning. [sent-132, score-0.262]
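A minimal sketch of this graph-construction step is shown below (the toy transition sample and the unit edge weights are illustrative assumptions; weighted or nearest-neighbor variants would work equally well):

```python
import numpy as np

def graph_from_transitions(transitions, num_states):
    """Undirected, unit-weight adjacency matrix from (s, a, s_next) samples."""
    W = np.zeros((num_states, num_states))
    for s, _a, s_next in transitions:
        if s != s_next:
            W[s, s_next] = 1.0
            W[s_next, s] = 1.0             # symmetrize: treat edges as reversible
    return W

# Toy random walk on a 5-state chain to stand in for the agent's exploration.
rng = np.random.default_rng(1)
sample, s = [], 0
for _ in range(200):
    a = int(rng.choice([-1, 1]))
    s_next = int(np.clip(s + a, 0, 4))
    sample.append((s, a, s_next))
    s = s_next

W = graph_from_transitions(sample, num_states=5)
print(W)
```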

64 Figure 3 compares the approximations produced in a two-room grid world MDP with 630 states. [sent-135, score-0.158]

65 The eigenfunctions contain many ripples in the flat region, causing a large residual error. [sent-138, score-0.362]

66 In the second experiment (bottom row), Laplacian eigenfunctions work significantly better because the value function is globally smooth. [sent-139, score-0.419]

67 Even here, the superiority of diffusion wavelets is clear. [sent-140, score-0.749]

68 Figure 3: Left column: value functions in a two room grid world MDP, where each room has 21 × 15 states connected by a door in the middle of the common wall. [sent-144, score-0.39]

69 Middle two columns: approximations produced by 5 diffusion wavelet bases and Laplacian eigenfunctions. [sent-145, score-0.726]

70 Right column: least-squares approximation error (log scale) using up to 200 basis functions (bottom curve: diffusion wavelets; top curve: Laplacian eigenfunctions). [sent-146, score-0.75]

71 In the top row, the value function corresponds to a random walk. [sent-147, score-0.032]

72 In the bottom row, the value function corresponds to the optimal policy. [sent-148, score-0.032]

73 Control Learning using Representation Policy Iteration: This section describes results of using the automatically generated basis functions inside a control learning algorithm, in particular the Representation Policy Iteration (RPI) algorithm [8]. [sent-150, score-0.324]

74 RPI is an approximate policy iteration algorithm in which the basis functions φ(s, a), handcoded in other methods such as LSPI [5], are learned from a random walk of transitions by computing the graph Laplacian and then the eigenfunctions or diffusion wavelet bases as described above. [sent-151, score-1.862]

75 One striking property of the eigenfunction and diffusion wavelet basis functions is their ability to reflect nonlinearities arising from “bottlenecks” in the state space. [sent-152, score-1.02]

76 Figure 4 contrasts the value function approximation produced by RPI using Laplacian eigenfunctions with that produced by a polynomial approximator. [sent-153, score-0.599]

77 The polynomial approximator yields a value function that is “blind” to the nonlinearities produced by the walls in the two room grid world MDP. [sent-154, score-0.368]

78 Figure 4: This figure compares the value functions produced by RPI using Laplacian eigenfunctions with that produced by LSPI using a polynomial approximator in a two room grid world MDP with a “bottleneck” region representing the door connecting the two rooms. [sent-155, score-0.871]

79 The Laplacian basis functions on the left clearly capture the nonlinearity arising from the bottleneck, whereas the polynomial approximator on the right smooths the value function across the walls as it is “blind” to the large-scale geometry of the environment. [sent-156, score-0.452]

80 Table 1 compares the performance of diffusion wavelets and Laplacian eigenfunctions using RPI on the classic chain MDP from [5]. [sent-157, score-1.173]

81 Here, an initial random walk of 5000 steps was carried out to generate the basis functions in a 50 state chain. [sent-158, score-0.427]

82 The chain MDP is a sequential open (or closed) chain with a varying number of states, where there are two actions for moving left or right along the chain. [sent-159, score-0.158]

83 In the experiments shown, a reward of 1 was provided in states 10 and 41. [sent-160, score-0.071]

84 Given a fixed k, the encoding φ(s) of a state s for Laplacian eigenfunctions is the vector comprising the values of the k lowest-order eigenfunctions evaluated at state s. [sent-161, score-0.866]

85 For diffusion wavelets, all the basis functions at level k were evaluated at state s to produce the encoding. [sent-162, score-0.765]
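The two encodings just described might be assembled roughly as follows (an illustrative sketch, not the authors' code: the block-per-action layout of φ(s, a) is a common LSPI-style convention assumed here, and `eigvecs` and `bases` are assumed to come from sketches like the ones above):

```python
import numpy as np

def laplacian_encoding(eigvecs, s, k):
    """phi(s): values of the k lowest-order Laplacian eigenfunctions at state s."""
    return eigvecs[s, :k]

def state_action_features(eigvecs, s, a, k, num_actions):
    """phi(s, a): copy phi(s) into the block corresponding to action a."""
    phi = np.zeros(k * num_actions)
    phi[a * k:(a + 1) * k] = laplacian_encoding(eigvecs, s, k)
    return phi

def wavelet_encoding(bases, s, level):
    """phi(s) for diffusion wavelets: all scaling functions at one level
    (levels indexed from 0 here), expressed on the original states as the
    product of the Q_j factors up to that level, then evaluated at state s."""
    Phi = bases[0]
    for Q in bases[1:level + 1]:
        Phi = Phi @ Q
    return Phi[s, :]
```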

86 Table 1: This table compares the performance of RPI using diffusion wavelets and Laplacian eigenfunctions with LSPI using handcoded polynomial and radial basis functions on a 50 state chain graph MDP. [sent-181, score-1.779]

87 Each row reflects the performance of either RPI using learned basis functions or LSPI with a handcoded basis function (values in parentheses indicate the number of basis functions used for each architecture). [sent-182, score-0.927]

88 The two numbers reported are steps to convergence and the error in the learned policy (number of incorrect actions), averaged over 5 runs. [sent-183, score-0.207]

89 Laplacian and diffusion wavelet basis functions provide a more stable performance at both the low end and at the higher end, as compared to the handcoded basis functions. [sent-184, score-1.245]

90 As the number of basis functions is increased, RPI with Laplacian basis functions takes longer to converge, but learns a more accurate policy. [sent-185, score-0.592]

91 Diffusion wavelets converge more slowly as the number of basis functions is increased, giving the best results overall with 19 basis functions. [sent-186, score-0.804]

92 Unlike Laplacian eigenfunctions, the policy error is not monotonically decreasing as the number of basis functions is increased. [sent-187, score-0.319]

93 LSPI with RBF is unstable at the low end, converging to a very poor policy for 6 basis functions. [sent-189, score-0.439]

94 LSPI with a degree-5 polynomial approximator works reasonably well, but its performance noticeably degrades at higher degrees, converging to a very poor policy in one step for k = 15 and k = 25. [sent-190, score-0.311]

95 Future Work: We are exploring many extensions of this framework, including extensions to factored MDPs, approximating action value functions, and handling large state spaces by exploiting symmetries defined by a group of automorphisms of the graph. [sent-191, score-0.254]

96 These enhancements will facilitate efficient construction of eigenfunctions and diffusion wavelets. [sent-192, score-0.809]

97 For large state spaces, one can randomly subsample the graph, construct the eigenfunctions of the Laplacian or the diffusion wavelets on the subgraph, and then interpolate these functions using the Nyström approximation and related low-rank linear algebraic methods. [sent-193, score-1.278]
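A hedged sketch of this subsample-and-extend idea is given below (assumptions made here: a Gaussian similarity graph over made-up sampled states, and the standard Nyström extension applied to the weight matrix, whose leading eigenvectors correspond to the smoothest diffusion eigenfunctions):

```python
import numpy as np

def nystrom_eigenfunctions(W, landmarks, k):
    """Approximate top-k eigenvectors of W from its rows/columns at `landmarks`."""
    W_ll = W[np.ix_(landmarks, landmarks)]      # landmark-landmark block
    W_al = W[:, landmarks]                      # all-states x landmarks block
    vals, vecs = np.linalg.eigh(W_ll)
    order = np.argsort(vals)[::-1][:k]          # top-k eigenpairs of the small block
    vals, vecs = vals[order], vecs[:, order]
    return (W_al @ vecs) / vals                 # Nyström extension to every state

rng = np.random.default_rng(2)
X = rng.random((300, 2))                        # hypothetical sampled states
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / 0.05)                          # Gaussian similarity graph

landmarks = rng.choice(300, size=40, replace=False)
U = nystrom_eigenfunctions(W, landmarks, k=5)
print(U.shape)                                  # (300, 5) interpolated eigenfunctions
```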

98 In experiments on the classic inverted pendulum control task, the Nyström approximation yielded excellent results compared to radial basis functions, learning a more stable policy with a smaller number of samples. [sent-194, score-0.422]

99 Fast direct policy evaluation using multiscale Markov Diffusion Processes. [sent-238, score-0.234]

100 Samuel meets Amarel: Automating value function approximation using global state space analysis. [sent-252, score-0.156]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('diffusion', 0.413), ('eigenfunctions', 0.362), ('wavelets', 0.309), ('laplacian', 0.287), ('rpi', 0.278), ('wavelet', 0.226), ('lspi', 0.222), ('basis', 0.199), ('policy', 0.182), ('rj', 0.123), ('handcoded', 0.111), ('qj', 0.105), ('graph', 0.1), ('functions', 0.097), ('orthonormal', 0.093), ('hj', 0.089), ('reversible', 0.088), ('mdp', 0.084), ('powers', 0.078), ('walk', 0.075), ('lf', 0.073), ('vj', 0.065), ('chain', 0.06), ('state', 0.056), ('bottlenecks', 0.056), ('lap', 0.056), ('maggioni', 0.056), ('graphs', 0.055), ('room', 0.054), ('markov', 0.053), ('multiscale', 0.052), ('approximator', 0.052), ('pss', 0.048), ('smooth', 0.048), ('produced', 0.047), ('grid', 0.047), ('states', 0.045), ('riemannian', 0.044), ('rss', 0.044), ('ei', 0.044), ('precision', 0.043), ('polynomial', 0.043), ('manifolds', 0.042), ('poly', 0.041), ('approximation', 0.041), ('bases', 0.04), ('smoothness', 0.039), ('representation', 0.039), ('yale', 0.039), ('spaces', 0.039), ('actions', 0.038), ('dilations', 0.037), ('packets', 0.037), ('vfa', 0.037), ('spectral', 0.036), ('rbf', 0.035), ('df', 0.035), ('mdps', 0.035), ('world', 0.035), ('construction', 0.034), ('converging', 0.034), ('orthogonal', 0.032), ('coifman', 0.032), ('eig', 0.032), ('bump', 0.032), ('iteration', 0.032), ('value', 0.032), ('factorization', 0.032), ('fourier', 0.031), ('encoding', 0.03), ('approximating', 0.03), ('walls', 0.029), ('nonlinearities', 0.029), ('geodesic', 0.029), ('compares', 0.029), ('sparse', 0.028), ('policies', 0.028), ('massachusetts', 0.028), ('automatically', 0.028), ('nystr', 0.027), ('wp', 0.027), ('superiority', 0.027), ('bottleneck', 0.027), ('contrasts', 0.027), ('matrix', 0.027), ('global', 0.027), ('ect', 0.026), ('norm', 0.026), ('reward', 0.026), ('undirected', 0.026), ('door', 0.026), ('globally', 0.025), ('learned', 0.025), ('smoother', 0.025), ('manifold', 0.024), ('generalizing', 0.024), ('unstable', 0.024), ('blind', 0.023), ('piecewise', 0.023), ('scaling', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

Author: Sridhar Mahadevan, Mauro Maggioni

Abstract: We investigate the problem of automatically constructing efficient representations or basis functions for approximating value functions based on analyzing the structure and topology of the state space. In particular, two novel approaches to value function approximation are explored based on automatically constructing basis functions on state spaces that can be represented as graphs or manifolds: one approach uses the eigenfunctions of the Laplacian, in effect performing a global Fourier analysis on the graph; the second approach is based on diffusion wavelets, which generalize classical wavelets to graphs using multiscale dilations induced by powers of a diffusion operator or random walk on the graph. Together, these approaches form the foundation of a new generation of methods for solving large Markov decision processes, in which the underlying representation and policies are simultaneously learned.

2 0.38703603 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

Author: Boaz Nadler, Stephane Lafon, Ioannis Kevrekidis, Ronald R. Coifman

Abstract: This paper presents a diffusion based probabilistic interpretation of spectral clustering and dimensionality reduction algorithms that use the eigenvectors of the normalized graph Laplacian. Given the pairwise adjacency matrix of all points, we define a diffusion distance between any two data points and show that the low dimensional representation of the data by the first few eigenvectors of the corresponding Markov matrix is optimal under a certain mean squared error criterion. Furthermore, assuming that data points are random samples from a density p(x) = e−U (x) we identify these eigenvectors as discrete approximations of eigenfunctions of a Fokker-Planck operator in a potential 2U (x) with reflecting boundary conditions. Finally, applying known results regarding the eigenvalues and eigenfunctions of the continuous Fokker-Planck operator, we provide a mathematical justification for the success of spectral clustering and dimensional reduction algorithms based on these first few eigenvectors. This analysis elucidates, in terms of the characteristics of diffusion processes, many empirical findings regarding spectral clustering algorithms. Keywords: Algorithms and architectures, learning theory. 1

3 0.19576272 135 nips-2005-Neuronal Fiber Delineation in Area of Edema from Diffusion Weighted MRI

Author: Ofer Pasternak, Nathan Intrator, Nir Sochen, Yaniv Assaf

Abstract: Diffusion Tensor Magnetic Resonance Imaging (DT-MRI) is a non invasive method for brain neuronal fibers delineation. Here we show a modification for DT-MRI that allows delineation of neuronal fibers which are infiltrated by edema. We use the Muliple Tensor Variational (MTV) framework which replaces the diffusion model of DT-MRI with a multiple component model and fits it to the signal attenuation with a variational regularization mechanism. In order to reduce free water contamination we estimate the free water compartment volume fraction in each voxel, remove it, and then calculate the anisotropy of the remaining compartment. The variational framework was applied on data collected with conventional clinical parameters, containing only six diffusion directions. By using the variational framework we were able to overcome the highly ill posed fitting. The results show that we were able to find fibers that were not found by DT-MRI.

4 0.15534535 158 nips-2005-Products of ``Edge-perts

Author: Max Welling, Peter V. Gehler

Abstract: Images represent an important and abundant source of data. Understanding their statistical structure has important applications such as image compression and restoration. In this paper we propose a particular kind of probabilistic model, dubbed the “products of edge-perts model” to describe the structure of wavelet transformed images. We develop a practical denoising algorithm based on a single edge-pert and show state-ofthe-art denoising performance on benchmark images. 1

5 0.13625659 78 nips-2005-From Weighted Classification to Policy Search

Author: Doron Blatt, Alfred O. Hero

Abstract: This paper proposes an algorithm to convert a T -stage stochastic decision problem with a continuous state space to a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng, 2000, is solved using the Gauss-Seidel method. The algorithm breaks a multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be solved using off-the-self methods. Thus the algorithm converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. The implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies in the reinforcement learning problem. 1

6 0.12525195 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

7 0.11742975 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

8 0.1109932 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

9 0.10642826 144 nips-2005-Off-policy Learning with Options and Recognizers

10 0.10601389 104 nips-2005-Laplacian Score for Feature Selection

11 0.095911607 101 nips-2005-Is Early Vision Optimized for Extracting Higher-order Dependencies?

12 0.089822419 153 nips-2005-Policy-Gradient Methods for Planning

13 0.087692909 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models

14 0.085917532 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

15 0.078492954 178 nips-2005-Soft Clustering on Graphs

16 0.077097051 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

17 0.076976448 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks

18 0.067085132 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm

19 0.063751981 105 nips-2005-Large-Scale Multiclass Transduction

20 0.056552455 15 nips-2005-A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.194), (1, 0.079), (2, 0.141), (3, -0.1), (4, -0.295), (5, -0.03), (6, -0.022), (7, -0.189), (8, 0.172), (9, -0.151), (10, 0.18), (11, 0.155), (12, 0.065), (13, -0.119), (14, -0.199), (15, -0.098), (16, -0.144), (17, -0.152), (18, 0.191), (19, -0.017), (20, -0.139), (21, 0.072), (22, 0.019), (23, -0.139), (24, -0.042), (25, 0.049), (26, 0.079), (27, 0.08), (28, 0.11), (29, -0.007), (30, -0.069), (31, -0.007), (32, 0.075), (33, 0.098), (34, -0.02), (35, -0.104), (36, -0.004), (37, 0.025), (38, -0.005), (39, -0.009), (40, 0.034), (41, 0.053), (42, 0.025), (43, 0.025), (44, 0.023), (45, -0.068), (46, -0.018), (47, 0.049), (48, 0.02), (49, 0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9704417 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

Author: Sridhar Mahadevan, Mauro Maggioni

Abstract: We investigate the problem of automatically constructing efficient representations or basis functions for approximating value functions based on analyzing the structure and topology of the state space. In particular, two novel approaches to value function approximation are explored based on automatically constructing basis functions on state spaces that can be represented as graphs or manifolds: one approach uses the eigenfunctions of the Laplacian, in effect performing a global Fourier analysis on the graph; the second approach is based on diffusion wavelets, which generalize classical wavelets to graphs using multiscale dilations induced by powers of a diffusion operator or random walk on the graph. Together, these approaches form the foundation of a new generation of methods for solving large Markov decision processes, in which the underlying representation and policies are simultaneously learned.

2 0.74143064 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

Author: Boaz Nadler, Stephane Lafon, Ioannis Kevrekidis, Ronald R. Coifman

Abstract: This paper presents a diffusion based probabilistic interpretation of spectral clustering and dimensionality reduction algorithms that use the eigenvectors of the normalized graph Laplacian. Given the pairwise adjacency matrix of all points, we define a diffusion distance between any two data points and show that the low dimensional representation of the data by the first few eigenvectors of the corresponding Markov matrix is optimal under a certain mean squared error criterion. Furthermore, assuming that data points are random samples from a density p(x) = e−U (x) we identify these eigenvectors as discrete approximations of eigenfunctions of a Fokker-Planck operator in a potential 2U (x) with reflecting boundary conditions. Finally, applying known results regarding the eigenvalues and eigenfunctions of the continuous Fokker-Planck operator, we provide a mathematical justification for the success of spectral clustering and dimensional reduction algorithms based on these first few eigenvectors. This analysis elucidates, in terms of the characteristics of diffusion processes, many empirical findings regarding spectral clustering algorithms. Keywords: Algorithms and architectures, learning theory. 1

3 0.71468186 135 nips-2005-Neuronal Fiber Delineation in Area of Edema from Diffusion Weighted MRI

Author: Ofer Pasternak, Nathan Intrator, Nir Sochen, Yaniv Assaf

Abstract: Diffusion Tensor Magnetic Resonance Imaging (DT-MRI) is a non invasive method for brain neuronal fibers delineation. Here we show a modification for DT-MRI that allows delineation of neuronal fibers which are infiltrated by edema. We use the Muliple Tensor Variational (MTV) framework which replaces the diffusion model of DT-MRI with a multiple component model and fits it to the signal attenuation with a variational regularization mechanism. In order to reduce free water contamination we estimate the free water compartment volume fraction in each voxel, remove it, and then calculate the anisotropy of the remaining compartment. The variational framework was applied on data collected with conventional clinical parameters, containing only six diffusion directions. By using the variational framework we were able to overcome the highly ill posed fitting. The results show that we were able to find fibers that were not found by DT-MRI.

4 0.44628537 158 nips-2005-Products of ``Edge-perts

Author: Max Welling, Peter V. Gehler

Abstract: Images represent an important and abundant source of data. Understanding their statistical structure has important applications such as image compression and restoration. In this paper we propose a particular kind of probabilistic model, dubbed the “products of edge-perts model” to describe the structure of wavelet transformed images. We develop a practical denoising algorithm based on a single edge-pert and show state-ofthe-art denoising performance on benchmark images. 1

5 0.36338079 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

Author: Benjamin V. Roy

Abstract: We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to having projection weights equal to the invariant distribution of the resulting policy. Such projection weighting leads to the same fixed points as TD(0). Our analysis also leads to the first performance loss bound for approximate value iteration with an average cost objective.

6 0.330181 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods

7 0.32521042 101 nips-2005-Is Early Vision Optimized for Extracting Higher-order Dependencies?

8 0.31220195 144 nips-2005-Off-policy Learning with Options and Recognizers

9 0.30637014 78 nips-2005-From Weighted Classification to Policy Search

10 0.30110037 153 nips-2005-Policy-Gradient Methods for Planning

11 0.29179054 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

12 0.28706482 71 nips-2005-Fast Krylov Methods for N-Body Learning

13 0.26315337 104 nips-2005-Laplacian Score for Feature Selection

14 0.26138103 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

15 0.25686666 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

16 0.25390506 169 nips-2005-Saliency Based on Information Maximization

17 0.2438508 178 nips-2005-Soft Clustering on Graphs

18 0.24373728 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm

19 0.24339455 189 nips-2005-Tensor Subspace Analysis

20 0.24314404 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.043), (10, 0.059), (11, 0.014), (27, 0.053), (31, 0.09), (34, 0.067), (41, 0.015), (55, 0.018), (65, 0.337), (69, 0.047), (73, 0.05), (77, 0.013), (88, 0.058), (91, 0.048)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89069802 141 nips-2005-Norepinephrine and Neural Interrupts

Author: Peter Dayan, Angela J. Yu

Abstract: Experimental data indicate that norepinephrine is critically involved in aspects of vigilance and attention. Previously, we considered the function of this neuromodulatory system on a time scale of minutes and longer, and suggested that it signals global uncertainty arising from gross changes in environmental contingencies. However, norepinephrine is also known to be activated phasically by familiar stimuli in welllearned tasks. Here, we extend our uncertainty-based treatment of norepinephrine to this phasic mode, proposing that it is involved in the detection and reaction to state uncertainty within a task. This role of norepinephrine can be understood through the metaphor of neural interrupts. 1

same-paper 2 0.83268303 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

Author: Sridhar Mahadevan, Mauro Maggioni

Abstract: We investigate the problem of automatically constructing efficient representations or basis functions for approximating value functions based on analyzing the structure and topology of the state space. In particular, two novel approaches to value function approximation are explored based on automatically constructing basis functions on state spaces that can be represented as graphs or manifolds: one approach uses the eigenfunctions of the Laplacian, in effect performing a global Fourier analysis on the graph; the second approach is based on diffusion wavelets, which generalize classical wavelets to graphs using multiscale dilations induced by powers of a diffusion operator or random walk on the graph. Together, these approaches form the foundation of a new generation of methods for solving large Markov decision processes, in which the underlying representation and policies are simultaneously learned.

3 0.74718356 152 nips-2005-Phase Synchrony Rate for the Recognition of Motor Imagery in Brain-Computer Interface

Author: Le Song, Evian Gordon, Elly Gysels

Abstract: Motor imagery attenuates EEG µ and β rhythms over sensorimotor cortices. These amplitude changes are most successfully captured by the method of Common Spatial Patterns (CSP) and widely used in braincomputer interfaces (BCI). BCI methods based on amplitude information, however, have not incoporated the rich phase dynamics in the EEG rhythm. This study reports on a BCI method based on phase synchrony rate (SR). SR, computed from binarized phase locking value, describes the number of discrete synchronization events within a window. Statistical nonparametric tests show that SRs contain significant differences between 2 types of motor imageries. Classifiers trained on SRs consistently demonstrate satisfactory results for all 5 subjects. It is further observed that, for 3 subjects, phase is more discriminative than amplitude in the first 1.5-2.0 s, which suggests that phase has the potential to boost the information transfer rate in BCIs. 1

4 0.74578893 172 nips-2005-Selecting Landmark Points for Sparse Manifold Learning

Author: Jorge Silva, Jorge Marques, João Lemos

Abstract: There has been a surge of interest in learning non-linear manifold models to approximate high-dimensional data. Both for computational complexity reasons and for generalization capability, sparsity is a desired feature in such models. This usually means dimensionality reduction, which naturally implies estimating the intrinsic dimension, but it can also mean selecting a subset of the data to use as landmarks, which is especially important because many existing algorithms have quadratic complexity in the number of observations. This paper presents an algorithm for selecting landmarks, based on LASSO regression, which is well known to favor sparse approximations because it uses regularization with an l1 norm. As an added benefit, a continuous manifold parameterization, based on the landmarks, is also found. Experimental results with synthetic and real data illustrate the algorithm. 1

5 0.44932586 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization

Author: Maxim Raginsky, Svetlana Lazebnik

Abstract: We introduce a technique for dimensionality estimation based on the notion of quantization dimension, which connects the asymptotic optimal quantization error for a probability distribution on a manifold to its intrinsic dimension. The definition of quantization dimension yields a family of estimation algorithms, whose limiting case is equivalent to a recent method based on packing numbers. Using the formalism of high-rate vector quantization, we address issues of statistical consistency and analyze the behavior of our scheme in the presence of noise.

6 0.4403421 26 nips-2005-An exploration-exploitation model based on norepinepherine and dopamine activity

7 0.43838304 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning

8 0.42605969 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine

9 0.42594126 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

10 0.42418727 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

11 0.42247126 78 nips-2005-From Weighted Classification to Policy Search

12 0.420095 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm

13 0.41819784 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions

14 0.41425377 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

15 0.41366768 153 nips-2005-Policy-Gradient Methods for Planning

16 0.41237992 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

17 0.411542 144 nips-2005-Off-policy Learning with Options and Recognizers

18 0.40854079 46 nips-2005-Consensus Propagation

19 0.40831178 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

20 0.40749574 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs