nips nips2010 nips2010-134 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mohammad Ghavamzadeh, Alessandro Lazaric, Odalric Maillard, Rémi Munos
Abstract: We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a high-dimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm.
Reference: text
sentIndex sentText sentNum sentScore
1 In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a high-dimensional space. [sent-2, score-0.357]
2 We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. [sent-3, score-0.25]
3 We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm. [sent-4, score-1.1]
4 1 Introduction. Least-squares temporal difference (LSTD) learning [3, 2] is a widely used reinforcement learning (RL) algorithm for learning the value function V^π of a given policy π. [sent-5, score-0.452]
5 LSTD has been successfully applied to a number of problems especially after the development of the least-squares policy iteration (LSPI) algorithm [9], which extends LSTD to control problems by using it in the policy evaluation step of policy iteration. [sent-6, score-0.983]
6 More precisely, LSTD computes the fixed point of the operator ΠT^π, where T^π is the Bellman operator of policy π and Π is the projection operator onto a linear function space. [sent-7, score-0.748]
7 The choice of the linear function space has a major impact on the accuracy of the value function estimated by LSTD, and thus, on the quality of the policy learned by LSPI. [sent-8, score-0.437]
8 [7] proposed an algorithm in which the state space is repeatedly projected onto a lower dimensional space based on the Bellman error and then states are aggregated in this space to define new features. [sent-15, score-0.357]
9 [17] presented a method that iteratively adds features to a linear approximation architecture such that each new feature is derived from the Bellman error of the existing set of features. [sent-17, score-0.225]
10 Theoretically speaking, increasing the size of the function space can reduce the approximation error (the distance between the target function and the space) at the cost of a growth in the estimation error. [sent-20, score-0.204]
11 In this paper, we follow a different approach based on random projections [21]. [sent-27, score-0.211]
12 In particular, we study the performance of LSTD with random projections (LSTD-RP). [sent-28, score-0.211]
13 Given a high-dimensional linear space F, LSTD-RP learns the value function of a given policy from a small (relative to the dimension of F) number of samples in a space G of lower dimension obtained by linear random projection of the features of F. [sent-29, score-0.914]
14 We prove that solving the problem in the low dimensional random space instead of the original high-dimensional space reduces the estimation error at the price of a “controlled” increase in the approximation error of the original space F. [sent-30, score-0.425]
15 2 Preliminaries. For a measurable space with domain X, we let S(X) and B(X; L) denote the set of probability measures over X and the space of measurable functions with domain X and bounded in absolute value by 0 < L < ∞, respectively. [sent-34, score-0.253]
16 A deterministic policy π : X → A is a mapping from states to actions. [sent-42, score-0.347]
17 Under a policy π, the MDP M is reduced to a Markov chain M^π = ⟨X, R^π, P^π, γ⟩ with reward R^π(x) = r(x, π(x)), transition kernel P^π(·|x) = P(·|x, π(x)), and stationary distribution ρ^π (if it admits one). [sent-43, score-0.565]
18 The value function of a policy π, V^π, is the unique fixed point of the Bellman operator T^π : B(X; V_max = R_max/(1−γ)) → B(X; V_max) defined by (T^π V)(x) = R^π(x) + γ ∫_X P^π(dy|x) V(y). [sent-44, score-0.424]
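To make the fixed-point property concrete, here is a minimal numerical sketch (not from the paper; all numbers are illustrative) on a three-state Markov chain: V^π solves the linear system (I − γP^π)V = R^π, and applying T^π to it leaves it unchanged.

```python
import numpy as np

# Illustrative 3-state Markov chain for a fixed policy pi (made-up numbers).
gamma = 0.9
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])   # transition kernel P^pi
R = np.array([1.0, 0.0, 0.5])     # reward vector R^pi

def bellman(V):
    """Apply the Bellman operator: (T^pi V)(x) = R^pi(x) + gamma * sum_y P^pi(y|x) V(y)."""
    return R + gamma * P @ V

# The value function is the unique fixed point: V^pi = (I - gamma * P^pi)^{-1} R^pi.
V_pi = np.linalg.solve(np.eye(3) - gamma * P, R)
assert np.allclose(bellman(V_pi), V_pi)
```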
19 We define the orthogonal projection of V onto the space F w.r.t. [sent-58, score-0.265]
20 From F we can generate a d-dimensional (d < D) random space G = {g_β | g_β(·) = Ψ(·)^⊤β, β ∈ R^d}, where the feature vector Ψ(·) = (ψ_1(·), … [sent-62, score-0.165]
21 Similar to the space F, we define the orthogonal projection of V onto the space G w.r.t. [sent-71, score-0.338]
22 We show that solving the problem in the low dimensional space instead of the original high-dimensional space reduces the estimation error at the price of a “controlled” increase in the approximation error. [sent-77, score-0.247]
23 We use the linear spaces F and G with dimensions D and d (d < D) as defined in Section 2. [sent-80, score-0.179]
24 Since in the following the policy is fixed, we drop the dependency of R^π, P^π, V^π, and T^π on π and simply use R, P, V, and T. [sent-81, score-0.307]
25 Let {X_t}_{t=1}^n be a sample path (or trajectory) of size n generated by the Markov chain M^π, and let v ∈ R^n and r ∈ R^n, defined as v_t = V(X_t) and r_t = R(X_t), be the value and reward vectors of this trajectory. [sent-82, score-0.215]
26 We denote by Π_G : R^n → G_n the orthogonal projection onto G_n, defined by Π_G y = arg min_{z∈G_n} ||y − z||_n, where ||y||_n^2 = (1/n) Σ_{t=1}^n y_t^2. [sent-87, score-0.224]
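As a small illustration of this empirical projection (with made-up data; the variable names are ours), the sketch below computes Π_G y by ordinary least squares on the sampled features and checks that the residual is orthogonal to the columns of Ψ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
Psi = rng.normal(size=(n, d))      # low-dimensional features evaluated at X_1, ..., X_n
y = rng.normal(size=n)             # any vector in R^n

# Pi_G y = Psi @ beta_ls, where beta_ls minimizes ||y - Psi beta||_n.
beta_ls, *_ = np.linalg.lstsq(Psi, y, rcond=None)
y_proj = Psi @ beta_ls

# Orthogonality of the residual to G_n (normal equations).
assert np.allclose(Psi.T @ (y - y_proj), 0.0, atol=1e-8)
```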
27 Similarly, we can define the orthogonal projection onto F_n = {Φα | α ∈ R^D} as Π_F y = arg min_{z∈F_n} ||y − z||_n, where Φ = [φ(X_1)^⊤; … [sent-88, score-0.224]
28 Note that for any y ∈ R^n, the orthogonal projections Π_G y and Π_F y exist and are unique. [sent-92, score-0.2]
29 Pathwise-LSTD takes a single trajectory {X_t}_{t=1}^n of size n generated by the Markov chain as input and returns the fixed point of the empirical operator Π_G T̂, where T̂ is the pathwise Bellman operator defined as T̂y = r + γP̂y. [sent-94, score-0.318]
30 Note that the uniqueness of v̂ does not imply the uniqueness of the parameter β̂ such that v̂ = Ψβ̂. [sent-97, score-0.226]
31 samples from N(0, 1/d) — O(dD); • the low-dim feature matrix Ψ_{n×d} = [Ψ(X_1)^⊤; … [sent-104, score-0.167]
32 ; Ψ(X_n)^⊤; 0^⊤] — O(nd); • Ã_{d×d} = Ψ^⊤(Ψ − γΨ′) — O(nd + nd²), b̃_{d×1} = Ψ^⊤r — O(nd); • return either β̂ = Ã^{−1}b̃ or β̂ = Ã^{+}b̃ (Ã^{+} is the Moore-Penrose pseudo-inverse of Ã) — O(d² + d³). Figure 1: The pseudo-code of the LSTD with random projections (LSTD-RP) algorithm. [sent-110, score-0.211]
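The sketch below mirrors the steps of Figure 1 (Gaussian projection with N(0, 1/d) entries, pathwise next-state features with a zero row after X_n, pseudo-inverse when the system is singular). It is an illustrative re-implementation, not the authors' code; the function name lstd_rp and the return convention are ours.

```python
import numpy as np

def lstd_rp(phi, rewards, gamma, d, rng=None):
    """Illustrative sketch of LSTD-RP following the pseudo-code in Figure 1.

    phi     : (n, D) matrix of high-dimensional features at X_1, ..., X_n
    rewards : (n,) vector of rewards r_t = R(X_t)
    gamma   : discount factor in [0, 1)
    d       : dimension of the random space G (d < D)
    Returns (beta_hat, A) so that the value estimate at x is phi(x) @ A @ beta_hat.
    """
    rng = np.random.default_rng(rng)
    n, D = phi.shape

    # Random projection with i.i.d. N(0, 1/d) entries; Psi(x) = A^T phi(x).
    A = rng.normal(scale=1.0 / np.sqrt(d), size=(D, d))
    Psi = phi @ A                                   # (n, d) low-dimensional features

    # Pathwise next-state features: shift by one step, zero row after the last state.
    Psi_next = np.vstack([Psi[1:], np.zeros(d)])

    A_tilde = Psi.T @ (Psi - gamma * Psi_next)      # d x d system matrix
    b_tilde = Psi.T @ rewards                       # d-dimensional right-hand side
    beta_hat = np.linalg.pinv(A_tilde) @ b_tilde    # Moore-Penrose if A_tilde is singular
    return beta_hat, A
```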
33 We then derive a condition on the number of samples to guarantee the uniqueness of the LSTD-RP solution. [sent-120, score-0.189]
34 Finally, from the Markov design bound we obtain generalization bounds when the Markov chain has a stationary distribution. [sent-121, score-0.287]
35 Let F and G be linear spaces with dimensions D and d (d < D) as defined in Section 2. [sent-124, score-0.179]
36 Let {X_t}_{t=1}^n be a sample path generated by the Markov chain M^π, and v, v̂ ∈ R^n be the vectors whose components are the value function and the LSTD-RP solution at {X_t}_{t=1}^n. [sent-125, score-0.175]
37 both the random sample path and the random projection), v̂ satisfies ||v − v̂||_n ≤ (1/√(1−γ²)) [ ||v − Π_F v||_n + m(Π_F v) √(8 log(8n/δ)/d) ] + (γ V_max L/(1−γ)) √(d/ν_n) [ √(8 log(4d/δ)/n) + 1/n ], (1) where the random variable ν_n is the smallest strictly positive eigenvalue of the sample-based Gram matrix (1/n)Ψ^⊤Ψ. [sent-129, score-0.503]
38 Let F and G be linear spaces with dimensions D and d (d < D) as defined in Section 2. [sent-133, score-0.179]
39 The proof relies on the application of a variant of the Johnson-Lindenstrauss (JL) lemma, which states that inner products are approximately preserved by the application of the random matrix A (see, e.g., [sent-140, score-0.22]
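As a quick sanity check of the JL-type property invoked here (with arbitrary vectors and illustrative dimensions), the sketch below verifies empirically that the inner product after projection, ⟨A^⊤u, A^⊤v⟩, concentrates around ⟨u, v⟩ when the entries of A are i.i.d. N(0, 1/d).

```python
import numpy as np

rng = np.random.default_rng(1)
D, d, trials = 2000, 100, 100
u, v = rng.normal(size=D), rng.normal(size=D)

estimates = []
for _ in range(trials):
    A = rng.normal(scale=1.0 / np.sqrt(d), size=(D, d))   # entries ~ N(0, 1/d)
    estimates.append((A.T @ u) @ (A.T @ v))               # inner product in the projected space

print("true <u, v>         :", u @ v)
print("projected, averaged :", np.mean(estimates))
print("typical deviation   :", np.std(estimates))         # shrinks roughly like 1/sqrt(d)
```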
40 Note that m(ΠF v) highly depends on the value function V that is being approximated and the features of the space F. [sent-170, score-0.16]
41 It is important to carefully tune the value of d as both the estimation error and the additional approximation error in Eq. [sent-171, score-0.181]
42 For instance, while a small value of d significantly reduces the estimation error (and the need for samples), it may amplify the additional approximation error term, and thus, reduce the advantage of LSTD-RP over LSTD. [sent-173, score-0.181]
43 As discussed in the introduction, when the dimensionality D of F is much bigger than the number of samples n, the learning algorithms are likely to overfit the data. [sent-184, score-0.193]
44 Note that Assumption 1 is likely to hold whenever D ≫ n, because in this case we can expect the features to be independent enough on {X_i}_{i=1}^n so that the rank of (1/n)Φ^⊤Φ equals n (e.g., [sent-190, score-0.207]
45 Under Assumption 1 we can remove the empirical approximation error term in Theorem 1 and deduce the following result. [sent-193, score-0.162]
46 the random sample path and the random space), v̂ satisfies ||v − v̂||_n ≤ … [sent-198, score-0.169]
47 Uniqueness of the LSTD-RP Solution. While the results in the previous section hold for any Markov chain, in this section we assume that the Markov chain M^π admits a stationary distribution ρ and is exponentially fast β-mixing with parameters β̄, b, κ, i.e., … [sent-200, score-0.218]
48 As shown in [11, 10], if ρ exists, it is possible to derive a condition for the existence and uniqueness of the LSTD solution depending on the number of samples and the smallest eigenvalue of the Gram matrix defined according to the stationary distribution ρ, i.e., … [sent-207, score-0.63]
49 Note that as D increases, the smallest eigenvalue of G is likely to become smaller and smaller. [sent-211, score-0.227]
50 In the following we prove that the smallest eigenvalue of the Gram matrix H ∈ R^{d×d}, H_ij = ∫ ψ_i(x)ψ_j(x)ρ(dx), in the random space G is indeed bigger than the smallest eigenvalue of G with high probability. [sent-214, score-0.751]
51 Let δ > 0 and F and G be linear spaces with dimensions D and d (d < D) as defined in Section 2 with D > d + 2√(2d log(2/δ)) + 2 log(2/δ). [sent-216, score-0.179]
52 Let the elements of the projection matrix A be Gaussian random variables drawn from N(0, 1/d). [sent-217, score-0.209]
53 Let the Markov chain M^π admit a stationary distribution ρ. [sent-218, score-0.248]
54 Let G and H be the Gram matrices according to ρ for the spaces F and G, and ω and χ be their smallest eigenvalues. [sent-219, score-0.223]
55 Note that if the elements of A are drawn from the Gaussian distribution N(0, 1/d), the elements of B are standard Gaussian random variables, and thus the smallest eigenvalue ξ of AA^⊤ can be written as ξ = s_min²(B^⊤)/d. [sent-229, score-0.281]
56 For a D × d random matrix with independent standard normal random variables, such as B^⊤, we have with probability 1 − δ (see [19] for more details) s_min(B^⊤) ≥ √D − √d − √(2 log(2/δ)). [sent-233, score-0.233]
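The sketch below checks this lower bound empirically for one illustrative choice of D, d, and δ; over repeated draws of a D × d standard Gaussian matrix, the fraction of violations should stay below δ.

```python
import numpy as np

rng = np.random.default_rng(2)
D, d, delta, trials = 2000, 50, 0.05, 200
bound = np.sqrt(D) - np.sqrt(d) - np.sqrt(2 * np.log(2 / delta))

violations = 0
for _ in range(trials):
    B = rng.normal(size=(D, d))                     # standard Gaussian entries
    smin = np.linalg.svd(B, compute_uv=False)[-1]   # smallest singular value
    violations += int(smin < bound)

print(f"bound = {bound:.2f}, violation rate = {violations / trials:.3f} (should be <= {delta})")
```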
57 Thus, we can conclude that with high probability the smallest eigenvalue χ of the Gram matrix H of the randomly generated low-dimensional space G is bigger than the smallest eigenvalue ω of the Gram matrix G of the high-dimensional space F. [sent-243, score-0.823]
58 Let δ > 0 and F and G be linear spaces with dimensions D and d (d < D) as defined in Section 2 with D > d + 2√(2d log(2/δ)) + 2 log(2/δ). [sent-245, score-0.179]
59 Let the elements of the projection matrix A be Gaussian random variables drawn from N(0, 1/d). [sent-246, score-0.209]
60 Let the Markov chain M^π admit a stationary distribution ρ. [sent-247, score-0.248]
61 Let G be the Gram matrix according to ρ for space F and ω be its smallest eigenvalue. [sent-248, score-0.243]
62 Let {X_t}_{t=1}^n be a trajectory of length n generated by a stationary β-mixing process with stationary distribution ρ. [sent-249, score-0.323]
63 If the number of samples n satisfies n > (288 L² d Λ(n, d, δ/2))/(ω D) · max{ Λ(n, d, δ/2)/b, 1 }^{1/κ} · ( 1 − √(d/D) − √(2 log(2/δ)/D) )^{−2}, (13) where Λ(n, d, δ) = 2(d + 1) log n + log(e/δ) + log⁺( max{ 18(6e)^{2(d+1)}, β̄ } ), then with probability 1 − δ, the features ψ_1, … [sent-250, score-0.26]
64 , ||g_β||_n = 0 implies β = 0, and the smallest eigenvalue ν_n of the sample-based Gram matrix (1/n)Ψ^⊤Ψ satisfies √ν_n ≥ √ν = (√ω/2) √(D/d) ( 1 − √(d/D) − √(2 log(2/δ)/D) ) − 6L √( (2Λ(n, d, δ/2)/n) · max{ Λ(n, d, δ/2)/b, 1 }^{1/κ} ) > 0. [sent-255, score-0.28]
65 13 in [10], we can see that the number of samples needed for the empirical Gram matrix (1/n)Ψ^⊤Ψ in G to be invertible with high probability is less than that for its counterpart (1/n)Φ^⊤Φ in the high-dimensional space F. [sent-261, score-0.202]
66 Generalization Bound. In this section, we show how Theorem 1 can be generalized to the entire state space X when the Markov chain M^π has a stationary distribution ρ. [sent-263, score-0.291]
67 We consider the case in which the samples {X_t}_{t=1}^n are obtained by following a single trajectory in the stationary regime of M^π, i.e., … [sent-264, score-0.266]
68 1, it is reasonable to assume that the high-dimensional space F contains functions that are able to perfectly fit the value function V in any finite number n (n < D) of states {X_t}_{t=1}^n; thus, we state the following theorem under Assumption 1. [sent-268, score-0.225]
69 Let δ > 0 and F and G be linear spaces with dimensions D and d (d < D) as defined in Section 2 with d ≥ 15 log(8n/δ). [sent-270, score-0.179]
70 Let {X_t}_{t=1}^n be a path generated by a stationary β-mixing process with stationary distribution ρ. [sent-271, score-0.327]
71 the random sample path and the random space), ||V − T̃(V̂)||_ρ ≤ (2/√(1−γ²)) m(Π_F V) √(8 log(24n/δ)/d) + (2γ V_max L/(1−γ)) √(d/ν) [ √(8 log(12d/δ)/n) + 1/n ] + …, (15) where ν is a lower bound on the eigenvalues of the Gram matrix (1/n)Ψ^⊤Ψ defined by Eq. [sent-276, score-0.291]
72 The proof is a consequence of applying concentration of measures inequalities for β-mixing processes and linear spaces (see Corollary 18 in [10]) on the term ||V − T̃(V̂)||_n, using the fact that ||V − T̃(V̂)||_n ≤ ||V − V̂||_n, and using the bound of Corollary 1. [sent-282, score-0.26]
73 The bound of Corollary 1 and the lower bound on ν each hold with probability 1 − δ′; thus, the statement of the theorem (Eq. [sent-283, score-0.188]
74 An interesting property of the bound in Theorem 2 is that the approximation error of V in space F, ||V − ΠF V ||ρ , does not appear and the error of the LSTD solution in the randomly projected space only depends on the dimensionality d of G and the number of samples n. [sent-286, score-0.443]
75 An interesting case here is when the dimension of F is infinite (D = ∞), so that the bound is valid for any number of samples n. [sent-290, score-0.188]
76 In [15], two approximation spaces F of infinite dimension were constructed based on a multi-resolution set of features that are rescaled and translated versions of a given mother function. [sent-291, score-0.301]
77 In the case that the mother function is a wavelet, the resulting features, called scrambled wavelets, are linear combinations of wavelets at all scales weighted by Gaussian coefficients. [sent-292, score-0.172]
78 As a result, the corresponding approximation space is a Sobolev space H^s(X) with smoothness of order s > p/2, where p is the dimension of the state space X. [sent-293, score-0.312]
79 What is important about the results of [15] is that they show it is possible to consider infinite-dimensional function spaces for which sup_x ||φ(x)||_2 is finite and ||α||_2 is expressed in terms of the norm of f_α in F. [sent-298, score-0.195]
80 In such cases, m(ΠF V ) is finite and the bound of Theorem 2, which does not contain any approximation error of V in F, holds for any n. [sent-299, score-0.17]
81 Under Assumption 1, when D = ∞, by selecting the features as described in the previous remark and optimizing the value of d as in Eq. [sent-304, score-0.159]
82 In fact, they both contain the Sobolev norm of the target function and have a similar dependency on the number of samples with a convergence rate of O(n^{−1/4}) (when the smoothness of the Sobolev space in [4] is chosen to be half of the dimensionality of X). [sent-313, score-0.186]
83 This similarity calls for further investigation of the difference between ℓ₂-regularized methods and random projections in terms of prediction performance and computational complexity. [sent-314, score-0.211]
84 5 LSPI with Random Projections. In this section, we move from policy evaluation to policy iteration and provide a performance bound for LSPI with random projections (LSPI-RP), i.e., [sent-315, score-0.956]
85 , a policy iteration algorithm that uses LSTD-RP at each iteration. [sent-317, score-0.369]
86 LSPI-RP starts with an arbitrary initial value function V−1 ∈ B(X ; Vmax ) and its corresponding greedy policy π0 . [sent-318, score-0.368]
87 At the first iteration, it approximates V^{π_0} using LSTD-RP and returns a function V̂_0, whose truncated version Ṽ_0 = T̃(V̂_0) is used to build the policy for the second iteration, i.e., the greedy policy w.r.t. Ṽ_0. [sent-319, score-0.349]
88 Moreover, the LSTD-RP performance bounds require the samples to be collected by following the policy under evaluation. [sent-326, score-0.383]
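Putting the pieces together, here is an illustrative sketch of the LSPI-RP loop: at each iteration, collect an on-policy trajectory, evaluate the current policy with LSTD-RP (the lstd_rp sketch above, assumed to return (β̂, A)), truncate the estimate to [−V_max, V_max], and act greedily. The helpers collect_trajectory and greedy_policy are placeholders for the environment-specific parts, and whether the projection is redrawn at each iteration is left here as a design choice.

```python
import numpy as np

def lspi_rp(phi_fn, mdp, gamma, d, v_max, K, n, rng=None):
    """Illustrative LSPI-RP loop; `mdp`, `collect_trajectory`, `greedy_policy` are placeholders."""
    rng = np.random.default_rng(rng)
    policy = lambda x: mdp.random_action(rng)            # arbitrary initial policy pi_0

    for k in range(K):
        # On-policy trajectory of length n, then policy evaluation with LSTD-RP.
        states, rewards = collect_trajectory(mdp, policy, n, rng)
        phi = np.vstack([phi_fn(x) for x in states])
        beta, A = lstd_rp(phi, rewards, gamma, d, rng)

        # Truncated value estimate V~_k in [-V_max, V_max], then greedy improvement
        # (the greedy step needs the MDP model, as the footnote below notes).
        v_tilde = lambda x, beta=beta, A=A: float(np.clip(phi_fn(x) @ A @ beta, -v_max, v_max))
        policy = greedy_policy(mdp, v_tilde, gamma)
    return policy
```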
89 Let δ > 0 and F and G be linear spaces with dimensions D and d (d < D) as defined in Section 2 with d ≥ 15 log(8Kn/δ). [sent-329, score-0.179]
90 At each iteration k, we generate a path of size n from the stationary β-mixing process with stationary distribution ρk−1 = ρπk−1 . [sent-330, score-0.389]
91 , V̂_{K−1}) (Ṽ_0, …, Ṽ_{K−1}) be the sequence of value functions (truncated value functions) generated by LSPI-RP, and π_K be the greedy policy w.r.t. [sent-339, score-0.397]
92 It is important to note that Assumption 1 is needed to bound the performance of LSPI independently of the use of random projections (see [10]). [sent-356, score-0.28]
93 On the other hand, Assumption 2 is explicitly related to random projections and allows us to bound the term m(ΠF V ). [sent-357, score-0.308]
94 In order for this assumption to hold, the features {φ_j}_{j=1}^D of the high-dimensional space F should be carefully chosen so as to be linearly independent w.r.t. [sent-358, score-0.173]
95 6 Conclusions. Learning in high-dimensional linear spaces is particularly appealing in RL because it allows for a very accurate approximation of value functions. [sent-362, score-0.213]
96 In this paper, we introduced an algorithm, called LSTD-RP, in which LSTD is run in a low-dimensional space obtained by a random projection of the original high-dimensional space. [sent-364, score-0.229]
97 , the estimation error depends on the value of the low dimension) at the cost of a slight worsening in the approximation accuracy compared to the high-dimensional space. [sent-367, score-0.16]
98 We also analyzed the performance of LSPI-RP, a policy iteration algorithm that uses LSTD-RP for policy evaluation. [sent-368, score-0.676]
99 1 Note that the MDP model is needed to generate a greedy policy πk . [sent-371, score-0.339]
100 Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. [sent-378, score-0.369]
wordName wordTfidf (topN-words)
[('lstd', 0.485), ('vmax', 0.383), ('policy', 0.307), ('gram', 0.189), ('projections', 0.157), ('lspi', 0.154), ('xt', 0.134), ('stationary', 0.133), ('smallest', 0.117), ('bigger', 0.117), ('uniqueness', 0.113), ('ghavamzadeh', 0.112), ('eigenvalue', 0.11), ('spaces', 0.106), ('projection', 0.102), ('rl', 0.101), ('bellman', 0.1), ('lazaric', 0.096), ('operator', 0.088), ('maillard', 0.088), ('chain', 0.085), ('samples', 0.076), ('space', 0.073), ('remark', 0.072), ('smin', 0.072), ('markov', 0.071), ('rn', 0.069), ('bound', 0.069), ('reinforcement', 0.064), ('log', 0.063), ('gn', 0.063), ('farahmand', 0.062), ('sobolev', 0.062), ('iteration', 0.062), ('path', 0.061), ('inria', 0.06), ('features', 0.058), ('trajectory', 0.057), ('aga', 0.055), ('menache', 0.055), ('ndd', 0.055), ('parr', 0.054), ('random', 0.054), ('mdp', 0.053), ('matrix', 0.053), ('temporal', 0.052), ('wavelets', 0.052), ('supx', 0.052), ('error', 0.051), ('vk', 0.05), ('theorem', 0.05), ('approximation', 0.05), ('aa', 0.049), ('scrambled', 0.048), ('onto', 0.047), ('rd', 0.046), ('rmax', 0.046), ('dimensions', 0.045), ('lemma', 0.044), ('slowest', 0.044), ('minz', 0.044), ('mother', 0.044), ('dimension', 0.043), ('orthogonal', 0.043), ('assumption', 0.042), ('truncated', 0.042), ('keller', 0.041), ('inf', 0.041), ('reward', 0.04), ('singular', 0.04), ('states', 0.04), ('fn', 0.039), ('thorough', 0.039), ('corollary', 0.039), ('measurable', 0.039), ('feature', 0.038), ('dy', 0.037), ('dd', 0.037), ('mannor', 0.037), ('norm', 0.037), ('tted', 0.035), ('xn', 0.035), ('truncation', 0.035), ('highdimensional', 0.033), ('deduce', 0.033), ('whenever', 0.032), ('greedy', 0.032), ('arg', 0.032), ('propagated', 0.031), ('szepesv', 0.031), ('dx', 0.03), ('cost', 0.03), ('admit', 0.03), ('proof', 0.029), ('extreme', 0.029), ('value', 0.029), ('randomness', 0.028), ('linear', 0.028), ('term', 0.028), ('existence', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 134 nips-2010-LSTD with Random Projections
Author: Mohammad Ghavamzadeh, Alessandro Lazaric, Odalric Maillard, Rémi Munos
Abstract: We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a highdimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm. 1
2 0.31402352 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
Author: Amir-massoud Farahmand, Csaba Szepesvári, Rémi Munos
Abstract: We address the question of how the approximation error/Bellman residual at each iteration of the Approximate Policy/Value Iteration algorithms influences the quality of the resulted policy. We quantify the performance loss as the Lp norm of the approximation error/Bellman residual at each iteration. Moreover, we show that the performance loss depends on the expectation of the squared Radon-Nikodym derivative of a certain distribution rather than its supremum – as opposed to what has been suggested by the previous results. Also our results indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and the effect of an error term in the earlier iterations decays exponentially fast. 1
3 0.25717348 208 nips-2010-Policy gradients in linearly-solvable MDPs
Author: Emanuel Todorov
Abstract: We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy gradients for continuous-time stochastic systems.
4 0.24637465 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
Author: Jeffrey Johns, Christopher Painter-wakefield, Ronald Parr
Abstract: Recent work in reinforcement learning has emphasized the power of L1 regularization to perform feature selection and prevent overfitting. We propose formulating the L1 regularized linear fixed point problem as a linear complementarity problem (LCP). This formulation offers several advantages over the LARS-inspired formulation, LARS-TD. The LCP formulation allows the use of efficient off-theshelf solvers, leads to a new uniqueness result, and can be initialized with starting points from similar problems (warm starts). We demonstrate that warm starts, as well as the efficiency of LCP solvers, can speed up policy iteration. Moreover, warm starts permit a form of modified policy iteration that can be used to approximate a “greedy” homotopy path, a generalization of the LARS-TD homotopy path that combines policy evaluation and optimization. 1
5 0.24619505 212 nips-2010-Predictive State Temporal Difference Learning
Author: Byron Boots, Geoffrey J. Gordon
Abstract: We propose a new approach to value function approximation which combines linear temporal difference reinforcement learning with subspace identification. In practical applications, reinforcement learning (RL) is complicated by the fact that state is either high-dimensional or partially observable. Therefore, RL methods are designed to work with features of state rather than state itself, and the success or failure of learning is often determined by the suitability of the selected features. By comparison, subspace identification (SSID) methods are designed to select a feature set which preserves as much information as possible about state. In this paper we connect the two approaches, looking at the problem of reinforcement learning with a large set of features, each of which may only be marginally useful for value function approximation. We introduce a new algorithm for this situation, called Predictive State Temporal Difference (PSTD) learning. As in SSID for predictive state representations, PSTD finds a linear compression operator that projects a large set of features down to a small set that preserves the maximum amount of predictive information. As in RL, PSTD then uses a Bellman recursion to estimate a value function. We discuss the connection between PSTD and prior approaches in RL and SSID. We prove that PSTD is statistically consistent, perform several experiments that illustrate its properties, and demonstrate its potential on a difficult optimal stopping problem. 1
6 0.24055596 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
7 0.21852535 152 nips-2010-Learning from Logged Implicit Exploration Data
8 0.21363448 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
9 0.20272702 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
10 0.18791997 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
11 0.18056253 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
12 0.163431 233 nips-2010-Scrambled Objects for Least-Squares Regression
13 0.15108481 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
14 0.13289623 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
15 0.12655425 43 nips-2010-Bootstrapping Apprenticeship Learning
16 0.10639466 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
17 0.10620937 222 nips-2010-Random Walk Approach to Regret Minimization
18 0.099555582 148 nips-2010-Learning Networks of Stochastic Differential Equations
19 0.099480502 64 nips-2010-Distributionally Robust Markov Decision Processes
20 0.09879905 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
topicId topicWeight
[(0, 0.281), (1, -0.347), (2, 0.054), (3, 0.02), (4, 0.069), (5, -0.059), (6, 0.076), (7, 0.008), (8, -0.032), (9, -0.101), (10, 0.042), (11, -0.084), (12, -0.037), (13, 0.028), (14, -0.026), (15, -0.006), (16, 0.191), (17, -0.072), (18, 0.02), (19, 0.032), (20, 0.033), (21, 0.071), (22, -0.027), (23, 0.009), (24, -0.059), (25, 0.019), (26, -0.003), (27, -0.023), (28, -0.068), (29, -0.004), (30, 0.013), (31, -0.034), (32, -0.056), (33, 0.069), (34, 0.054), (35, 0.014), (36, 0.019), (37, 0.075), (38, -0.103), (39, 0.02), (40, 0.082), (41, 0.026), (42, 0.023), (43, -0.115), (44, 0.01), (45, -0.048), (46, 0.054), (47, -0.117), (48, -0.089), (49, 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.93176454 134 nips-2010-LSTD with Random Projections
Author: Mohammad Ghavamzadeh, Alessandro Lazaric, Odalric Maillard, Rémi Munos
Abstract: We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a highdimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm. 1
2 0.90719271 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
Author: Amir-massoud Farahmand, Csaba Szepesvári, Rémi Munos
Abstract: We address the question of how the approximation error/Bellman residual at each iteration of the Approximate Policy/Value Iteration algorithms influences the quality of the resulted policy. We quantify the performance loss as the Lp norm of the approximation error/Bellman residual at each iteration. Moreover, we show that the performance loss depends on the expectation of the squared Radon-Nikodym derivative of a certain distribution rather than its supremum – as opposed to what has been suggested by the previous results. Also our results indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and the effect of an error term in the earlier iterations decays exponentially fast. 1
3 0.81728327 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
Author: Jeffrey Johns, Christopher Painter-wakefield, Ronald Parr
Abstract: Recent work in reinforcement learning has emphasized the power of L1 regularization to perform feature selection and prevent overfitting. We propose formulating the L1 regularized linear fixed point problem as a linear complementarity problem (LCP). This formulation offers several advantages over the LARS-inspired formulation, LARS-TD. The LCP formulation allows the use of efficient off-theshelf solvers, leads to a new uniqueness result, and can be initialized with starting points from similar problems (warm starts). We demonstrate that warm starts, as well as the efficiency of LCP solvers, can speed up policy iteration. Moreover, warm starts permit a form of modified policy iteration that can be used to approximate a “greedy” homotopy path, a generalization of the LARS-TD homotopy path that combines policy evaluation and optimization. 1
4 0.75274473 208 nips-2010-Policy gradients in linearly-solvable MDPs
Author: Emanuel Todorov
Abstract: We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy gradients for continuous-time stochastic systems.
5 0.7284888 212 nips-2010-Predictive State Temporal Difference Learning
Author: Byron Boots, Geoffrey J. Gordon
Abstract: We propose a new approach to value function approximation which combines linear temporal difference reinforcement learning with subspace identification. In practical applications, reinforcement learning (RL) is complicated by the fact that state is either high-dimensional or partially observable. Therefore, RL methods are designed to work with features of state rather than state itself, and the success or failure of learning is often determined by the suitability of the selected features. By comparison, subspace identification (SSID) methods are designed to select a feature set which preserves as much information as possible about state. In this paper we connect the two approaches, looking at the problem of reinforcement learning with a large set of features, each of which may only be marginally useful for value function approximation. We introduce a new algorithm for this situation, called Predictive State Temporal Difference (PSTD) learning. As in SSID for predictive state representations, PSTD finds a linear compression operator that projects a large set of features down to a small set that preserves the maximum amount of predictive information. As in RL, PSTD then uses a Bellman recursion to estimate a value function. We discuss the connection between PSTD and prior approaches in RL and SSID. We prove that PSTD is statistically consistent, perform several experiments that illustrate its properties, and demonstrate its potential on a difficult optimal stopping problem. 1
6 0.72269011 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
7 0.71400416 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
8 0.70230287 152 nips-2010-Learning from Logged Implicit Exploration Data
9 0.69506508 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
10 0.6351428 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
11 0.56956488 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
12 0.56673592 233 nips-2010-Scrambled Objects for Least-Squares Regression
13 0.55523509 37 nips-2010-Basis Construction from Power Series Expansions of Value Functions
14 0.55220181 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
15 0.53272194 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
16 0.52479768 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
17 0.50050533 64 nips-2010-Distributionally Robust Markov Decision Processes
18 0.50001234 43 nips-2010-Bootstrapping Apprenticeship Learning
19 0.4772782 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories
20 0.47291067 148 nips-2010-Learning Networks of Stochastic Differential Equations
topicId topicWeight
[(13, 0.079), (14, 0.025), (17, 0.013), (27, 0.039), (30, 0.073), (35, 0.014), (37, 0.021), (45, 0.197), (50, 0.036), (51, 0.218), (52, 0.046), (60, 0.054), (77, 0.042), (90, 0.049)]
simIndex simValue paperId paperTitle
1 0.83294541 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
Author: Jason Lee, Ben Recht, Nathan Srebro, Joel Tropp, Ruslan Salakhutdinov
Abstract: The max-norm was proposed as a convex matrix regularizer in [1] and was shown to be empirically superior to the trace-norm for collaborative filtering problems. Although the max-norm can be computed in polynomial time, there are currently no practical algorithms for solving large-scale optimization problems that incorporate the max-norm. The present work uses a factorization technique of Burer and Monteiro [2] to devise scalable first-order algorithms for convex programs involving the max-norm. These algorithms are applied to solve huge collaborative filtering, graph cut, and clustering problems. Empirically, the new methods outperform mature techniques from all three areas. 1
same-paper 2 0.80512321 134 nips-2010-LSTD with Random Projections
Author: Mohammad Ghavamzadeh, Alessandro Lazaric, Odalric Maillard, Rémi Munos
Abstract: We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a highdimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm. 1
3 0.75280184 44 nips-2010-Brain covariance selection: better individual functional connectivity models using population prior
Author: Gael Varoquaux, Alexandre Gramfort, Jean-baptiste Poline, Bertrand Thirion
Abstract: Spontaneous brain activity, as observed in functional neuroimaging, has been shown to display reproducible structure that expresses brain architecture and carries markers of brain pathologies. An important view of modern neuroscience is that such large-scale structure of coherent activity reflects modularity properties of brain connectivity graphs. However, to date, there has been no demonstration that the limited and noisy data available in spontaneous activity observations could be used to learn full-brain probabilistic models that generalize to new data. Learning such models entails two main challenges: i) modeling full brain connectivity is a difficult estimation problem that faces the curse of dimensionality and ii) variability between subjects, coupled with the variability of functional signals between experimental runs, makes the use of multiple datasets challenging. We describe subject-level brain functional connectivity structure as a multivariate Gaussian process and introduce a new strategy to estimate it from group data, by imposing a common structure on the graphical model in the population. We show that individual models learned from functional Magnetic Resonance Imaging (fMRI) data using this population prior generalize better to unseen data than models based on alternative regularization schemes. To our knowledge, this is the first report of a cross-validated model of spontaneous brain activity. Finally, we use the estimated graphical model to explore the large-scale characteristics of functional architecture and show for the first time that known cognitive networks appear as the integrated communities of functional connectivity graph. 1
4 0.7339111 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
Author: Amir-massoud Farahmand, Csaba Szepesvári, Rémi Munos
Abstract: We address the question of how the approximation error/Bellman residual at each iteration of the Approximate Policy/Value Iteration algorithms influences the quality of the resulted policy. We quantify the performance loss as the Lp norm of the approximation error/Bellman residual at each iteration. Moreover, we show that the performance loss depends on the expectation of the squared Radon-Nikodym derivative of a certain distribution rather than its supremum – as opposed to what has been suggested by the previous results. Also our results indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and the effect of an error term in the earlier iterations decays exponentially fast. 1
5 0.73089802 265 nips-2010-The LASSO risk: asymptotic results and real world examples
Author: Mohsen Bayati, José Pereira, Andrea Montanari
Abstract: We consider the problem of learning a coefficient vector x0 ∈ RN from noisy linear observation y = Ax0 + w ∈ Rn . In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator x. In this case, a popular approach consists in solving an ℓ1 -penalized least squares problem known as the LASSO or Basis Pursuit DeNoising (BPDN). For sequences of matrices A of increasing dimensions, with independent gaussian entries, we prove that the normalized risk of the LASSO converges to a limit, and we obtain an explicit expression for this limit. Our result is the first rigorous derivation of an explicit formula for the asymptotic mean square error of the LASSO for random instances. The proof technique is based on the analysis of AMP, a recently developed efficient algorithm, that is inspired from graphical models ideas. Through simulations on real data matrices (gene expression data and hospital medical records) we observe that these results can be relevant in a broad array of practical applications.
6 0.73029953 117 nips-2010-Identifying graph-structured activation patterns in networks
7 0.72998524 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
8 0.72825181 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
9 0.72647524 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
10 0.72482854 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
11 0.7247138 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
12 0.72310692 148 nips-2010-Learning Networks of Stochastic Differential Equations
13 0.72304058 280 nips-2010-Unsupervised Kernel Dimension Reduction
14 0.72195309 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
15 0.72106808 233 nips-2010-Scrambled Objects for Least-Squares Regression
16 0.72101778 63 nips-2010-Distributed Dual Averaging In Networks
17 0.72019798 48 nips-2010-Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm
18 0.72019571 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
19 0.72018611 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
20 0.71983242 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models