nips nips2010 nips2010-160 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jeffrey Johns, Christopher Painter-wakefield, Ronald Parr
Abstract: Recent work in reinforcement learning has emphasized the power of L1 regularization to perform feature selection and prevent overfitting. We propose formulating the L1 regularized linear fixed point problem as a linear complementarity problem (LCP). This formulation offers several advantages over the LARS-inspired formulation, LARS-TD. The LCP formulation allows the use of efficient off-the-shelf solvers, leads to a new uniqueness result, and can be initialized with starting points from similar problems (warm starts). We demonstrate that warm starts, as well as the efficiency of LCP solvers, can speed up policy iteration. Moreover, warm starts permit a form of modified policy iteration that can be used to approximate a “greedy” homotopy path, a generalization of the LARS-TD homotopy path that combines policy evaluation and optimization.
Reference: text
sentIndex sentText sentNum sentScore
1 Recent work in reinforcement learning has emphasized the power of L1 regularization to perform feature selection and prevent overfitting. [sent-3, score-0.171]
2 We propose formulating the L1 regularized linear fixed point problem as a linear complementarity problem (LCP). [sent-4, score-0.288]
3 We demonstrate that warm starts, as well as the efficiency of LCP solvers, can speed up policy iteration. [sent-7, score-0.717]
4 Moreover, warm starts permit a form of modified policy iteration that can be used to approximate a “greedy” homotopy path, a generalization of the LARS-TD homotopy path that combines policy evaluation and optimization. [sent-8, score-1.896]
5 LARS-TD provides a homotopy method for finding the L1 regularized linear fixed point formulated by Kolter and Ng. [sent-15, score-0.34]
6 We reformulate the L1 regularized linear fixed point as a linear complementarity problem (LCP). [sent-16, score-0.266]
7 In addition, we can take advantage of the “warm start” capability of LCP solvers to produce algorithms that are better suited to the sequential nature of policy improvement than LARS-TD, which must start from scratch for each new policy. [sent-19, score-0.649]
8 We then review L1 regularization and feature selection for regression problems. [sent-21, score-0.107]
9 We defer discussion of L1 regularization and feature selection for reinforcement learning (RL) until section 3. [sent-23, score-0.148]
10 A policy π for M is a mapping from states to actions, $\pi : s \to a$, and the transition matrix induced by π is denoted $P^\pi$. [sent-28, score-0.536]
11 The value function at state s for policy π is the expected total γ-discounted reward for following π from s. [sent-30, score-0.523]
12 In matrix-vector form, this is written $V^\pi = T^\pi V^\pi = R + \gamma P^\pi V^\pi$, where $T^\pi$ is the Bellman operator for policy $\pi$ and $V^\pi$ is the fixed point of this operator. [sent-31, score-0.52]
13 Of the many algorithms that exist for finding $\pi^*$, policy iteration is most relevant to the presentation herein. [sent-33, score-0.628]
14 For any policy $\pi_j$, policy iteration computes $V^{\pi_j}$, then determines $\pi_{j+1}$ as the “greedy” policy with respect to $V^{\pi_j}$: $\pi_{j+1}(s) = \arg\max_a [R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, V^{\pi_j}(s')]$. [sent-34, score-1.583]
15 For an exact representation of each $V^{\pi_j}$, the algorithm will converge to an optimal policy and the unique, optimal value function $V^*$. [sent-36, score-0.491]
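The two steps above can be made concrete with a small tabular sketch; this is an illustrative Python implementation under standard MDP assumptions, and the names (policy_evaluation, policy_iteration, and the array layouts chosen for P and R) are ours, not the paper's.

```python
import numpy as np

def policy_evaluation(P_pi, R_pi, gamma):
    # Solve V = R_pi + gamma * P_pi V exactly: V = (I - gamma * P_pi)^{-1} R_pi.
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def policy_iteration(P, R, gamma, max_iter=100):
    # P: (A, S, S) transition tensor, R: (S, A) expected rewards.
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)
    V = np.zeros(S)
    for _ in range(max_iter):
        P_pi = P[pi, np.arange(S), :]                   # transition matrix induced by pi
        R_pi = R[np.arange(S), pi]
        V = policy_evaluation(P_pi, R_pi, gamma)        # exact evaluation
        Q = R + gamma * np.einsum('ast,t->sa', P, V)    # Q(s, a) for the greedy step
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            break
        pi = new_pi
    return pi, V
```

With function approximation, the exact evaluation step is replaced by fixed-point solutions such as the L1 regularized linear fixed point discussed below.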
16 2 L1 Regularization and Feature Selection in Regression In regression, the L1 regularized least squares problem is defined as: $w = \arg\min_{x \in \mathbb{R}^k} \frac{1}{2}\|\Phi x - y\|_2^2 + \beta \|x\|_1$, (1) where $y \in \mathbb{R}^n$ is the target function and $\beta \in \mathbb{R}_{\ge 0}$ is a regularization parameter. [sent-48, score-0.135]
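For a single value of β, equation (1) can be solved by coordinate descent with soft-thresholding; the sketch below is a minimal illustration (lasso_cd and its arguments are names of our choosing), not the LARS homotopy method discussed next.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_cd(Phi, y, beta, n_iters=200):
    # Coordinate descent for (1/2)||Phi x - y||_2^2 + beta ||x||_1.
    n, k = Phi.shape
    x = np.zeros(k)
    col_sq = (Phi ** 2).sum(axis=0)
    r = y - Phi @ x                           # current residual
    for _ in range(n_iters):
        for i in range(k):
            r += Phi[:, i] * x[i]             # remove coordinate i from the residual
            rho = Phi[:, i] @ r
            x[i] = soft_threshold(rho, beta) / max(col_sq[i], 1e-12)
            r -= Phi[:, i] * x[i]             # put the updated coordinate back
    return x
```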
17 This is a motivating factor for the least angle regression (LARS) algorithm [5], which can be thought of as a homotopy method for solving the Lasso for all nonnegative values of β. [sent-56, score-0.271]
18 We do not repeat the details of the algorithm here, but point out that this is easier than it might sound at first because the homotopy path in β-space is piecewise linear (with finitely many segments). [sent-57, score-0.338]
19 3 LCP and BLCP Given a square matrix M and a vector q, a linear complementarity problem (LCP) seeks vectors $w \ge 0$ and $z \ge 0$ with $w^T z = 0$ and $w = q + Mz$. [sent-62, score-0.157]
20 The bounded linear complementarity problem (BLCP) [6] includes box constraints on z. [sent-65, score-0.157]
21 The BLCP computes w and z where $w = q + Mz$ and each variable $z_i$ meets one of the following conditions: $z_i = u_i \implies w_i \le 0$ (2a); $z_i = l_i \implies w_i \ge 0$ (2b); $l_i < z_i < u_i \implies w_i = 0$ (2c); with bounds $-\infty \le l_i < u_i \le \infty$. [sent-66, score-0.144]
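Conditions (2a)–(2c) are easy to check mechanically for a candidate pair (w, z); the helper below is a minimal sketch with illustrative names (check_blcp, tol), not part of any solver referenced here.

```python
import numpy as np

def check_blcp(q, M, z, l, u, tol=1e-8):
    # Verify the BLCP conditions (2a)-(2c) for a candidate solution z.
    w = q + M @ z
    ok = True
    for zi, wi, li, ui in zip(z, w, l, u):
        if abs(zi - ui) <= tol:
            ok &= wi <= tol            # (2a) z_i = u_i  =>  w_i <= 0
        elif abs(zi - li) <= tol:
            ok &= wi >= -tol           # (2b) z_i = l_i  =>  w_i >= 0
        else:
            ok &= abs(wi) <= tol       # (2c) l_i < z_i < u_i  =>  w_i = 0
    return bool(ok)
```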
22 The columns of M can be thought of as the “active” columns and the procedure of swapping columns in and out of M can be thought of as a pivoting operation, analogous to pivots in the simplex algorithm. [sent-77, score-0.225]
23 In particular, they noted the solution to the minimization problem in equation (1) has the form $\underbrace{x}_{w} = \underbrace{(\Phi^T\Phi)^{-1}\Phi^T y}_{q} + \underbrace{(\Phi^T\Phi)^{-1}}_{M}\underbrace{(-c)}_{z}$, where the vector $-c$ follows the constraints in equation (2) with $l_i = -\beta$ and $u_i = \beta$. [sent-83, score-0.129]
24 Although we describe the equivalence between the BLCP and LARS optimality conditions using $M \equiv (\Phi^T\Phi)^{-1}$, the inverse can be computed inside the BLCP algorithm, and this operation is feasible and efficient because it is only done for the active columns of Φ. [sent-84, score-0.154]
25 Kim and Park [8] used a block pivoting algorithm, originally introduced by Júdice and Pires [6], for solving the Lasso. [sent-85, score-0.132]
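The mapping above can be packaged as BLCP data directly from Φ, y, and β; the sketch below is illustrative (lasso_as_blcp is our name) and forms the explicit inverse only for clarity, whereas, as noted, a practical BLCP algorithm would restrict this work to the active columns.

```python
import numpy as np

def lasso_as_blcp(Phi, y, beta):
    # Map the Lasso (1) to BLCP data: w = q + M z with z = -c and bounds l = -beta, u = beta.
    G = Phi.T @ Phi
    q = np.linalg.solve(G, Phi.T @ y)        # (Phi^T Phi)^{-1} Phi^T y
    M = np.linalg.inv(G)                     # (Phi^T Phi)^{-1}
    k = Phi.shape[1]
    l = -beta * np.ones(k)
    u = beta * np.ones(k)
    return q, M, l, u
```

The returned tuple could be handed to any (B)LCP solver, or to a checker such as the one sketched above.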
26 3 Previous Work Recent work has emphasized feature selection as an important problem in reinforcement learning [10, 11]. [sent-87, score-0.112]
27 An L1 regularized Bellman residual minimization algorithm was proposed by Loth et al. [sent-90, score-0.101]
28 This results in the following L1 regularized linear fixed point (L1 TD) problem: $w = \arg\min_{x \in \mathbb{R}^k} \frac{1}{2}\|\Phi x - (R + \gamma\Phi'_\pi w)\|_2^2 + \beta\|x\|_1$. (3) [sent-101, score-0.109]
29 Kolter and Ng derive a set of necessary and sufficient conditions characterizing the above fixed point in terms of β, w, and a vector c of correlations between the features and the Bellman residual $T^\pi\hat{V} - \hat{V}$. [sent-102, score-0.11]
30 Defining the active set $I$ (i.e., $I = \{i : w_i \ne 0\}$), the fixed point optimality conditions can be summarized as conditions C1–C2. [sent-106, score-0.127]
31 Successive solutions decrease $\bar\beta$ and are computed in closed form by determining the point at which a feature must be added or removed in order to further decrease $\bar\beta$ without violating one of the fixed point requirements. [sent-114, score-0.143]
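Assuming the correlation vector is $c = \Phi^T(R + \gamma\Phi'_\pi w - \Phi w)$ and that the conditions require $|c_i| \le \beta$ for all features, with $c_i = \beta\,\mathrm{sgn}(w_i)$ on the active set (our reading of C1 and C2; the exact statement is in the paper), a candidate fixed point can be checked as in the sketch below. All names are illustrative.

```python
import numpy as np

def l1_td_correlations(Phi, Phi_next, R, w, gamma):
    # Correlation of features with the Bellman residual: Phi^T (R + gamma Phi' w - Phi w).
    return Phi.T @ (R + gamma * Phi_next @ w - Phi @ w)

def check_l1_td_fixed_point(Phi, Phi_next, R, w, gamma, beta, tol=1e-6):
    # |c_i| <= beta everywhere; c_i = beta * sign(w_i) on the active set (w_i != 0).
    c = l1_td_correlations(Phi, Phi_next, R, w, gamma)
    active = np.abs(w) > tol
    cond_all = np.all(np.abs(c) <= beta + tol)
    cond_active = np.all(np.abs(c[active] - beta * np.sign(w[active])) <= tol)
    return bool(cond_all and cond_active)
```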
32 The fact that it traces an entire homotopy path can be quite helpful because it does not require committing to a particular value of β. [sent-119, score-0.285]
33 It is natural to employ LARS-TD in an iterative manner within the least squares policy iteration (LSPI) algorithm [17], as Kolter and Ng did. [sent-122, score-0.621]
34 When a new policy is selected in the policy iteration loop, LARS-TD must discard its solution from the previous policy and start an entirely new homotopy path, making the value of the homotopy path in this context not entirely clear. [sent-124, score-2.198]
35 One might cross-validate a choice of regularization parameter by measuring the performance of the final policy, but this requires guessing a value of β for all policies and then running LARS-TD up to this value for each policy. [sent-125, score-0.126]
36 4 The L1 Regularized Fixed Point as an LCP We show that the optimality conditions for the L1 TD fixed point correspond to the solution of a (B)LCP. [sent-127, score-0.136]
37 The L1 regularized linear fixed point is described by a vector of correlations c as defined in equation (4). [sent-129, score-0.134]
38 Thus, any appropriate solver for the BLCP$(A^{-1}b, A^{-1}, -\beta, \beta)$ can be thought of as a linear complementarity approach to solving for the L1 TD fixed point. [sent-136, score-0.244]
39 Proposition 1 If A is a P-matrix, then for any R, the L1 regularized linear fixed point exists, is unique, and will be found by a basic-set BLCP algorithm solving BLCP$(A^{-1}b, A^{-1}, -\beta, \beta)$. [sent-138, score-0.109]
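A sketch of the data handed to the BLCP solver in Proposition 1, assuming the standard LSTD quantities $A = \Phi^T(\Phi - \gamma\Phi'_\pi)$ and $b = \Phi^T R$ (an assumption on our part; the paper defines A and b precisely). Forming $A^{-1}$ explicitly is only for illustration.

```python
import numpy as np

def lc_td_blcp_data(Phi, Phi_next, R, gamma, beta):
    # L1 TD fixed point as BLCP(A^{-1} b, A^{-1}, -beta, beta), assuming LSTD-style A and b.
    A = Phi.T @ (Phi - gamma * Phi_next)
    b = Phi.T @ R
    q = np.linalg.solve(A, b)                # A^{-1} b
    M = np.linalg.inv(A)                     # A^{-1}
    k = Phi.shape[1]
    return q, M, -beta * np.ones(k), beta * np.ones(k)
```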
40 For example, the ability of many solvers to use a warm start during initialization offers a significant computational advantage over LARS-TD (which always begins with a null solution). [sent-145, score-0.384]
41 In the experimental section of this paper, we demonstrate that the ability to use warm starts during policy iteration can significantly improve computational efficiency. [sent-146, score-0.866]
42 5 Modified Policy Iteration using LARS-TD and LC-TD As mentioned in section 3, the advantages of LARS-TD as a homotopy method are less clear when it is used in a policy iteration loop since the homotopy path is traced only for specific policies. [sent-148, score-1.208]
43 It is possible to incorporate greedy policy improvements into the LARS-TD loop, leading to a homotopy path for greedy policies. [sent-149, score-0.97]
44 The greedy L1 regularized fixed point equation is: $w = \arg\min_{x \in \mathbb{R}^k} \frac{1}{2}\|\Phi x - \max_\pi(R + \gamma\Phi'_\pi w)\|_2^2 + \beta\|x\|_1$. [sent-150, score-0.207]
45 The current policy π is greedy with respect to the current solution. [sent-152, score-0.588]
46 It turns out that we can change policies and avoid violating the LARS-TD invariants if we make policy changes at points where applying the Bellman operator yields the same value for both the old policy ($\pi$) and the new policy ($\pi'$): $T^\pi \hat{V} = T^{\pi'} \hat{V}$. [sent-153, score-1.597]
47 When the above equation is satisfied, the residual $T\hat{V} - \hat{V}$, and therefore the correlation of features with the residual, is equal for both policies. [sent-155, score-0.149]
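For a linear estimate $\hat V = \Phi w$, this switching condition reduces to the Bellman backups of the two policies agreeing at the current weights; the sketch below is illustrative, with Phi_next_old and Phi_next_new standing for next-state feature matrices under the old and new policies.

```python
import numpy as np

def backups_agree(Phi_next_old, Phi_next_new, R, w, gamma, tol=1e-8):
    # A policy switch preserves the invariants only where T^pi V_hat == T^pi' V_hat,
    # i.e. the one-step backups of the two policies coincide at the current V_hat = Phi w.
    backup_old = R + gamma * Phi_next_old @ w
    backup_new = R + gamma * Phi_next_new @ w
    return np.allclose(backup_old, backup_new, atol=tol)
```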
48 When run to completion, LARQ provides a set of action-values that are the greedy fixed point for all settings of β. [sent-158, score-0.148]
49 In principle, this is more flexible than LARS-TD with policy iteration because it produces these results in a single run of the algorithm. [sent-159, score-0.623]
50 Even when A is not invertible, we can still use a BLCP solver as long as the principal submatrix of A associated with the active features is invertible. [sent-161, score-0.161]
51 LARS-TD enumerates every point at which the active set of features might change, a calculation that must be redone every time the active set changes. [sent-167, score-0.202]
52 LARQ must do this as well, but it must also enumerate all points at which the greedy policy can change. [sent-168, score-0.638]
53 For k features and n samples, LARS-TD must check O(k) points, but LARQ must check O(k + n) points. [sent-169, score-0.126]
54 Even though LARS-TD will run multiple times within a policy iteration loop, the number of such iterations will typically be far fewer than the number of training data points. [sent-170, score-0.623]
55 In practice, we have observed that LARQ runs several times slower than LARS-TD with policy iteration. [sent-171, score-0.491]
56 This occurs when the greedy policy for a particular β is not well defined. [sent-173, score-0.148]
57 In such cases, the algorithm attempts to switch to a new policy immediately following a policy change. [sent-174, score-0.982]
58 Looping is possible with most approximate policy iteration algorithms. [sent-176, score-0.601]
59 To address these limitations, we present a compromise between LARQ and LARS-TD with policy iteration. [sent-178, score-0.512]
60 It avoids the cost of continually checking for policy changes by updating the policy only at a fixed set of values, $\beta^{(1)}, \dots, \beta^{(m)}$. [sent-180, score-0.982]
61 At each $\beta^{(j)}$, the algorithm uses a policy iteration loop to (1) determine the current policy (greedy with respect to parameters $\hat w^{(j)}$), and (2) compute an approximate value function $\Phi w^{(j)}$ using LC-TD. [sent-187, score-1.183]
62 The policy iteration loop terminates when $w^{(j)} \approx \hat w^{(j)}$ or some predefined number of iterations is exceeded. [sent-188, score-0.713]
63 This use of LC-TD within a policy iteration loop will typically be quite fast because we can use the current feature set as a warm start. [sent-189, score-0.942]
64 The warm start is indicated in Algorithm 1 by $\mathrm{supp}(\hat w^{(j)})$, where the function supp determines the support, or active elements, in $\hat w^{(j)}$; many (B)LCP solvers can use this information for initialization. [sent-190, score-0.447]
65 Once the policy iteration loop terminates for point $\beta^{(j)}$, LC-MPI simply begins at the next point $\beta^{(j+1)}$ by initializing the weights with the previous solution, $\hat w^{(j+1)} \leftarrow w^{(j)}$. [sent-191, score-0.771]
66 As an alternative, we tested initializing $\hat w^{(j+1)}$ with the result of running LARS-TD with the greedy policy implicit in $w^{(j)}$ from the point $(\beta^{(j)}, w^{(j)})$ to $\beta^{(j+1)}$. [sent-193, score-0.638]
67 We can view LC-MPI as approximating LARQ’s homotopy path since the two algorithms agree for any $\beta^{(j)}$ reachable by LARQ. [sent-195, score-0.285]
68 By compromising between the greedy updates of LARQ and the pure policy evaluation methods of LARS-TD and LC-TD, LC-MPI can be thought of as a form of modified policy iteration [20]. [sent-197, score-1.229]
69 First, we used both LARS-TD and LC-TD within policy iteration. [sent-200, score-0.491]
70 These experiments, which were run using a single value of the L1 regularization parameter, show the benefit of warm starts for LC-TD. [sent-201, score-0.346]
71 A single run of LC-MPI results in greedy policies for multiple values of β, allowing the use of cross-validation to pick the best policy. [sent-203, score-0.165]
72 We show this is significantly more efficient than running policy iteration with either LARS-TD or LC-TD multiple times for different values of β. [sent-204, score-0.622]
73 Both types of experiments were conducted on the 20-state chain [17] and mountain car [21] domains, the same problems tested by Kolter and Ng [1]. [sent-206, score-0.146]
74 Algorithm 1 (LC-MPI): inputs are samples $\{(s_i, a_i, r_i, s'_i)\}_{i=1}^n$, features ϕ, discount γ, regularization values $\{\beta^{(j)}\}_{j=1}^m$ with $\beta^{(1)} \ge \dots \ge \beta^{(m)} \ge 0$, and $\epsilon \in \mathbb{R}^+$ and $T \in \mathbb{N}$, termination conditions for policy iteration. Initialization: $\Phi \leftarrow [\varphi(s_1, a_1) \dots \varphi(s_n, a_n)]^T$, $R \leftarrow [r_1 \dots r_n]^T$, $w^{(1)} \leftarrow 0$. For $j = 2$ to $m$: initialize with the previous solution, $\hat w^{(j)} \leftarrow w^{(j-1)}$; then loop (policy iteration): select greedy actions and form $\Phi'$, i.e. $\forall i:\ a'_i \leftarrow \arg\max_a \varphi(s'_i, a)^T \hat w^{(j)}$ and $\Phi' \leftarrow [\varphi(s'_1, a'_1) \dots \varphi(s'_n, a'_n)]^T$; solve the LC-TD problem using a (B)LCP solver with a warm start: $w^{(j)} \leftarrow \text{LC-TD}(\Phi, \Phi', R, \gamma, \beta^{(j)})$ with warm start $\mathrm{supp}(\hat w^{(j)})$; check for termination: if $\|w^{(j)} - \hat w^{(j)}\|_2 \le \epsilon$ or the number of iterations exceeds $T$, break the loop, else $\hat w^{(j)} \leftarrow w^{(j)}$. Return $\{w^{(j)}\}_{j=1}^m$. (In the mountain car domain, the goal is reached by building up momentum.) [sent-214, score-0.654] [sent-220, score-0.358] [sent-223, score-0.69]
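A Python sketch of the LC-MPI loop as reconstructed above; lc_td_solve stands in for an LC-TD solver (e.g., a warm-started (B)LCP routine) and its signature is an assumed interface, not code from the paper.

```python
import numpy as np

def lc_mpi(phi, S, A, R, S_next, gamma, betas, lc_td_solve, eps=1e-4, T=20):
    # `lc_td_solve(Phi, Phi_next, R, gamma, beta, warm)` is an assumed LC-TD solver interface
    # that accepts a warm-start support set. `betas` is assumed to be sorted in decreasing order.
    n = len(S)
    Phi = np.vstack([phi(S[i], A[i]) for i in range(n)])        # features of sampled (s, a) pairs
    k = Phi.shape[1]
    actions = sorted(set(A))
    ws = [np.zeros(k)]                                          # w^(1) = 0 at the largest beta
    for beta in betas[1:]:
        w_hat = ws[-1].copy()                                   # initialize with previous solution
        for _ in range(T):                                      # approximate policy iteration loop
            # greedy actions at the next states, and the induced next-state feature matrix
            a_next = [max(actions, key=lambda a: phi(S_next[i], a) @ w_hat) for i in range(n)]
            Phi_next = np.vstack([phi(S_next[i], a) for i, a in enumerate(a_next)])
            w = lc_td_solve(Phi, Phi_next, R, gamma, beta, warm=np.nonzero(w_hat)[0])
            if np.linalg.norm(w - w_hat) <= eps:                # terminate when w is close to w_hat
                break
            w_hat = w
        ws.append(w)
    return ws
```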
77 1 Policy Iteration To compare LARS-TD and LC-TD when employed within policy iteration, we recorded the number of steps used during each round of policy iteration, where a step corresponds to a change in the active feature set. [sent-229, score-1.141]
78 The computational complexity per step of each algorithm is similar; therefore, we used the average number of steps per policy as a metric for comparing the algorithms. [sent-230, score-0.525]
79 Policy iteration was run either until the solution converged or 15 rounds were exceeded. [sent-231, score-0.167]
80 The two algorithms performed similarly for the chain MDP, but LC-TD used significantly fewer steps for the mountain car MDP. [sent-234, score-0.18]
81 Figure 1 shows plots for the number of steps used for each round of policy iteration for a single (typical) trial. [sent-235, score-0.679]
82 Notice the declining trend for LC-TD; this is due to the warm starts requiring fewer steps to find a solution. [sent-236, score-0.299]
83 The plot for the chain MDP shows that LC-TD uses many more steps in the first round of policy iteration than does LARS-TD. [sent-237, score-0.718]
84 Lastly, in the trials shown in Figure 1, policy iteration using LC-TD converged in six iterations whereas it did not converge at all when using LARS-TD. [sent-238, score-0.601]
85 2 LC-MPI When LARS-TD and LC-TD are used as subroutines within policy iteration, the process ends at a single value of the L1 regularization parameter β. [sent-243, score-0.55]
86 The policy iteration loop must be rerun to consider different values of β. [sent-244, score-0.717]
87 In this section, we show how much computation can be saved by running LC-MPI once (to produce m greedy policies, each at a different value of β) versus running policy iteration m separate times. [sent-245, score-0.74]
88 The third column in Table 1 shows the average number of algorithm steps per policy for LC-MPI. [sent-246, score-0.525]
89 For LC-TD, note the decrease in steps due to warm starts. [sent-249, score-0.26]
90 7 Conclusions In this paper, we proposed formulating the L1 regularized linear fixed point problem as a linear complementarity problem. [sent-253, score-0.288]
91 Furthermore, we demonstrated that the “warm start” ability of LCP solvers can accelerate the computation of the L1 TD fixed point when initialized with the support set of a related problem. [sent-255, score-0.123]
92 This was found to be particularly effective for policy iteration problems when the set of active features does not change significantly from one policy to the next. [sent-256, score-1.183]
93 The difference between these algorithms is that LARQ incorporates greedy policy improvements inside the homotopy path. [sent-258, score-0.819]
94 The advantage of this “greedy” homotopy path is that it provides a set of action-values that are a greedy fixed point for all settings of the L1 regularization parameter. [sent-259, score-0.47]
95 The key to making LC-MPI efficient is the use of warm starts provided by an LCP algorithm. [sent-262, score-0.265]
96 An interesting question is whether there is a natural way to incorporate policy improvement directly within the LCP formulation. [sent-264, score-0.491]
97 Feature selection using regularization in approximate linear programs for Markov decision processes. [sent-286, score-0.126]
98 A block principal pivoting algorithm for large-scale strictly monotone linear complementarity problems. [sent-304, score-0.25]
99 An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. [sent-334, score-0.137]
100 Modified policy iteration algorithms for discounted Markov decision problems. [sent-386, score-0.62]
wordName wordTfidf (topN-words)
[('policy', 0.491), ('lcp', 0.456), ('blcp', 0.334), ('larq', 0.298), ('homotopy', 0.231), ('warm', 0.226), ('complementarity', 0.133), ('td', 0.118), ('kolter', 0.111), ('iteration', 0.11), ('lars', 0.106), ('greedy', 0.097), ('solvers', 0.094), ('loop', 0.091), ('pivoting', 0.07), ('dice', 0.062), ('regularization', 0.059), ('mountain', 0.058), ('active', 0.057), ('regularized', 0.056), ('path', 0.054), ('car', 0.049), ('ng', 0.048), ('solver', 0.047), ('loth', 0.046), ('johns', 0.046), ('parr', 0.046), ('bellman', 0.046), ('policies', 0.046), ('residual', 0.045), ('round', 0.044), ('mdp', 0.043), ('petrik', 0.042), ('invariants', 0.042), ('reinforcement', 0.041), ('optimality', 0.041), ('thought', 0.04), ('starts', 0.039), ('start', 0.039), ('chain', 0.039), ('appendix', 0.036), ('lstd', 0.036), ('violating', 0.036), ('lcps', 0.035), ('pires', 0.035), ('solution', 0.035), ('steps', 0.034), ('features', 0.034), ('xed', 0.033), ('reward', 0.032), ('supp', 0.031), ('uniqueness', 0.031), ('conditions', 0.031), ('ci', 0.031), ('kim', 0.031), ('rbfs', 0.031), ('point', 0.029), ('presentation', 0.027), ('mahadevan', 0.027), ('lc', 0.027), ('farahmand', 0.027), ('wi', 0.026), ('rk', 0.026), ('park', 0.026), ('columns', 0.025), ('bene', 0.025), ('offers', 0.025), ('actions', 0.025), ('equation', 0.025), ('must', 0.025), ('mdps', 0.025), ('lasso', 0.025), ('feature', 0.024), ('linear', 0.024), ('selection', 0.024), ('temporal', 0.024), ('emphasized', 0.023), ('permit', 0.023), ('rhs', 0.023), ('si', 0.023), ('principal', 0.023), ('formulating', 0.022), ('termination', 0.022), ('episodes', 0.022), ('li', 0.022), ('run', 0.022), ('unique', 0.022), ('ui', 0.022), ('check', 0.021), ('limitations', 0.021), ('terminates', 0.021), ('compromise', 0.021), ('running', 0.021), ('pursuit', 0.02), ('sgn', 0.02), ('transition', 0.02), ('guess', 0.02), ('squares', 0.02), ('decision', 0.019), ('radial', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
Author: Jeffrey Johns, Christopher Painter-wakefield, Ronald Parr
Abstract: Recent work in reinforcement learning has emphasized the power of L1 regularization to perform feature selection and prevent overfitting. We propose formulating the L1 regularized linear fixed point problem as a linear complementarity problem (LCP). This formulation offers several advantages over the LARS-inspired formulation, LARS-TD. The LCP formulation allows the use of efficient off-theshelf solvers, leads to a new uniqueness result, and can be initialized with starting points from similar problems (warm starts). We demonstrate that warm starts, as well as the efficiency of LCP solvers, can speed up policy iteration. Moreover, warm starts permit a form of modified policy iteration that can be used to approximate a “greedy” homotopy path, a generalization of the LARS-TD homotopy path that combines policy evaluation and optimization. 1
2 0.35325781 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
Author: Tang Jie, Pieter Abbeel
Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1
3 0.31986234 152 nips-2010-Learning from Logged Implicit Exploration Data
Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade
Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.
4 0.31654754 208 nips-2010-Policy gradients in linearly-solvable MDPs
Author: Emanuel Todorov
Abstract: We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy gradients for continuous-time stochastic systems.
5 0.27997839 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
Author: Umar Syed, Robert E. Schapire
Abstract: We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ǫ, the difference between the √ value of the apprentice’s policy and the expert’s policy is O( ǫ). Further, we prove that this difference is only O(ǫ) when the expert’s policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain. 1
6 0.27023819 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
7 0.25278929 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
8 0.25089094 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
9 0.24637465 134 nips-2010-LSTD with Random Projections
10 0.18007798 43 nips-2010-Bootstrapping Apprenticeship Learning
11 0.15335679 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
12 0.14203064 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
13 0.13908429 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
14 0.13372165 212 nips-2010-Predictive State Temporal Difference Learning
15 0.12322611 269 nips-2010-Throttling Poisson Processes
16 0.12205918 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
17 0.11485062 64 nips-2010-Distributionally Robust Markov Decision Processes
18 0.10981479 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
19 0.10172562 4 nips-2010-A Computational Decision Theory for Interactive Assistants
20 0.087924376 229 nips-2010-Reward Design via Online Gradient Ascent
topicId topicWeight
[(0, 0.222), (1, -0.442), (2, -0.045), (3, -0.043), (4, 0.041), (5, -0.108), (6, 0.08), (7, 0.023), (8, 0.015), (9, -0.22), (10, 0.013), (11, -0.002), (12, -0.003), (13, 0.059), (14, -0.028), (15, -0.02), (16, 0.099), (17, -0.021), (18, -0.02), (19, -0.024), (20, 0.024), (21, 0.013), (22, 0.029), (23, 0.016), (24, -0.048), (25, -0.036), (26, 0.008), (27, -0.0), (28, -0.063), (29, 0.019), (30, -0.022), (31, -0.028), (32, -0.017), (33, -0.01), (34, 0.044), (35, -0.024), (36, -0.04), (37, 0.004), (38, -0.017), (39, 0.046), (40, 0.024), (41, -0.014), (42, -0.022), (43, -0.013), (44, -0.002), (45, 0.008), (46, 0.012), (47, -0.06), (48, -0.02), (49, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.96467608 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
Author: Jeffrey Johns, Christopher Painter-wakefield, Ronald Parr
Abstract: Recent work in reinforcement learning has emphasized the power of L1 regularization to perform feature selection and prevent overfitting. We propose formulating the L1 regularized linear fixed point problem as a linear complementarity problem (LCP). This formulation offers several advantages over the LARS-inspired formulation, LARS-TD. The LCP formulation allows the use of efficient off-theshelf solvers, leads to a new uniqueness result, and can be initialized with starting points from similar problems (warm starts). We demonstrate that warm starts, as well as the efficiency of LCP solvers, can speed up policy iteration. Moreover, warm starts permit a form of modified policy iteration that can be used to approximate a “greedy” homotopy path, a generalization of the LARS-TD homotopy path that combines policy evaluation and optimization. 1
2 0.91851926 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
Author: Tang Jie, Pieter Abbeel
Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1
3 0.89432877 208 nips-2010-Policy gradients in linearly-solvable MDPs
Author: Emanuel Todorov
Abstract: We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy gradients for continuous-time stochastic systems.
4 0.89167762 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
Author: Amir-massoud Farahmand, Csaba Szepesvári, Rémi Munos
Abstract: We address the question of how the approximation error/Bellman residual at each iteration of the Approximate Policy/Value Iteration algorithms influences the quality of the resulted policy. We quantify the performance loss as the Lp norm of the approximation error/Bellman residual at each iteration. Moreover, we show that the performance loss depends on the expectation of the squared Radon-Nikodym derivative of a certain distribution rather than its supremum – as opposed to what has been suggested by the previous results. Also our results indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and the effect of an error term in the earlier iterations decays exponentially fast. 1
5 0.87550902 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
Author: Umar Syed, Robert E. Schapire
Abstract: We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ǫ, the difference between the √ value of the apprentice’s policy and the expert’s policy is O( ǫ). Further, we prove that this difference is only O(ǫ) when the expert’s policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain. 1
6 0.87378109 152 nips-2010-Learning from Logged Implicit Exploration Data
7 0.84802049 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
8 0.76746845 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
9 0.703677 134 nips-2010-LSTD with Random Projections
10 0.67815185 43 nips-2010-Bootstrapping Apprenticeship Learning
11 0.66102839 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
12 0.61426157 212 nips-2010-Predictive State Temporal Difference Learning
13 0.59689677 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
14 0.57049394 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories
15 0.51929092 64 nips-2010-Distributionally Robust Markov Decision Processes
16 0.49129179 269 nips-2010-Throttling Poisson Processes
17 0.47102368 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
18 0.45413566 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
19 0.42519295 4 nips-2010-A Computational Decision Theory for Interactive Assistants
20 0.42351404 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
topicId topicWeight
[(13, 0.038), (17, 0.02), (27, 0.042), (30, 0.053), (35, 0.034), (42, 0.281), (45, 0.227), (50, 0.047), (51, 0.014), (52, 0.029), (60, 0.034), (77, 0.054), (90, 0.031)]
simIndex simValue paperId paperTitle
1 0.83807439 110 nips-2010-Guaranteed Rank Minimization via Singular Value Projection
Author: Prateek Jain, Raghu Meka, Inderjit S. Dhillon
Abstract: Minimizing the rank of a matrix subject to affine constraints is a fundamental problem with many important applications in machine learning and statistics. In this paper we propose a simple and fast algorithm SVP (Singular Value Projection) for rank minimization under affine constraints (ARMP) and show that SVP recovers the minimum rank solution for affine constraints that satisfy a restricted isometry property (RIP). Our method guarantees geometric convergence rate even in the presence of noise and requires strictly weaker assumptions on the RIP constants than the existing methods. We also introduce a Newton-step for our SVP framework to speed-up the convergence with substantial empirical gains. Next, we address a practically important application of ARMP - the problem of lowrank matrix completion, for which the defining affine constraints do not directly obey RIP, hence the guarantees of SVP do not hold. However, we provide partial progress towards a proof of exact recovery for our algorithm by showing a more restricted isometry property and observe empirically that our algorithm recovers low-rank incoherent matrices from an almost optimal number of uniformly sampled entries. We also demonstrate empirically that our algorithms outperform existing methods, such as those of [5, 18, 14], for ARMP and the matrix completion problem by an order of magnitude and are also more robust to noise and sampling schemes. In particular, results show that our SVP-Newton method is significantly robust to noise and performs impressively on a more realistic power-law sampling scheme for the matrix completion problem. 1
same-paper 2 0.79112917 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
Author: Jeffrey Johns, Christopher Painter-wakefield, Ronald Parr
Abstract: Recent work in reinforcement learning has emphasized the power of L1 regularization to perform feature selection and prevent overfitting. We propose formulating the L1 regularized linear fixed point problem as a linear complementarity problem (LCP). This formulation offers several advantages over the LARS-inspired formulation, LARS-TD. The LCP formulation allows the use of efficient off-theshelf solvers, leads to a new uniqueness result, and can be initialized with starting points from similar problems (warm starts). We demonstrate that warm starts, as well as the efficiency of LCP solvers, can speed up policy iteration. Moreover, warm starts permit a form of modified policy iteration that can be used to approximate a “greedy” homotopy path, a generalization of the LARS-TD homotopy path that combines policy evaluation and optimization. 1
3 0.78471607 202 nips-2010-Parallelized Stochastic Gradient Descent
Author: Martin Zinkevich, Markus Weimer, Lihong Li, Alex J. Smola
Abstract: With the increase in available data parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms [5, 7] our variant comes with parallel acceleration guarantees and it poses no overly tight latency constraints, which might only be available in the multicore setting. Our analysis introduces a novel proof technique — contractive mappings to quantify the speed of convergence of parameter distributions to their asymptotic limits. As a side effect this answers the question of how quickly stochastic gradient descent algorithms reach the asymptotically normal regime [1, 8]. 1
4 0.66965324 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
Author: Jean Morales, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning a sparse linear regression vector under additional conditions on the structure of its sparsity pattern. We present a family of convex penalty functions, which encode this prior knowledge by means of a set of constraints on the absolute values of the regression coefficients. This family subsumes the ℓ1 norm and is flexible enough to include different models of sparsity patterns, which are of practical and theoretical importance. We establish some important properties of these functions and discuss some examples where they can be computed explicitly. Moreover, we present a convergent optimization algorithm for solving regularized least squares with these penalty functions. Numerical simulations highlight the benefit of structured sparsity and the advantage offered by our approach over the Lasso and other related methods.
5 0.66898829 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
Author: Sofia Mosci, Silvia Villa, Alessandro Verri, Lorenzo Rosasco
Abstract: We deal with the problem of variable selection when variables must be selected group-wise, with possibly overlapping groups defined a priori. In particular we propose a new optimization procedure for solving the regularized algorithm presented in [12], where the group lasso penalty is generalized to overlapping groups of variables. While in [12] the proposed implementation requires explicit replication of the variables belonging to more than one group, our iterative procedure is based on a combination of proximal methods in the primal space and projected Newton method in a reduced dual space, corresponding to the active groups. This procedure provides a scalable alternative with no need for data duplication, and allows to deal with high dimensional problems without pre-processing for dimensionality reduction. The computational advantages of our scheme with respect to state-of-the-art algorithms using data duplication are shown empirically with numerical simulations. 1
6 0.66889548 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
7 0.66879135 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
8 0.66847283 63 nips-2010-Distributed Dual Averaging In Networks
9 0.66656888 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
10 0.66647142 148 nips-2010-Learning Networks of Stochastic Differential Equations
11 0.66579235 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
12 0.66563219 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
13 0.66461408 290 nips-2010-t-logistic regression
14 0.6641466 287 nips-2010-Worst-Case Linear Discriminant Analysis
15 0.66388589 224 nips-2010-Regularized estimation of image statistics by Score Matching
16 0.6635021 282 nips-2010-Variable margin losses for classifier design
17 0.66337615 158 nips-2010-Learning via Gaussian Herding
18 0.66330504 181 nips-2010-Network Flow Algorithms for Structured Sparsity
19 0.66309834 257 nips-2010-Structured Determinantal Point Processes
20 0.66284293 155 nips-2010-Learning the context of a category