nips nips2011 nips2011-65 knowledge-graph by maker-knowledge-mining

65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation


Source: pdf

Author: Daniel J. Lizotte

Abstract: Fitted value iteration (FVI) with ordinary least squares regression is known to diverge. We present a new method, “Expansion-Constrained Ordinary Least Squares” (ECOLS), that produces a linear approximation but also guarantees convergence when used with FVI. To ensure convergence, we constrain the least squares regression operator to be a non-expansion in the ∞-norm. We show that the space of function approximators that satisfy this constraint is more rich than the space of “averagers,” we prove a minimax property of the ECOLS residual error, and we give an efficient algorithm for computing the coefficients of ECOLS based on constraint generation. We illustrate the algorithmic convergence of FVI with ECOLS in a suite of experiments, and discuss its properties. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Fitted value iteration (FVI) with ordinary least squares regression is known to diverge. [sent-4, score-0.212]

2 We present a new method, “Expansion-Constrained Ordinary Least Squares” (ECOLS), that produces a linear approximation but also guarantees convergence when used with FVI. [sent-5, score-0.08]

3 To ensure convergence, we constrain the least squares regression operator to be a non-expansion in the ∞-norm. [sent-6, score-0.204]

4 We show that the space of function approximators that satisfy this constraint is more rich than the space of “averagers,” we prove a minimax property of the ECOLS residual error, and we give an efficient algorithm for computing the coefficients of ECOLS based on constraint generation. [sent-7, score-0.199]

5 We illustrate the algorithmic convergence of FVI with ECOLS in a suite of experiments, and discuss its properties. [sent-8, score-0.082]

6 Fitted value iteration (FVI), both in the model-based [4] and model-free [5, 15, 16, 17] settings, has become a method of choice for various applied batch reinforcement learning problems. [sent-9, score-0.196]

7 However, it is known that depending on the function approximation scheme used, fitted value iteration can and does diverge in some settings. [sent-10, score-0.149]

8 This is particularly problematic—and easy to illustrate—when using linear regression as the function approximator. [sent-11, score-0.063]

9 Further interest in batch RL methods then led to work that uses non-parametric function approximators with FVI to avoid divergence [5, 15, 16, 17]. [sent-14, score-0.101]

10 This has left a gap in the “middle ground” of function approximator choices that guarantee convergence; we would like to have a function approximator that is more flexible than the averagers but more easily interpreted than the non-parametric approximators. [sent-15, score-0.141]

11 In many scientific applications, linear regression is a natural choice because of its simplicity and interpretability when used with a small set of scientifically meaningful state features. [sent-16, score-0.088]

12 This enables scientists to interpret the parameters of an optimal learned value function as evidence for or against the importance of these features. [sent-18, score-0.056]

13 Thus for this work, we restrict our attention to linear function approximation, and ensure algorithmic convergence to a fixed point regardless of the generative model of the data. [sent-19, score-0.078]

14 This is in contrast to previous work that explores how properties of the underlying MDP and properties of the function approximation space jointly influence convergence of the algorithm [1, 14, 6]. [sent-20, score-0.061]

15 Our aim is to develop a variant of linear regression that, when used in a fitted value iteration algorithm, guarantees convergence of the algorithm to a fixed point. [sent-21, score-0.18]

16 Our approach is to constrain the regression operator to be a non-expansion in the ∞-norm. [sent-23, score-0.168]

17 We show that the space of function approximators that satisfy this property is more rich than the space of averagers [8], and we prove a minimax property on the residual error of the approximator. [sent-24, score-0.172]

18 2) We give an efficient algorithm for computing the coefficients of ECOLS based on quadratic programming with constraint generation. [sent-25, score-0.087]

19 3) We verify the algorithmic convergence of fitted value iteration with ECOLS in a suite of experiments and discuss its performance. [sent-26, score-0.162]

20 Finally, we discuss future directions of research and comment on the general problem of learning an interpretable value function and policy from fitted value iteration. [sent-27, score-0.15]

21 We assume an MDP with actions a ∈ {1, . . . , |A|}, state transition matrices P^(a) ∈ R^{n×n} for each action, a deterministic reward vector r ∈ R^n, and a discount factor γ < 1. [sent-34, score-0.082]

22 The “Bellman optimality” operator or “Dynamic Programming” operator T is given by (T v)_i = r_i + max_a γ P^(a)_{i,:} v. [sent-36, score-0.232]

23 From v∗ we can recover a policy π_i = argmax_a [ r_i + γ P^(a)_{i,:} v∗ ] that has v∗ as its value function. [sent-38, score-0.146]

24 An analogous operator K can be defined for the state-action value function Q ∈ R^{n×|A|}. [sent-39, score-0.134]

25 (KQ)_{i,j} = r_i + γ P^(j)_{i,:} max_a Q_{:,a} (2). The fixed point of K is the optimal state-action value Q∗ which satisfies KQ∗ = Q∗. [sent-40, score-0.06]

26 The value iteration algorithm proceeds by starting with an initial v or Q, and applying T or K repeatedly until convergence, which is guaranteed because both T and K are contraction mappings in the infinity norm [8], as we discuss further below. [sent-41, score-0.126]
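
To make the operators above concrete, here is a minimal NumPy sketch of exact value iteration; the (|A|, n, n) layout of P and the function name are illustrative assumptions, not the paper's code.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Apply the Bellman optimality operator T until convergence.

    P: array of shape (|A|, n, n), one transition matrix per action.
    r: reward vector of length n.  gamma: discount factor < 1.
    """
    v = np.zeros(len(r))
    while True:
        v_new = r + gamma * np.max(P @ v, axis=0)   # (T v)_i = r_i + max_a gamma * P^(a)_{i,:} v
        if np.max(np.abs(v_new - v)) < tol:         # T is an inf-norm contraction, so this terminates
            break
        v = v_new
    policy = np.argmax(P @ v, axis=0)               # greedy policy recovered from v* (reward is action-independent here)
    return v, policy
```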

27 However K in particular is easily adapted to the case of a batch of n tuples of the form (s_i, a_i, r_i, s′_i) obtained by interaction with the system [5, 15, 16, 17]. [sent-43, score-0.098]

28 In this case, Q is only evaluated at states in our data set, and in MDPs with continuous state, the number of tuples n is analogous from a computational point of view to the size of our state space. [sent-44, score-0.077]
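
In that batch setting, applying K reduces to computing regression targets at the sampled tuples. A minimal sketch with a tabular Q and a hypothetical helper name, assuming integer state indices:

```python
import numpy as np

def batch_q_targets(rewards, next_states, Q, gamma):
    """Targets r_i + gamma * max_a Q(s'_i, a) for a batch of tuples (s_i, a_i, r_i, s'_i)."""
    return np.asarray(rewards) + gamma * Q[np.asarray(next_states)].max(axis=1)
```

With continuous states, Q[next_states] would be replaced by evaluating the current function approximator at the sampled next states.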

29 Fitted value iteration [5, 15, 16, 17] (FVI) interleaves either T or K above with a function approximation operator M . [sent-45, score-0.206]

30 For example in the model-based case, the composed operator (M ◦ T) is applied repeatedly to an initial guess v^0. [sent-46, score-0.102]
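
A minimal model-based sketch of this composition; the `fit` argument stands in for the approximation operator M, and its interface is an assumption made for illustration:

```python
import numpy as np

def fitted_value_iteration(P, r, gamma, fit, n_iters=100):
    """Repeatedly apply (M o T) to an initial guess v^0."""
    v = np.zeros(len(r))
    for _ in range(n_iters):
        Tv = r + gamma * np.max(P @ v, axis=0)   # Bellman backup (T v)
        v = fit(Tv)                              # function approximation step, v <- M(T v)
    return v
```

For instance, fit = lambda y: X @ np.linalg.lstsq(X, y, rcond=None)[0] would be the ordinary least squares smoother, the choice that can diverge; ECOLS replaces it with a constrained alternative.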

31 FVI has become increasingly popular especially in the field of “batch-mode Reinforcement Learning” [13, 7] where a policy is learned from a fixed batch of data that was collected by a prior agent. [sent-47, score-0.154]

32 The main advantage of fitted value iteration is that the cost of computing (M ◦ T) can be much lower than n in cases where the approximator M only requires computation of elements of (T v)_i for a small subset of the state space. [sent-51, score-0.143]

33 For example, if M were linear regression and a particular state feature had a positive coefficient in the learned value function, we know that larger values of that state feature are preferable. [sent-56, score-0.169]

34 Linear models are of importance because of their ease of interpretation, but unfortunately, ordinary least squares (OLS) function approximation can cause the successive iterations of FVI to fail to converge. [sent-57, score-0.112]

35 We now examine properties of the approximation operator M that control the algorithmic convergence of FVI. [sent-58, score-0.185]

36 We say M is a linear operator if My + My′ = M(y + y′) ∀ y, y′ ∈ R^p and M0 = 0. [sent-59, score-0.121]

37 Any linear operator can be represented by a p × p matrix of real numbers. [sent-60, score-0.147]

38 A noisy reward signal does not alter the analyses that follow, nor does dependence of the reward on action. [sent-61, score-0.07]

39 By definition, an operator M is a γ-contraction in the q-norm if ∃ γ ≤ 1 s.t. ||My − My′||_q ≤ γ ||y − y′||_q for all y, y′. [sent-62, score-0.102]

40 The operator norm of M induced by the q-norm can be defined in several ways, including ||M||_op(q) = max_{y ≠ 0} ||My||_q / ||y||_q. [sent-66, score-0.151]

41 A linear operator M is a γ-contraction in the q-norm if and only if ||M||_op(q) ≤ γ. [sent-68, score-0.121]

42 Lemma 1 implies that a linear operator M is a non-expansion in the ∞-norm only if ||M||_op(∞) ≤ 1 (11), which is equivalent [18] to max_i Σ_j |m_ij| ≤ 1 (12). [sent-74, score-0.121]

43 The set of all linear operators that satisfy (12) is exactly the set of linear operators that are non-expansions in the ∞-norm. [sent-75, score-0.122]

44 One subset of operators on R^p that are guaranteed to be non-expansions in the ∞-norm are the averagers, as defined by Gordon [8]. [sent-76, score-0.061]

45 The set of all linear operators that satisfy (12) is larger than the set of averagers. [sent-78, score-0.061]

46 For M to be an averager, it must satisfy m_ij ≥ 0 ∀ i, j (13) and max_i Σ_j m_ij ≤ 1 (14). [sent-80, score-0.062]

47 These constraints are stricter than (12), because they impose an additional non-negativity constraint on the elements of M. [sent-81, score-0.07]

48 It is well-known [8] that if such an M is used as a function approximator in fitted value iteration, the algorithm is guaranteed to converge from any starting point because the composition M ◦ T is a γ-contraction in the ∞-norm. [sent-83, score-0.089]
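
To make conditions (12)-(14) concrete, the checks below compare the ∞-norm non-expansion constraint with the stricter averager constraints; the example matrix is purely illustrative:

```python
import numpy as np

def is_inf_norm_nonexpansion(M, tol=1e-12):
    """Condition (12): max_i sum_j |m_ij| <= 1, i.e. ||M||_op(inf) <= 1."""
    return np.abs(M).sum(axis=1).max() <= 1 + tol

def is_averager(M, tol=1e-12):
    """Conditions (13)-(14): non-negative entries and maximum row sum at most 1."""
    return (M >= -tol).all() and M.sum(axis=1).max() <= 1 + tol

M = np.array([[0.5, -0.4],
              [0.2,  0.7]])
print(is_inf_norm_nonexpansion(M))   # True: absolute row sums are 0.9
print(is_averager(M))                # False: the negative entry violates (13)
```

The matrix above is a non-expansion but not an averager, illustrating that the set defined by (12) is strictly larger.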

49 The original definition of an averager was an operator of the form y → Ay + b for a constant vector b. [sent-84, score-0.167]

50 Suppose X is an n × p design matrix with n > p and rank(X) = p, and suppose y is a vector of regression targets. [sent-87, score-0.07]

51 Note that Hy is the projection of y onto the column space of X, and that H has 1 as an eigenvalue with multiplicity rank(X), and 0 as an eigenvalue with multiplicity (n − rank(X)). [sent-94, score-0.082]

52 It is known [18] that for a linear operator M, ||M||_op(2) is given by the largest singular value of M. [sent-95, score-0.153]

53 The ∞-norm expansion property of H is problematic when using linear function approximation for fitted value iteration, as we described earlier. [sent-98, score-0.119]
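
A quick numerical check of this point, with a random design chosen only for illustration: the OLS hat matrix has operator 2-norm equal to 1, but its induced ∞-norm (the maximum absolute row sum) typically exceeds 1:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(8, 2)                      # n x p design matrix with n > p
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix H = X (X^T X)^{-1} X^T
print(np.linalg.norm(H, 2))                    # largest singular value: 1, since H is a projection
print(np.abs(H).sum(axis=1).max())             # ||H||_op(inf): usually greater than 1, so H can expand
```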

54 If one wants to use linear regression safely within a value-iteration algorithm, it is natural to consider constraining the least-squares problem so that the resulting hat matrix is an ∞-norm non-expansion. [sent-99, score-0.189]

55 The symmetric matrix W is of size p × p, so we have a quadratic objective with a convex norm constraint on XWX^T, resulting in a hat matrix H̄ = XW̄X^T. [sent-103, score-0.238]

56 However, unlike the OLS hat matrix H = X(X^T X)^{−1} X^T, the matrix H̄ depends on the targets y. [sent-106, score-0.169]

57 Because of the non-linearity, the operator y ↦ H̄y resulting from the minimization in (18) (we write this operator as H̄) can in fact be an expansion in the ∞-norm despite the constraints. [sent-109, score-0.223]

58 We now show how we might remove the dependence on y from (18) so that the resulting operator is a linear non-expansion in the op(∞)-norm. [sent-110, score-0.121]

59 subject to ||XWX^T||_op(∞) ≤ 1, ||z||_2 = c, W ∈ R^{p×p}, W = W^T, z ∈ R^n. Intuitively, the resulting W̌ is a linear operator of the form XW̌X^T that minimizes the squared error between its approximation ž = XW̌X^T z and the worst-case (bounded) targets z. [sent-113, score-0.192]
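
The objective of this problem was lost in extraction; read together with the constraints just listed, a plausible reconstruction of (19), inferred from the surrounding sentences rather than quoted from the paper, is W̌ = argmin_{W = W^T} max_{||z||_2 = c} ||XWX^T z − z||_2^2 subject to ||XWX^T||_op(∞) ≤ 1.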

60 The constraint ||XWX^T||_op(∞) ≤ 1 does not depend on the regression targets y, so the corresponding Ȟ is a linear operator; the constraint is effectively a regularizer on the coefficients of the hat matrix which will tend to shrink the fitted values XW̌X^T y toward zero. [sent-115, score-0.233]

61 By symmetry of W , write XW X T = U DU T where D is a diagonal matrix whose diagonal entries dii are the eigenvalues of XW X T and U is an orthonormal matrix. [sent-123, score-0.18]

62 We therefore have XWX^T − I = UDU^T − I = UDU^T − UIU^T = U(D − I)U^T (22). Therefore ||XWX^T − I||_op(2) = max_i |d_ii − 1|, which is the largest singular value of XWX^T − I. [sent-124, score-0.062]

63 Furthermore we know that rank(XWX^T) ≤ p and that therefore at least n − p of the d_ii are zero. [sent-125, score-0.136]

64 For any symmetric positive definite matrix W that satisfies the constraints in (19) and any n × p design matrix X s.t. rank(X) = p, ||XWX^T − I||_op(2) = 1. [sent-128, score-0.076]

65 We know H is positive semidefinite because W is assumed to be positive semi-definite; therefore d_ii ≥ 0. [sent-133, score-0.136]

66 Because rank(XWX^T) ≤ p, we know that there exists an i such that d_ii = 0, and because we have shown that 0 ≤ d_ii ≤ 1, it follows that max_i |d_ii − 1| = 1, and therefore ||XWX^T − I||_op(2) = 1. [sent-140, score-0.302]

67 We therefore solve the following optimization problem, which has a unique solution, shows good empirical performance, and yet still provides the minimax property guaranteed by Theorem 1 when the optimal matrix is positive semi-definite. [sent-143, score-0.075]

68 subject to ||XWX^T||_op(∞) ≤ 1, ||z||_2 = c, W ∈ R^{p×p}, W = W^T, z ∈ R^n. Intuitively, this objective searches for a W̃ such that linear approximation using XW̃X^T is as close as possible to the OLS approximation, for the worst-case regression targets, according to the 2-norm. [sent-146, score-0.109]

69 Therefore, we minimize the quadratic objective ||XWX^T − H||_F^2 subject to the same convex constraints, which is easier to solve than (21). [sent-152, score-0.061]

70 The constraint ||XWX^T||_op(∞) ≤ 1 can be expressed as the set of constraints Σ_{j=1}^{n} |X_{i,:} W X_{j,:}^T| ≤ 1, i = 1, . . . , n. [sent-158, score-0.07]

71 Each of these linear constraints involves a vector k with entries {+1, −1} multiplied by a row of XWX^T. [sent-163, score-0.08]

72 We solve a sequence of quadratic programs, adding the most violated linear constraint after each step. [sent-168, score-0.086]

73 The most violated constraint is given by the row i∗ = argmax_{i ∈ 1,...,n} Σ_{j=1}^{n} |X_{i,:} W X_{j,:}^T|. [sent-169, score-0.065]
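
A sketch of this constraint-generation loop using the Frobenius-norm objective and CVXPY; the helper name, tolerance, and round limit are assumptions, and this is a simplified reading of the algorithm rather than the authors' implementation (which used CPLEX):

```python
import numpy as np
import cvxpy as cp

def ecols_weights(X, tol=1e-6, max_rounds=50):
    """Minimize ||X W X^T - H||_F^2 s.t. ||X W X^T||_op(inf) <= 1 by constraint generation."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)            # OLS hat matrix
    W = cp.Variable((p, p), symmetric=True)
    A = X @ W @ X.T                                  # affine in W
    objective = cp.Minimize(cp.sum_squares(A - H))
    constraints = []
    for _ in range(max_rounds):
        cp.Problem(objective, constraints).solve()
        M = X @ W.value @ X.T
        row_sums = np.abs(M).sum(axis=1)
        i_star = int(np.argmax(row_sums))            # most violated row
        if row_sums[i_star] <= 1 + tol:              # op(inf) constraint holds for every row
            break
        k = np.sign(M[i_star, :])                    # sign pattern of the violating row
        constraints.append(cp.sum(cp.multiply(k, A[i_star, :])) <= 1)
    return W.value
```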

74 Note that batch fitted value iteration performs many regressions where the targets y change from iteration to iteration, but the design matrix X is fixed. [sent-176, score-0.265]
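
Because X is fixed, a W produced by a routine like the `ecols_weights` sketch above need only be computed once; hypothetical glue code connecting the two sketches:

```python
W = ecols_weights(X)                        # computed once, reused at every FVI iteration
ecols_fit = lambda y: X @ W @ X.T @ y       # linear, inf-norm non-expansive smoother
v_hat = fitted_value_iteration(P, r, gamma, fit=ecols_fit)
```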

75 In each of the RL settings, ECOLS with FVI converges, and the learned value function defines a good greedy policy. [sent-179, score-0.056]

76 The ECOLS regression optimizing the Frobenius norm using CPLEX [12] took 0. [sent-211, score-0.071]

77 Figure 1 shows the regression curves produced by OLS and the two versions of ECOLS, along with the learned coefficients and root mean squared error of the predictions on the data. [sent-214, score-0.068]

78 We also ran ECOLS with an additional positivity constraint on XW̃X^T, effectively forcing the result to be an averager as described in Sect. [sent-217, score-0.111]

79 Two-state example Our second example is a classic on-policy fitted value iteration problem that is known to diverge using OLS. [sent-222, score-0.125]

80 The reward is R = [0, 0, 0]^T and the value function is v∗ = [0, 0, 0]^T. [sent-226, score-0.067]

81 Grid world Our third example is an off-policy value iteration problem which is known to diverge with OLS, due to Boyan and Moore [4]. [sent-239, score-0.125]

82 Boyan and Moore define “lucky” convergence of FVI as the case where the policy induced by the learned value function is optimal, even if the learned value function itself does not accurately represent v ∗ . [sent-246, score-0.257]

83 They found that with OLS and a design matrix Xi,: = [1, xi , yi ], they achieve lucky convergence. [sent-247, score-0.089]

84 This value function induces a policy that attempts to increase x and y, which is optimal. [sent-252, score-0.118]

85 In terms of estimating the value of states, OLS achieves an RMSE over all states of 10413. [sent-258, score-0.058]

86 This is not “lucky”, as the induced policy is only optimal for states in the upper-right half of the state space. [sent-269, score-0.159]

87 Left-or-right world Our fourth and last example is an off-policy value iteration problem with stochastic dynamics where OLS causes non-divergent but non-convergent behavior. [sent-270, score-0.08]

88 In their formulation they use A ∈ {−2, 2}, which gives an optimal policy that is approximately π ∗ (s) = {2 if s > 2. [sent-276, score-0.086]

89 We used 300 episodes' worth of data generated by the uniform random policy for learning. [sent-284, score-0.086]

90 The optimal policy induced by the Q-functions is determined solely by zeroes of Q(·, 4) − Q(·, −4), and in our experiments this function had at most one zero. [sent-286, score-0.108]

91 FQI with ECOLS converged to a near-optimal policy π̃(s) = {4 if s > 1. [sent-290, score-0.086]

92 59, whereas the value of the optimal policy π∗ is 60. [sent-293, score-0.118]

93 While the performance of the learned policy is very good, the estimate of the average value using the learned Qs, 28. [sent-295, score-0.166]

94 In this paper, we introduced ECOLS, which provides guaranteed convergence of FVI. [sent-298, score-0.056]

95 This is a further contribution of our paper: Our theoretical and empirical results indicate that this shrinkage is a necessary cost of guaranteeing convergence of FVI using linear models with a fixed set of features. [sent-302, score-0.076]

96 In some applications where accurate estimates of policy performance are required, this shrinkage may be problematic; addressing this problem is an interesting avenue for future research. [sent-304, score-0.106]

97 In other applications where the goal is to identify a good, intuitively represented value function and policy, ECOLS is a useful new tool. [sent-305, score-0.118]

98 Generalization in reinforcement learning: Safely approximating the value function. [sent-332, score-0.104]

99 Neural fitted Q iteration-first experiences with a data efficient neural reinforcement learning method. [sent-401, score-0.072]

100 Informing sequential clinical decision-making through reinforcement learning: an empirical study. [sent-426, score-0.072]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ecols', 0.602), ('xw', 0.359), ('fvi', 0.358), ('op', 0.322), ('ols', 0.296), ('dii', 0.136), ('operator', 0.102), ('tted', 0.097), ('policy', 0.086), ('rp', 0.075), ('fitted', 0.074), ('reinforcement', 0.072), ('hat', 0.07), ('averager', 0.065), ('averagers', 0.065), ('boyan', 0.065), ('approximators', 0.057), ('fqi', 0.057), ('ordinary', 0.052), ('iteration', 0.048), ('targets', 0.047), ('constraint', 0.046), ('diverge', 0.045), ('batch', 0.044), ('regression', 0.044), ('lucky', 0.043), ('operators', 0.042), ('approximator', 0.038), ('rl', 0.037), ('cvx', 0.037), ('convergence', 0.037), ('squares', 0.036), ('reward', 0.035), ('moore', 0.034), ('szepesv', 0.033), ('value', 0.032), ('mij', 0.031), ('maxi', 0.03), ('minimax', 0.03), ('safely', 0.03), ('argmin', 0.03), ('cients', 0.029), ('ri', 0.028), ('norm', 0.027), ('states', 0.026), ('tuples', 0.026), ('waterloo', 0.026), ('lizotte', 0.026), ('hy', 0.026), ('iter', 0.026), ('kq', 0.026), ('coef', 0.026), ('matrix', 0.026), ('frobenius', 0.025), ('problematic', 0.025), ('state', 0.025), ('disciplined', 0.025), ('geurts', 0.025), ('cplex', 0.025), ('constraints', 0.024), ('approximation', 0.024), ('du', 0.024), ('learned', 0.024), ('lemma', 0.024), ('suite', 0.023), ('multiplicity', 0.023), ('mdp', 0.023), ('objective', 0.022), ('constrain', 0.022), ('discount', 0.022), ('ernst', 0.022), ('tsitsiklis', 0.022), ('diverges', 0.022), ('algorithmic', 0.022), ('induced', 0.022), ('scienti', 0.022), ('quadratic', 0.021), ('residual', 0.02), ('satis', 0.02), ('programming', 0.02), ('qp', 0.02), ('regressions', 0.02), ('rank', 0.02), ('xi', 0.02), ('shrinkage', 0.02), ('guaranteed', 0.019), ('row', 0.019), ('linear', 0.019), ('bellman', 0.019), ('gordon', 0.019), ('smoother', 0.019), ('expansion', 0.019), ('converges', 0.019), ('rmse', 0.018), ('rn', 0.018), ('eigenvalue', 0.018), ('entries', 0.018), ('ij', 0.018), ('signs', 0.018), ('pi', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation

Author: Daniel J. Lizotte

Abstract: Fitted value iteration (FVI) with ordinary least squares regression is known to diverge. We present a new method, “Expansion-Constrained Ordinary Least Squares” (ECOLS), that produces a linear approximation but also guarantees convergence when used with FVI. To ensure convergence, we constrain the least squares regression operator to be a non-expansion in the ∞-norm. We show that the space of function approximators that satisfy this constraint is more rich than the space of “averagers,” we prove a minimax property of the ECOLS residual error, and we give an efficient algorithm for computing the coefficients of ECOLS based on constraint generation. We illustrate the algorithmic convergence of FVI with ECOLS in a suite of experiments, and discuss its properties. 1

2 0.10719454 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

Author: Po-ling Loh, Martin J. Wainwright

Abstract: Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependencies. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing, and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently non-convex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing non-convex programs, we are able to both analyze the statistical error associated with any global optimum, and prove that a simple projected gradient descent algorithm will converge in polynomial time to a small neighborhood of the set of global minimizers. On the statistical side, we provide non-asymptotic bounds that hold with high probability for the cases of noisy, missing, and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm will converge at geometric rates to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing agreement with the predicted scalings. 1

3 0.095986255 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

Author: Paul Wagner

Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1

4 0.094939493 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning

Author: Amir-massoud Farahmand

Abstract: Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an ˆ agent following the greedy policy π with respect to an action-value function Q, the ˆ ˆ ˆ performance loss E V ∗ (X) − V π (X) is upper bounded by O( Q − Q∗ 1+ζ ), ∞ in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. For ζ > 0, our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms. 1

5 0.081750222 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs

Author: Edouard Grave, Guillaume R. Obozinski, Francis R. Bach

Abstract: Using the 1 -norm to regularize the estimation of the parameter vector of a linear model leads to an unstable estimator when covariates are highly correlated. In this paper, we introduce a new penalty function which takes into account the correlation of the design matrix to stabilize the estimation. This norm, called the trace Lasso, uses the trace norm of the selected covariates, which is a convex surrogate of their rank, as the criterion of model complexity. We analyze the properties of our norm, describe an optimization algorithm based on reweighted least-squares, and illustrate the behavior of this norm on synthetic data, showing that it is more adapted to strong correlations than competing methods such as the elastic net. 1

6 0.074320845 69 nips-2011-Differentially Private M-Estimators

7 0.066955656 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning

8 0.05920567 215 nips-2011-Policy Gradient Coagent Networks

9 0.058421064 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions

10 0.056338031 260 nips-2011-Sparse Features for PCA-Like Linear Regression

11 0.054274991 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

12 0.054077093 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans

13 0.051569048 10 nips-2011-A Non-Parametric Approach to Dynamic Programming

14 0.051066242 258 nips-2011-Sparse Bayesian Multi-Task Learning

15 0.050507732 283 nips-2011-The Fixed Points of Off-Policy TD

16 0.050378285 265 nips-2011-Sparse recovery by thresholded non-negative least squares

17 0.045973346 196 nips-2011-On Strategy Stitching in Large Extensive Form Multiplayer Games

18 0.044994175 198 nips-2011-On U-processes and clustering performance

19 0.044947643 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

20 0.044384032 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.125), (1, -0.075), (2, 0.01), (3, -0.001), (4, -0.152), (5, 0.061), (6, -0.003), (7, 0.03), (8, 0.006), (9, 0.035), (10, 0.026), (11, 0.006), (12, -0.029), (13, 0.016), (14, -0.002), (15, -0.036), (16, 0.053), (17, 0.027), (18, 0.043), (19, 0.01), (20, -0.015), (21, 0.02), (22, 0.014), (23, 0.039), (24, -0.049), (25, 0.013), (26, -0.026), (27, -0.045), (28, 0.046), (29, -0.008), (30, 0.036), (31, 0.069), (32, 0.061), (33, 0.083), (34, 0.015), (35, -0.004), (36, 0.05), (37, 0.043), (38, 0.018), (39, -0.005), (40, -0.178), (41, -0.008), (42, -0.038), (43, 0.012), (44, -0.024), (45, 0.008), (46, 0.009), (47, 0.007), (48, -0.052), (49, -0.048)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90358913 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation

Author: Daniel J. Lizotte

Abstract: Fitted value iteration (FVI) with ordinary least squares regression is known to diverge. We present a new method, “Expansion-Constrained Ordinary Least Squares” (ECOLS), that produces a linear approximation but also guarantees convergence when used with FVI. To ensure convergence, we constrain the least squares regression operator to be a non-expansion in the ∞-norm. We show that the space of function approximators that satisfy this constraint is more rich than the space of “averagers,” we prove a minimax property of the ECOLS residual error, and we give an efficient algorithm for computing the coefficients of ECOLS based on constraint generation. We illustrate the algorithmic convergence of FVI with ECOLS in a suite of experiments, and discuss its properties. 1

2 0.61841094 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning

Author: Amir-massoud Farahmand

Abstract: Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an ˆ agent following the greedy policy π with respect to an action-value function Q, the ˆ ˆ ˆ performance loss E V ∗ (X) − V π (X) is upper bounded by O( Q − Q∗ 1+ζ ), ∞ in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. For ζ > 0, our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms. 1

3 0.60545516 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

Author: Paul Wagner

Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1

4 0.58949912 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation

Author: Tingting Zhao, Hirotaka Hachiya, Gang Niu, Masashi Sugiyama

Abstract: Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is more preferable than REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments. 1

5 0.58125758 10 nips-2011-A Non-Parametric Approach to Dynamic Programming

Author: Oliver B. Kroemer, Jan R. Peters

Abstract: In this paper, we consider the problem of policy evaluation for continuousstate systems. We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. The true form of the value function for this model can be determined, and can be computed using Galerkin’s method. Furthermore, we also present a unified view of several well-known policy evaluation methods. In particular, we show that the same Galerkin method can be used to derive Least-Squares Temporal Difference learning, Kernelized Temporal Difference learning, and a discrete-state Dynamic Programming solution, as well as our proposed method. In a numerical evaluation of these algorithms, the proposed approach performed better than the other methods. 1

6 0.57566947 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

7 0.56570911 69 nips-2011-Differentially Private M-Estimators

8 0.55295551 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization

9 0.48824531 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning

10 0.48107961 215 nips-2011-Policy Gradient Coagent Networks

11 0.4683305 239 nips-2011-Robust Lasso with missing and grossly corrupted observations

12 0.45901296 196 nips-2011-On Strategy Stitching in Large Extensive Form Multiplayer Games

13 0.45676365 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions

14 0.44571003 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection

15 0.43327487 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure

16 0.43226093 73 nips-2011-Divide-and-Conquer Matrix Factorization

17 0.42099237 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search

18 0.41881704 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data

19 0.41828877 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs

20 0.4134376 297 nips-2011-Universal low-rank matrix recovery from Pauli measurements


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.025), (4, 0.045), (11, 0.024), (20, 0.04), (26, 0.063), (31, 0.073), (33, 0.011), (43, 0.061), (45, 0.115), (53, 0.01), (57, 0.03), (59, 0.227), (65, 0.015), (74, 0.06), (83, 0.046), (99, 0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7739144 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation

Author: Daniel J. Lizotte

Abstract: Fitted value iteration (FVI) with ordinary least squares regression is known to diverge. We present a new method, “Expansion-Constrained Ordinary Least Squares” (ECOLS), that produces a linear approximation but also guarantees convergence when used with FVI. To ensure convergence, we constrain the least squares regression operator to be a non-expansion in the ∞-norm. We show that the space of function approximators that satisfy this constraint is more rich than the space of “averagers,” we prove a minimax property of the ECOLS residual error, and we give an efficient algorithm for computing the coefficients of ECOLS based on constraint generation. We illustrate the algorithmic convergence of FVI with ECOLS in a suite of experiments, and discuss its properties. 1

2 0.76570457 25 nips-2011-Adaptive Hedge

Author: Tim V. Erven, Wouter M. Koolen, Steven D. Rooij, Peter Grünwald

Abstract: Most methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods. 1

3 0.72450501 154 nips-2011-Learning person-object interactions for action recognition in still images

Author: Vincent Delaitre, Josef Sivic, Ivan Laptev

Abstract: We investigate a discriminatively trained model of person-object interactions for recognizing common human actions in still images. We build on the locally order-less spatial pyramid bag-of-features model, which was shown to perform extremely well on a range of object, scene and human action recognition tasks. We introduce three principal contributions. First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors. Second, we introduce new person-object interaction features based on spatial co-occurrences of individual body parts and objects. Third, we address the combinatorial problem of a large number of possible interaction pairs and propose a discriminative selection procedure using a linear support vector machine (SVM) with a sparsity inducing regularizer. Learning of action-specific body part and object interactions bypasses the difficult problem of estimating the complete human body pose configuration. Benefits of the proposed model are shown on human action recognition in consumer photographs, outperforming the strong bag-of-features baseline. 1

4 0.6378879 22 nips-2011-Active Ranking using Pairwise Comparisons

Author: Kevin G. Jamieson, Robert Nowak

Abstract: This paper examines the problem of ranking a collection of objects using pairwise comparisons (rankings of two objects). In general, the ranking of n objects can be identified by standard sorting methods using n log2 n pairwise comparisons. We are interested in natural situations in which relationships among the objects may allow for ranking using far fewer pairwise comparisons. Specifically, we assume that the objects can be embedded into a d-dimensional Euclidean space and that the rankings reflect their relative distances from a common reference point in Rd . We show that under this assumption the number of possible rankings grows like n2d and demonstrate an algorithm that can identify a randomly selected ranking using just slightly more than d log n adaptively selected pairwise comparisons, on average. If instead the comparisons are chosen at random, then almost all pairwise comparisons must be made in order to identify any ranking. In addition, we propose a robust, error-tolerant algorithm that only requires that the pairwise comparisons are probably correct. Experimental studies with synthetic and real datasets support the conclusions of our theoretical analysis. 1

5 0.63279712 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning

Author: Feng Yan, Yuan Qi

Abstract: For many real-world applications, we often need to select correlated variables— such as genetic variations and imaging features associated with Alzheimer’s disease—in a high dimensional space. The correlation between variables presents a challenge to classical variable selection methods. To address this challenge, the elastic net has been developed and successfully applied to many applications. Despite its great success, the elastic net does not exploit the correlation information embedded in the data to select correlated variables. To overcome this limitation, we present a novel hybrid model, EigenNet, that uses the eigenstructures of data to guide variable selection. Specifically, it integrates a sparse conditional classification model with a generative model capturing variable correlations in a principled Bayesian framework. We develop an efficient active-set algorithm to estimate the model via evidence maximization. Experimental results on synthetic data and imaging genetics data demonstrate the superior predictive performance of the EigenNet over the lasso, the elastic net, and the automatic relevance determination. 1

6 0.62889081 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

7 0.62810546 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

8 0.62775511 29 nips-2011-Algorithms and hardness results for parallel large margin learning

9 0.62721676 186 nips-2011-Noise Thresholds for Spectral Clustering

10 0.62584317 231 nips-2011-Randomized Algorithms for Comparison-based Search

11 0.6226685 80 nips-2011-Efficient Online Learning via Randomized Rounding

12 0.62252843 258 nips-2011-Sparse Bayesian Multi-Task Learning

13 0.6216619 73 nips-2011-Divide-and-Conquer Matrix Factorization

14 0.62065303 271 nips-2011-Statistical Tests for Optimization Efficiency

15 0.62009943 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

16 0.61929947 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

17 0.61858416 21 nips-2011-Active Learning with a Drifting Distribution

18 0.61807013 265 nips-2011-Sparse recovery by thresholded non-negative least squares

19 0.61777121 263 nips-2011-Sparse Manifold Clustering and Embedding

20 0.61688679 276 nips-2011-Structured sparse coding via lateral inhibition