nips nips2012 nips2012-292 knowledge-graph by maker-knowledge-mining

292 nips-2012-Regularized Off-Policy TD-Learning


Source: pdf

Author: Bo Liu, Sridhar Mahadevan, Ji Liu

Abstract: We present a novel l1 regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying RO-TD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented. A variety of experiments are presented to illustrate the off-policy convergence, sparse feature selection capability and low computational cost of the RO-TD algorithm.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We present a novel l1 regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. [sent-5, score-0.206]

2 The algorithmic framework underlying RO-TD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. [sent-6, score-0.406]

3 A variety of experiments are presented to illustrate the off-policy convergence, sparse feature selection capability and low computational cost of the RO-TD algorithm. [sent-8, score-0.085]

4 Introduction: Temporal-difference (TD) learning is a widely used method in reinforcement learning (RL). [sent-9, score-0.073]

5 Although TD converges when samples are drawn “on-policy” by sampling from the Markov chain underlying a policy in a Markov decision process (MDP), it can be shown to be divergent when samples are drawn “off-policy”. [sent-10, score-0.08]

6 [20] introduced convergent off-policy temporal difference learning algorithms, such as TDC, whose computation time scales linearly with the number of samples and the number of features. [sent-13, score-0.161]

7 Regularizing reinforcement learning algorithms leads to more robust methods that can scale up to large problems with many potentially irrelevant features. [sent-15, score-0.073]

8 LARS-TD [7] introduced a popular approach of combining l1 regularization using Least Angle Regression (LARS) with the least-squares TD (LSTD) framework. [sent-16, score-0.035]

9 Another approach was introduced in [5] (LCP-TD) based on the Linear Complementary Problem (LCP) formulation, an optimization approach between linear programming and quadratic programming. [sent-17, score-0.047]

10 A theoretical analysis of l1 regularization was given in [4], including error bound analysis with finite samples in the on-policy setting. [sent-19, score-0.035]

11 An approximate linear programming approach for finding l1 regularized solutions of the Bellman equation was presented in [17]. [sent-21, score-0.193]

12 Another approach to feature selection is to greedily add new features, proposed recently in [15]. [sent-23, score-0.063]

13 Regularized first-order reinforcement learning approaches have recently been investigated in the on-policy setting as well, wherein convergence of l1 regularized temporal difference learning is discussed in [16] and mirror descent [6] is used in [11]. [sent-24, score-0.419]

14 In this paper, the off-policy TD learning problem is formulated from a stochastic optimization perspective. [sent-25, score-0.061]

15 A novel objective function is proposed based on the linear equation formulation of the TDC algorithm. [sent-26, score-0.186]

16 The optimization problem underlying off-policy TD methods, such as TDC, is reformulated as a convex-concave saddle-point stochastic approximation problem, which is both convex and incrementally solvable. [sent-27, score-0.114]

17 Section 2 reviews the basics of MDPs, RL and recent work on off-policy convergent TD methods, such as the TDC algorithm. [sent-30, score-0.115]

18 Section 3 introduces the proximal gradient method and the convex-concave saddle-point formulation of non-smooth convex optimization. [sent-31, score-0.249]

19 A policy π : S → A is a deterministic mapping from states to actions. [sent-36, score-0.101]

20 In linear value function approximation, a value function is assumed to lie in the linear span of a basis function matrix Φ of dimension |S| × d, where d is the number of linearly independent features. [sent-39, score-0.12]

21 TD with gradient correction (TDC) [20] aims to minimize the mean-square projected Bellman error (MSPBE) in order to guarantee off-policy convergence. [sent-45, score-0.113]
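As a point of reference for the extracted sentences above, the standard TDC update from [20] can be sketched as follows (a minimal sketch with hypothetical variable names; this is the baseline update, not the regularized RO-TD step):

```python
import numpy as np

def tdc_step(theta, w, phi, phi_next, reward, gamma, alpha, beta):
    """One step of the standard TDC update [20]; a sketch, not the RO-TD update."""
    delta = reward + gamma * phi_next @ theta - phi @ theta               # TD error
    theta = theta + alpha * (delta * phi - gamma * (phi @ w) * phi_next)  # gradient-corrected TD step
    w = w + beta * (delta - phi @ w) * phi                                # auxiliary weight update
    return theta, w
```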

22 Proximal Gradient and Saddle-Point First-Order Algorithms: We now introduce some background material from convex optimization. [sent-47, score-0.09]

23 The proximal mapping associated with a convex function h is defined as $\operatorname{prox}_h(x) = \arg\min_u \big( h(u) + \tfrac{1}{2}\|u - x\|_2^2 \big)$. The proximal mapping can be shown to be the resolvent of the subdifferential of the function h. [sent-48, score-0.29]
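For the l1 penalty h(x) = ρ‖x‖₁ used later in the paper, this proximal mapping has the familiar closed form (elementwise soft thresholding); a minimal sketch:

```python
import numpy as np

def prox_l1(x, rho):
    """prox of h(x) = rho * ||x||_1: elementwise soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - rho, 0.0)
```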

24 • For the t-th sample, $\phi_t$ (the t-th row of $\Phi$) and $\phi_t'$ (the t-th row of $\Phi'$) are the feature vectors corresponding to $s_t$ and $s_t'$, respectively. [sent-54, score-0.131]

25 $\theta_t$ is the coefficient vector for the t-th sample in first-order TD learning methods, and $\delta_t = (r_t + \gamma \phi_t'^{\top} \theta_t) - \phi_t^{\top} \theta_t$ is the temporal difference error. [sent-55, score-0.046]

26 Also, $x_t = [w_t; \theta_t]$, $\alpha_t$ is a stepsize, and $\beta_t = \eta \alpha_t$ with $\eta > 0$. [sent-56, score-0.202]

27 • $\rho$ is the l1 regularization parameter, $\lambda$ is the eligibility trace factor, N is the sample size, d is the number of basis functions, and p is the number of active basis functions. [sent-59, score-0.253]

28 With this background, we now introduce the proximal gradient method. [sent-62, score-0.142]

29 Convex-concave Saddle-Point First-Order Algorithms: The key novel contribution of our paper is a convex-concave saddle-point formulation for regularized off-policy TD learning. [sent-65, score-0.134]

30 Let x ∈ X, y ∈ Y, where X, Y are both nonempty bounded closed convex sets, and let f : X → R be a convex function. [sent-67, score-0.128]

31 If f is convex but non-smooth yet well structured, so that many existing optimization approaches requiring smoothness do not apply, its saddle-point representation ϕ is often smooth and convex. [sent-70, score-0.087]

32 A comprehensive overview on extending convex minimization to convex-concave saddle-point problems with unified variational inequalities is presented in [1]. [sent-72, score-0.064]

33 As an example, consider $f(x) = \|Ax - b\|_m$, which admits the bilinear minimax representation $f(x) := \|Ax - b\|_m = \max_{\|y\|_n \le 1} y^{\top}(Ax - b)$ (8), where m, n are conjugate numbers. [sent-73, score-0.07]
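For the Euclidean case m = n = 2, the maximizer in (8) is y* = (Ax − b)/‖Ax − b‖₂, so the identity can be checked numerically; a small illustrative sketch on random data (purely for intuition, not part of the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
A, b, x = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=3)

residual = A @ x - b
y_star = residual / np.linalg.norm(residual)                    # maximizer on the unit l2 ball
assert np.isclose(y_star @ residual, np.linalg.norm(residual))  # max_y y^T(Ax - b) = ||Ax - b||_2
```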

34 Regularized Off-policy Convergent TD-Learning: We now describe a novel algorithm, regularized off-policy convergent TD-learning (RO-TD), which combines off-policy convergence and scalability to large feature spaces. [sent-75, score-0.264]

35 The objective function is proposed based on the linear equation formulation of the TDC algorithm. [sent-76, score-0.186]

36 Then the objective function is represented via its dual minimax problem. [sent-77, score-0.041]

37 The RO-TD algorithm is proposed based on the primal-dual subgradient saddle-point algorithm, and inspired by related methods in [12],[13]. [sent-78, score-0.058]

38 Objective Function of Off-policy TD Learning: In this subsection, we describe the objective function of the regularized off-policy RL problem. [sent-80, score-0.132]

39 The first concern is that the objective function should be convex and stochastically solvable. [sent-82, score-0.173]

40 Note that A, At are neither PSD nor symmetric, and it is not straightforward to formulate a convex objective function based on them. [sent-83, score-0.105]

41 The second concern is that since we do not have knowledge of A, the objective function should be separable so that it is stochastically solvable based on At , bt . [sent-84, score-0.315]

42 The other concern regards the sampling condition in temporal difference learning: double-sampling. [sent-85, score-0.091]

43 As pointed out in [19], double-sampling is a necessary condition to obtain an unbiased estimator if the objective function is the Bellman residual or its derivatives (such as the projected Bellman residual), wherein products of Bellman error or projected Bellman error terms are involved. [sent-86, score-0.145]

44 The second intuition is to use the online least squares formulation of the linear equation Ax = b. [sent-93, score-0.145]

45 However, since A is not symmetric and positive semi-definite (PSD), $A^{1/2}$ does not exist and thus $Ax = b$ cannot be reformulated as $\min_{x \in X} \|A^{1/2} x - A^{-1/2} b\|_2^2$. [sent-94, score-0.027]

46 Another possible idea is to attempt to find an objective function whose gradient is exactly $A_t x_t - b_t$, so that the regularized gradient is $\operatorname{prox}_{\alpha_t h(x_t)}(A_t x_t - b_t)$. [sent-95, score-1.062]

47 However, since $A_t$ is not symmetric, this gradient does not explicitly correspond to any kind of optimization problem, not to mention a convex one. [sent-96, score-0.144]

48 RO-TD Algorithm Design: In this section, the problem in (13) is formulated as a convex-concave saddle-point problem, and the RO-TD algorithm is proposed. [sent-98, score-0.038]

49 Analogous to (8), the regularized loss function can be formulated as $\|Ax - b\|_m + h(x) = \max_{\|y\|_n \le 1} y^{\top}(Ax - b) + h(x)$ (14). Similar to (9), Equation (14) can be solved via an iterative procedure as follows, where $x_t = [w_t; \theta_t]$. [sent-99, score-0.331]

50 $x_{t+\frac{1}{2}} = x_t - \alpha_t A_t^{\top} y_t$, $\; y_{t+\frac{1}{2}} = y_t + \alpha_t (A_t x_t - b_t)$, $\; x_{t+1} = \operatorname{prox}_{\alpha_t h}(x_{t+\frac{1}{2}})$, $\; y_{t+1} = \Pi_n(y_{t+\frac{1}{2}})$ (15). Footnote 2: Note that the A matrix in GTD2's linear equation representation is symmetric, yet is not PSD, so it cannot be formulated as a convex problem. [sent-100, score-1.292]
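Equation (15) translates almost line by line into code. A minimal sketch, assuming h(x) = ρ‖x‖₁ (so the prox is soft thresholding) and n = 2 (so Πₙ projects onto the unit l2 ball); a dense A_t is used here only for clarity, since the paper's O(Nd) trick never forms it (see the sketch after the algorithm steps below):

```python
import numpy as np

def ro_td_step(x, y, A_t, b_t, alpha, rho):
    """One saddle-point iteration of Equation (15); an illustrative sketch."""
    x_half = x - alpha * (A_t.T @ y)          # primal descent step on y^T (A x - b)
    y_half = y + alpha * (A_t @ x - b_t)      # dual ascent step
    x_new = np.sign(x_half) * np.maximum(np.abs(x_half) - alpha * rho, 0.0)  # prox of alpha*rho*||.||_1
    y_new = y_half / max(np.linalg.norm(y_half), 1.0)  # projection onto the unit l2 ball (n = 2)
    return x_new, y_new
```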

51 However, the computation cost is actually O(Nd), using a linear algebraic trick: one computes not $A_t$ itself but the products $y_t^{\top} A_t$ and $A_t x_t - b_t$. [sent-102, score-0.671]

52 Now we are ready to present the RO-TD algorithm. Algorithm 1 (RO-TD): Let π be some fixed policy of an MDP M, and let the sample set be $S = \{(s_i, r_i, s_i')\}_{i=1}^{N}$. [sent-104, score-0.08]

53 1: repeat; 2: Compute $\phi_t$, $\phi_t'$ and the TD error $\delta_t = (r_t + \gamma \phi_t'^{\top} \theta_t) - \phi_t^{\top} \theta_t$; 3: Compute $y_t^{\top} A_t$ and $A_t x_t - b_t$ as in Equations (17) and (18). [sent-106, score-0.647]
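The products in step 3 are what keep the per-sample cost linear in d: because A_t and b_t are assembled from outer products of the feature vectors, both quantities can be computed with inner products only. The sketch below uses one consistent block structure, derived here from the standard TDC update with x_t = [w_t; θ_t] and β_t = ηα_t; it is an assumption for illustration, and the paper's own Equations (17) and (18) should be consulted for the exact form.

```python
import numpy as np

def ro_td_products(phi, phi_next, reward, x, y, gamma, eta):
    """Compute A_t x - b_t and y^T A_t in O(d) without forming the 2d x 2d matrix A_t.
    The block structure is a derivation from the TDC update (an assumption here), chosen so
    that x - alpha * (A_t x - b_t) reproduces the expected TDC step for x = [w; theta]."""
    d = phi.shape[0]
    w, theta = x[:d], x[d:]
    y_w, y_th = y[:d], y[d:]
    diff = phi - gamma * phi_next                     # phi_t - gamma * phi'_t
    Ax_minus_b = np.concatenate([                     # stacked as [w-block; theta-block]
        eta * ((phi @ w) * phi + (diff @ theta) * phi - reward * phi),
        gamma * (phi @ w) * phi_next + (diff @ theta) * phi - reward * phi,
    ])
    yTA = np.concatenate([                            # row vector y^T A_t, same block layout
        (eta * (y_w @ phi) + gamma * (y_th @ phi_next)) * phi,
        (eta * (y_w @ phi) + (y_th @ phi)) * diff,
    ])
    return Ax_minus_b, yTA
```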

54 First, the regularization term h(x) can be any kind of convex regularization, such as a ridge penalty or the sparsity penalty $\rho\|x\|_1$. [sent-108, score-0.099]

55 Algorithm 2, RO-GQ(λ), extends the RO-TD algorithm to include eligibility traces $e_t$ and the features $\bar{\phi}$ used by GQ(λ). [sent-118, score-0.304]

56 1: repeat; 2: Compute $\phi_t$, $\bar{\phi}_{t+1}$ and the TD error $\delta_t = (r_t + \gamma \bar{\phi}_{t+1}^{\top} \theta_t) - \phi_t^{\top} \theta_t$; 3: Compute $y_t^{\top} A_t$ and $A_t x_t - b_t$ as in Equation (23). [sent-121, score-0.647]

57 4: Compute $x_{t+1}$, $y_{t+1}$ as in Equation (15); 5: Choose action $a_t$, and get $s_{t+1}$; 6: Set $t \leftarrow t + 1$; 7: until $s_t$ is an absorbing state; 8: Compute $\bar{x}_t$, $\bar{y}_t$ as in Equation (16). [sent-122, score-0.558]
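The eligibility traces e_t referenced in Algorithm 2 follow the usual recursive form; a minimal sketch of a plain accumulating trace (the exact RO-GQ(λ) trace, which in general also carries importance-sampling terms, is the one given in the paper and in the GQ(λ) reference cited above):

```python
def update_trace(e, phi, gamma, lam):
    """Accumulating eligibility trace: e_t = gamma * lambda * e_{t-1} + phi_t (generic sketch)."""
    return gamma * lam * e + phi
```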

58 Assumption 2 (Basis Function) [20]: Φ is a full column rank matrix, namely, Φ comprises a linearly independent set of basis functions. [sent-127, score-0.072]

59 Assumption 3 (Subgradient Boundedness) [12]: Assume that, for the bilinear convex-concave loss function defined in (14), the sets X, Y are closed compact sets. [sent-134, score-0.046]

60 Then the subgradients $y_t^{\top} A_t$ and $A_t x_t - b_t$ in the RO-TD algorithm are uniformly bounded, i.e. [sent-135, score-0.705]

61 there exists a constant L such that $\|A_t x_t - b_t\| \le L$ and $\|y_t^{\top} A_t\| \le L$. [sent-137, score-0.647]

62 Proposition 1: The approximate saddle-point $\bar{x}_t$ of RO-TD converges w.p. [sent-138, score-0.202]

63 1 to the global minimizer of the following: $x^* = \arg\min_{x \in X} \|Ax - b\|_m + \rho \|x\|_1$ (26). Proof Sketch: See the supplementary material for details. [sent-140, score-0.026]

64 LARS-TD [7], a popular second-order sparse reinforcement learning algorithm, is used as the baseline algorithm for feature selection, and TDC is used as the off-policy convergent RL baseline algorithm. [sent-142, score-0.251]

65 Figure 2: Illustrative examples of the convergence of RO-TD using the Star and Random-walk MDPs (x-axis: sweeps). [sent-154, score-0.029]

66 MSPBE Minimization and Off-Policy Convergence: This experiment aims to show the minimization of MSPBE and off-policy convergence of the RO-TD algorithm. [sent-156, score-0.053]

67 The 7-state star MDP is a well-known counterexample where TD diverges monotonically and TDC converges. [sent-157, score-0.116]

68 Because of this, the star MDP is unsuitable for LSTD-based algorithms, including LARS-TD, since $\Phi^{\top} R = 0$ always holds. [sent-161, score-0.052]

69 The random-walk problem is a standard Markov chain with 5 states and two absorbing states at the two ends. [sent-162, score-0.08]

70 Three different sets of bases Φ are used in [20]: tabular features, inverted features, and dependent features. [sent-163, score-0.138]

71 The regularization term h(x) is set to 0 to make a fair comparison with TD and TDC. [sent-165, score-0.035]

72 The middle subfigure shows the values of $y_t^{\top}(Ax_t - b)$ and $\|Ax_t - b\|_2$, wherein $\|Ax_t - b\|_2$ is always greater than the value of $y_t^{\top}(Ax_t - b)$. [sent-169, score-0.538]

73 As the result shows, TDC and RO-TD perform equally well, which illustrates the off-policy convergence of the RO-TD algorithm. [sent-171, score-0.029]

74 Feature Selection: In this section, we use the mountain car example with a variety of bases to show the feature selection capability of RO-TD. [sent-177, score-0.171]

75 The mountain car MDP is an optimal control problem with a continuous two-dimensional state space. [sent-178, score-0.051]

76 The steep discontinuity in the value function makes learning difficult for bases with global support. [sent-179, score-0.033]

77 To make a fair comparison, we use the same basis function setting as in [7], where two-dimensional grids of 2, 4, 8, 16, and 32 RBFs are used, so that there are 1365 basis functions in total. [sent-180, score-0.096]
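A multi-resolution RBF feature map of this kind can be sketched as follows (a hypothetical construction: the state is assumed rescaled to [0,1]², bandwidths are tied to grid resolution, and a constant feature is assumed to bring the 4+16+64+256+1024 = 1364 RBFs up to the 1365 reported above):

```python
import numpy as np

def rbf_features(state, grid_sizes=(2, 4, 8, 16, 32)):
    """Multi-resolution 2-D RBF features; an illustrative sketch, not the exact setting of [7]."""
    feats = [1.0]                                          # assumed constant feature
    for k in grid_sizes:
        axis = np.linspace(0.0, 1.0, k)
        centers = np.stack(np.meshgrid(axis, axis), axis=-1).reshape(-1, 2)  # k*k grid of centers
        sigma = 1.0 / k                                    # bandwidth tied to grid spacing (assumed)
        feats.extend(np.exp(-np.sum((centers - state) ** 2, axis=1) / (2.0 * sigma ** 2)))
    return np.asarray(feats)                               # length 1 + 1364 = 1365

phi = rbf_features(np.array([0.3, 0.7]))                   # example usage
```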

78 For RO-TD and TDC, 3000 samples are used, obtained by executing 15 episodes with 200 steps each, with stepsize αt = 0. [sent-182, score-0.093]

79 As the results in Table 1 show, RO-TD is able to perform feature selection successfully, whereas TDC and TD failed. [sent-187, score-0.063]

80 High-dimensional Under-actuated Systems: The triple-link inverted pendulum [18] is a highly nonlinear under-actuated system with an 8-dimensional state space and a discrete action space. [sent-203, score-0.195]

81 The state space consists of the angle and angular velocity of each arm as well as the position and velocity of the car. [sent-204, score-0.081]

82 The goal is to learn a policy that can balance the arms for Nx steps within some minimum number of learning episodes. [sent-206, score-0.08]

83 The pendulum starts from the zero-equilibrium state, and the first action is chosen randomly to push the pendulum away from the initial state. [sent-208, score-0.201]

84 Fourier basis [8] with order 2 is used, resulting in 6561 basis functions. [sent-211, score-0.096]
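An order-2 Fourier basis over an 8-dimensional state (assumed rescaled to [0,1]⁸) yields (2+1)⁸ = 6561 features, matching the count above; a minimal sketch of the construction from [8]:

```python
import itertools
import numpy as np

def fourier_features(state, order=2):
    """Fourier basis [8]: phi_c(s) = cos(pi * c . s) for all c in {0,...,order}^dim(s).
    For an 8-dimensional state and order 2 this gives 3**8 = 6561 features."""
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=len(state))))
    return np.cos(np.pi * coeffs @ state)

phi = fourier_features(np.zeros(8))   # example usage: 6561-dimensional feature vector
```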

85 To sum up, RO-GQ(λ) tends to outperform GQ(λ) in all aspects, and is able to outperform LARS-TD-based policy iteration in high-dimensional domains, as well as in selected smaller MDPs where LARS-TD diverges (e. [sent-213, score-0.117]

86 It is worth noting that the computation cost of LARS-TD is $O(Ndp^3)$, whereas that of RO-TD is $O(Nd)$. [sent-217, score-0.051]

87 We also find that tuning the sparsity parameter $\rho_2$ generates an interpolation between GQ(λ) and TD learning, where a large $\rho_2$ helps eliminate the correction term of the TDC update and makes the update direction more similar to the TD update. [sent-222, score-0.034]

88 Conclusions: This paper presents a novel unified framework for designing regularized off-policy convergent RL algorithms, combining a convex-concave saddle-point problem formulation for RL with stochastic first-order methods. [sent-223, score-0.249]

89 A detailed experimental analysis reveals that the proposed RO-TD algorithm is both off-policy convergent and is robust to noisy features. [sent-224, score-0.115]

90 One direction for future work is to extend the subgradient saddle-point solver to a more generalized mirror descent framework. [sent-226, score-0.178]

91 Mirror descent is a generalization of subgradient descent with non-Euclidean distance [1], and has many advantages over gradient descent in high-dimensional spaces. [sent-227, score-0.265]

92 In [6], two algorithms to solve the bilinear saddle-point formulation are proposed based on mirror descent and the extragradient [9], such as the Mirror-Prox algorithm. [sent-228, score-0.27]

93 Acknowledgments This material is based upon work supported by the Air Force Office of Scientific Research (AFOSR) under grant FA9550-10-1-0383, and the National Science Foundation under Grant Nos. [sent-231, score-0.026]

94 NSF CCF1025120, IIS-0534999, IIS-0803288, and IIS-1216467. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the AFOSR or the NSF. [sent-232, score-0.026]

95 Non-Euclidean restricted memory level method for large-scale convex optimization. [sent-239, score-0.064]

96 Regularization and feature selection in least-squares temporal difference learning. [sent-277, score-0.109]

97 Value function approximation in reinforcement learning using the fourier basis. [sent-282, score-0.095]

98 The extragradient method for finding saddle points and other problems. [sent-287, score-0.061]

99 GQ (λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. [sent-294, score-0.179]

100 Feature selection using regularization in approximate linear programs for Markov decision processes. [sent-334, score-0.093]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('tdc', 0.529), ('td', 0.347), ('mspbe', 0.322), ('yt', 0.239), ('bt', 0.206), ('xt', 0.202), ('ax', 0.181), ('wt', 0.163), ('gq', 0.131), ('eligibility', 0.122), ('convergent', 0.115), ('bellman', 0.096), ('regularized', 0.091), ('lstd', 0.088), ('proximal', 0.085), ('axt', 0.084), ('policy', 0.08), ('equation', 0.078), ('rt', 0.074), ('reinforcement', 0.073), ('mirror', 0.07), ('pendulum', 0.07), ('mdp', 0.07), ('inverted', 0.064), ('convex', 0.064), ('prox', 0.062), ('extragradient', 0.061), ('rl', 0.06), ('wherein', 0.06), ('subgradient', 0.058), ('gradient', 0.057), ('proxh', 0.056), ('star', 0.052), ('st', 0.051), ('descent', 0.05), ('basis', 0.048), ('ut', 0.047), ('temporal', 0.046), ('bilinear', 0.046), ('maei', 0.046), ('pss', 0.046), ('rotd', 0.046), ('supy', 0.046), ('concern', 0.045), ('sweeps', 0.045), ('stepsize', 0.043), ('formulation', 0.043), ('psd', 0.042), ('objective', 0.041), ('tabular', 0.041), ('formulated', 0.038), ('comprised', 0.037), ('diverges', 0.037), ('zico', 0.037), ('sub', 0.036), ('regularization', 0.035), ('mahadevan', 0.035), ('et', 0.035), ('reward', 0.035), ('action', 0.034), ('correction', 0.034), ('mdps', 0.034), ('selection', 0.034), ('nx', 0.033), ('bases', 0.033), ('sutton', 0.033), ('absorbing', 0.032), ('dantzig', 0.031), ('selector', 0.031), ('rmax', 0.03), ('convergence', 0.029), ('mountain', 0.029), ('feature', 0.029), ('juditsky', 0.028), ('lazaric', 0.028), ('episodes', 0.027), ('state', 0.027), ('velocity', 0.027), ('reformulated', 0.027), ('material', 0.026), ('worth', 0.026), ('transition', 0.026), ('traces', 0.025), ('tt', 0.025), ('noting', 0.025), ('operator', 0.025), ('conjugate', 0.024), ('car', 0.024), ('experiment', 0.024), ('linear', 0.024), ('executing', 0.023), ('stochastically', 0.023), ('afosr', 0.023), ('optimization', 0.023), ('conference', 0.023), ('capability', 0.022), ('fourier', 0.022), ('projected', 0.022), ('states', 0.021), ('markov', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 292 nips-2012-Regularized Off-Policy TD-Learning

Author: Bo Liu, Sridhar Mahadevan, Ji Liu

Abstract: We present a novel l1 regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying ROTD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented. A variety of experiments are presented to illustrate the off-policy convergence, sparse feature selection capability and low computational cost of the RO-TD algorithm. 1

2 0.26641354 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1

3 0.2338025 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

Author: Claudio Gentile, Francesco Orabona

Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1

4 0.18946961 324 nips-2012-Stochastic Gradient Descent with Only One Projection

Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi

Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, √ the proposed algorithms achieve an O(1/ T ) convergence rate for general convex optimization, and an O(ln T /T ) rate for strongly convex optimization under mild conditions about the domain and the objective function. 1

5 0.1842228 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

6 0.18045515 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

7 0.14584194 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks

8 0.14198744 293 nips-2012-Relax and Randomize : From Value to Algorithms

9 0.13967903 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

10 0.13740301 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

11 0.13463148 41 nips-2012-Ancestor Sampling for Particle Gibbs

12 0.11784292 344 nips-2012-Timely Object Recognition

13 0.11690348 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

14 0.11089411 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

15 0.10233539 151 nips-2012-High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer's Disease Progression Prediction

16 0.099385396 303 nips-2012-Searching for objects driven by context

17 0.099261172 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

18 0.098774247 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

19 0.09745691 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming

20 0.096690074 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.231), (1, -0.177), (2, 0.169), (3, 0.221), (4, 0.124), (5, -0.071), (6, -0.016), (7, -0.108), (8, 0.07), (9, -0.011), (10, 0.021), (11, -0.065), (12, 0.027), (13, 0.101), (14, 0.065), (15, 0.004), (16, 0.024), (17, 0.002), (18, 0.067), (19, -0.037), (20, 0.02), (21, 0.03), (22, 0.005), (23, -0.073), (24, 0.027), (25, -0.086), (26, -0.033), (27, 0.074), (28, 0.037), (29, 0.02), (30, 0.039), (31, -0.035), (32, -0.015), (33, -0.006), (34, 0.026), (35, 0.006), (36, 0.02), (37, -0.03), (38, 0.002), (39, -0.009), (40, -0.033), (41, 0.036), (42, 0.019), (43, -0.11), (44, -0.001), (45, -0.062), (46, 0.006), (47, 0.077), (48, -0.078), (49, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94434804 292 nips-2012-Regularized Off-Policy TD-Learning

Author: Bo Liu, Sridhar Mahadevan, Ji Liu

Abstract: We present a novel l1 regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying ROTD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented. A variety of experiments are presented to illustrate the off-policy convergence, sparse feature selection capability and low computational cost of the RO-TD algorithm. 1

2 0.73148447 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1

3 0.69665605 293 nips-2012-Relax and Randomize : From Value to Algorithms

Author: Sasha Rakhlin, Ohad Shamir, Karthik Sridharan

Abstract: We show a principled way of deriving online learning algorithms from a minimax analysis. Various upper bounds on the minimax value, previously thought to be non-constructive, are shown to yield algorithms. This allows us to seamlessly recover known methods and to derive new ones, also capturing such “unorthodox” methods as Follow the Perturbed Leader and the R2 forecaster. Understanding the inherent complexity of the learning problem thus leads to the development of algorithms. To illustrate our approach, we present several new algorithms, including a family of randomized methods that use the idea of a “random playout”. New versions of the Follow-the-Perturbed-Leader algorithms are presented, as well as methods based on the Littlestone’s dimension, efficient methods for matrix completion with trace norm, and algorithms for the problems of transductive learning and prediction with static experts. 1

4 0.68413973 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

Author: Claudio Gentile, Francesco Orabona

Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1

5 0.65140623 324 nips-2012-Stochastic Gradient Descent with Only One Projection

Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi

Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, √ the proposed algorithms achieve an O(1/ T ) convergence rate for general convex optimization, and an O(ln T /T ) rate for strongly convex optimization under mild conditions about the domain and the objective function. 1

6 0.64938831 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

7 0.64723742 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

8 0.57246679 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

9 0.57168567 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies

10 0.56823164 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

11 0.56109166 151 nips-2012-High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer's Disease Progression Prediction

12 0.5550378 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

13 0.49222052 283 nips-2012-Putting Bayes to sleep

14 0.48471203 305 nips-2012-Selective Labeling via Error Bound Minimization

15 0.48406756 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

16 0.48396793 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

17 0.48379979 41 nips-2012-Ancestor Sampling for Particle Gibbs

18 0.47682369 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

19 0.47373518 161 nips-2012-Interpreting prediction markets: a stochastic approach

20 0.46425956 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.052), (1, 0.011), (21, 0.03), (38, 0.137), (39, 0.012), (42, 0.055), (45, 0.191), (54, 0.075), (55, 0.014), (68, 0.021), (74, 0.043), (76, 0.102), (80, 0.108), (92, 0.063)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89232385 45 nips-2012-Approximating Equilibria in Sequential Auctions with Incomplete Information and Multi-Unit Demand

Author: Amy Greenwald, Jiacui Li, Eric Sodomka

Abstract: In many large economic markets, goods are sold through sequential auctions. Examples include eBay, online ad auctions, wireless spectrum auctions, and the Dutch flower auctions. In this paper, we combine methods from game theory and decision theory to search for approximate equilibria in sequential auction domains, in which bidders do not know their opponents’ values for goods, bidders only partially observe the actions of their opponents’, and bidders demand multiple goods. We restrict attention to two-phased strategies: first predict (i.e., learn); second, optimize. We use best-reply dynamics [4] for prediction (i.e., to predict other bidders’ strategies), and then assuming fixed other-bidder strategies, we estimate and solve the ensuing Markov decision processes (MDP) [18] for optimization. We exploit auction properties to represent the MDP in a more compact state space, and we use Monte Carlo simulation to make estimating the MDP tractable. We show how equilibria found using our search procedure compare to known equilibria for simpler auction domains, and we approximate an equilibrium for a more complex auction domain where analytical solutions are unknown. 1

same-paper 2 0.8161127 292 nips-2012-Regularized Off-Policy TD-Learning

Author: Bo Liu, Sridhar Mahadevan, Ji Liu

Abstract: We present a novel l1 regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying ROTD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented. A variety of experiments are presented to illustrate the off-policy convergence, sparse feature selection capability and low computational cost of the RO-TD algorithm. 1

3 0.81306118 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models

Author: Qixia Jiang, Jun Zhu, Maosong Sun, Eric P. Xing

Abstract: An effective strategy to exploit the supervising side information for discovering predictive topic representations is to impose discriminative constraints induced by such information on the posterior distributions under a topic model. This strategy has been adopted by a number of supervised topic models, such as MedLDA, which employs max-margin posterior constraints. However, unlike the likelihoodbased supervised topic models, of which posterior inference can be carried out using the Bayes’ rule, the max-margin posterior constraints have made Monte Carlo methods infeasible or at least not directly applicable, thereby limited the choice of inference algorithms to be based on variational approximation with strict mean field assumptions. In this paper, we develop two efficient Monte Carlo methods under much weaker assumptions for max-margin supervised topic models based on an importance sampler and a collapsed Gibbs sampler, respectively, in a convex dual formulation. We report thorough experimental results that compare our approach favorably against existing alternatives in both accuracy and efficiency.

4 0.78047109 271 nips-2012-Pointwise Tracking the Optimal Regression Function

Author: Yair Wiener, Ran El-Yaniv

Abstract: This paper examines the possibility of a ‘reject option’ in the context of least squares regression. It is shown that using rejection it is theoretically possible to learn ‘selective’ regressors that can ǫ-pointwise track the best regressor in hindsight from the same hypothesis class, while rejecting only a bounded portion of the domain. Moreover, the rejected volume vanishes with the training set size, under certain conditions. We then develop efficient and exact implementation of these selective regressors for the case of linear regression. Empirical evaluation over a suite of real-world datasets corroborates the theoretical analysis and indicates that our selective regressors can provide substantial advantage by reducing estimation error.

5 0.77872169 330 nips-2012-Supervised Learning with Similarity Functions

Author: Purushottam Kar, Prateek Jain

Abstract: We address the problem of general supervised learning when data can only be accessed through an (indefinite) similarity function between data points. Existing work on learning with indefinite kernels has concentrated solely on binary/multiclass classification problems. We propose a model that is generic enough to handle any supervised learning task and also subsumes the model previously proposed for classification. We give a “goodness” criterion for similarity functions w.r.t. a given supervised learning task and then adapt a well-known landmarking technique to provide efficient algorithms for supervised learning using “good” similarity functions. We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking where we show that our method guarantees bounded generalization error. Furthermore, for the case of real-valued regression, we give a natural goodness definition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves significantly higher accuracies than the baseline methods while offering reduced computational costs. 1

6 0.74255133 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

7 0.74120533 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

8 0.73820281 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

9 0.73777384 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

10 0.73644531 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

11 0.73240942 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

12 0.73056406 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

13 0.72957546 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

14 0.72904384 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

15 0.72888529 65 nips-2012-Cardinality Restricted Boltzmann Machines

16 0.72852433 38 nips-2012-Algorithms for Learning Markov Field Policies

17 0.72826493 160 nips-2012-Imitation Learning by Coaching

18 0.7264173 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

19 0.72587007 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

20 0.72561598 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models