nips nips2012 nips2012-358 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Amir M. Farahmand, Doina Precup
Abstract: Value Pursuit Iteration (VPI) is an approximate value iteration algorithm that finds a close to optimal policy for reinforcement learning problems with large state spaces. VPI has two main features: First, it is a nonparametric algorithm that finds a good sparse approximation of the optimal value function given a dictionary of features. The algorithm is almost insensitive to the number of irrelevant features. Second, after each iteration of VPI, the algorithm adds a set of functions based on the currently learned value function to the dictionary. This increases the representation power of the dictionary in a way that is directly relevant to the goal of having a good approximation of the optimal value function. We theoretically study VPI and provide a finite-sample error upper bound for it. 1
Reference: text
sentIndex sentText sentNum sentScore
1 VPI has two main features: First, it is a nonparametric algorithm that finds a good sparse approximation of the optimal value function given a dictionary of features. [sent-2, score-0.371]
2 Second, after each iteration of VPI, the algorithm adds a set of functions based on the currently learned value function to the dictionary. [sent-4, score-0.11]
3 This increases the representation power of the dictionary in a way that is directly relevant to the goal of having a good approximation of the optimal value function. [sent-5, score-0.371]
4 1 Introduction. One often has to use function approximation to represent the near-optimal value function in reinforcement learning (RL) and planning problems with large state spaces. [sent-7, score-0.15]
5 Even though the conventional approach of using a parametric model for the value function has had successes in many applications, it has one main weakness: its success critically depends on whether the chosen function approximation method is suitable for the particular task at hand. [sent-8, score-0.067]
6 One class of approaches that addresses the aforementioned problem is based on the idea of finding a sparse representation of the value function in a large dictionary of features (or atoms). [sent-15, score-0.327]
7 The feature, therefore, is simply added to the dictionary with the hope that the algorithm itself figures out the necessary subset of features. [sent-17, score-0.262]
8 Another approach is based on greedily adding atoms to the representation of the target function. [sent-23, score-0.22]
9 Johns [11] empirically investigated some greedy algorithms, including OMP, for the task of feature selection using dictionary of proto-value functions [1]. [sent-26, score-0.328]
10 A recent paper by Painter-Wakefield and Parr [12] considers two algorithms (OMP-TD and OMP-BRM; OMP-TD is the same as one of the algorithms by [11]) in the context of policy evaluation and provides some conditions under which OMP-BRM can find the minimally sparse solution. [sent-27, score-0.129]
11 The first is that it is a nonparametric algorithm that finds a good sparse approximation of the optimal value function given a set of features (dictionary), by using a modified version of OMP. [sent-36, score-0.128]
12 The second is that after each iteration, the VPI algorithm adds a set of functions based on the currently learned value function to the dictionary. [sent-37, score-0.071]
13 This potentially increases the representation power of the dictionary in a way that is directly relevant to the goal of approximating the optimal value function. [sent-38, score-0.328]
14 Using OMP allows VPI to find a sparse representation of the value function in large dictionaries, even countably infinite ones. [sent-40, score-0.104]
15 The second main feature of VPI is that it increases the size of the dictionary by adding some basis functions computed from previously learned value functions. [sent-43, score-0.329]
16 To give an intuitive understanding of how this might help, consider the dictionary B = {g1, g2, ...}, [sent-44, score-0.243]
17 in which each atom gi is a real-valued function defined on the state-action space. [sent-47, score-0.075]
18 The goal is to learn the optimal value function by a representation in the form of Q = Σ_{i≥1} w_i g_i. [sent-48, score-0.114]
19 Suppose that we are lucky and the optimal value function Q∗ belongs to the dictionary B, e.g. [sent-49, score-0.335]
20 This is indeed an ideal atom to have in the dictionary since one may have a sparse representation of the optimal value function in the form of Q∗ = Σ_{i≥1} w_i g_i with w_1 = 1 and w_i = 0 for i ≥ 2. [sent-52, score-0.425]
21 Of course, we are not usually lucky enough to have the optimal value function in our dictionary, but we may still use an approximation of the optimal value function. [sent-54, score-0.166]
22 This ensures that Qk and Qk+1 = T ∗ Qk are close enough, so one may use Qk to explain a large part of Qk+1 and use the other atoms of the dictionary to “explain” the residual. [sent-56, score-0.425]
23 In an AVI procedure, however, the estimated value function sequence (Qk )k≥1 does not necessarily converge to Q∗ , but one may hope that it gets close to a region around the optimum. [sent-57, score-0.064]
24 In that case, we may very well use the dictionary of {Q1, ..., [sent-58, score-0.243]
25 Qk} as the set of candidate atoms to be used in the representation of Qk+1. [sent-61, score-0.22]
26 We show that adding these learned atoms does not hurt and may actually help. [sent-62, score-0.197]
27 To summarize, the algorithmic contribution of this paper is to introduce the VPI algorithm, which finds a sparse representation of the optimal value function in a huge function space and increases the representation capacity of the dictionary in a problem-dependent manner. [sent-63, score-0.388]
28 B(Ω) denotes the space of bounded measurable functions w. [sent-69, score-0.08]
29 A finite-action discounted MDP is a 5-tuple (X , A, P, R, γ), where X is a measurable state space, A is a finite set of actions, P : X × A → M(X ) is the transition probability kernel, R : X × A → M(R) is the reward kernel, and γ ∈ [0, 1) is a discount factor. [sent-73, score-0.072]
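For concreteness, the 5-tuple above can be held in a small container when the state space is finite. The sketch below is an illustrative assumption, not the paper's formalism: field names are invented, and it stores expected rewards rather than the full reward kernel.

```python
from dataclasses import dataclass
import numpy as np

# Illustrative container for a finite-action discounted MDP (X, A, P, R, gamma),
# specialized to a finite state space so that P and R fit in arrays. Storing the
# expected reward r(x, a) instead of the full reward kernel is a simplification.
@dataclass
class FiniteMDP:
    P: np.ndarray      # shape (n_actions, n_states, n_states), P[a, x, x'] = P(x' | x, a)
    R: np.ndarray      # shape (n_actions, n_states), expected reward r(x, a)
    gamma: float       # discount factor in [0, 1)
```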
30 A measurable mapping π : X → A is called a deterministic Markov stationary policy, or just a policy for short. [sent-75, score-0.189]
31 A policy π induces the m-step transition probability kernels (P π )m : X → M(X ) and (P π )m : X × A → M(X × A) for m ≥ 1. [sent-76, score-0.131]
32 We use V π and Qπ to denote the value and action-value function of a policy π. [sent-77, score-0.131]
33 We also use V∗ and Q∗ for the optimal value and optimal action-value functions, with the corresponding optimal policy π∗. (Footnote: From the statistical viewpoint, and ignoring the computational difficulty of working with large dictionaries.) [sent-78, score-0.093]
34 The Bellman optimality operator is denoted by T ∗ . [sent-86, score-0.072]
35 We use (P V )(x) to denote the expected value of V after the transition according to a probability transition kernel P . [sent-87, score-0.072]
36 The class L1(B) = L1(B; ‖·‖) consists of those functions f ∈ H that admit an expansion f = Σ_{g∈B} c_g g with (c_g) an absolutely summable sequence (these definitions are quoted from Barron et al. [sent-109, score-0.21]
37 The norm of a function f in this space is defined as ‖f‖_{L1(B; ‖·‖)} ≜ inf{ Σ_{g∈B} |c_g| : f = Σ_{g∈B} c_g g }. [sent-111, score-0.092]
38 To avoid clutter, when the norm is the empirical norm ‖·‖_{Dn}, we may use L1(B; Dn) instead of L1(B; ‖·‖_{Dn}), and when the norm is ‖·‖_ν, we may use L1(B; ν). [sent-112, score-0.107]
39 For a dictionary B, we introduce a fixed exhaustion B1 ⊂ B2 ⊂ . [sent-117, score-0.272]
40 3 VPI Algorithm. In this section, we first describe the behaviour of VPI in the ideal situation when the Bellman optimality operator T∗ can be applied exactly, in order to provide an intuitive understanding of why VPI might work. [sent-135, score-0.111]
41 Afterwards, we describe the algorithm that does not have access to the Bellman optimality operator and only uses a finite sample of transitions. [sent-136, score-0.072]
42 That is, approximately apply the Bellman optimality operator T∗ to the most recent estimate Qk to get a new estimate Qk+1 ≈ T∗Qk. [sent-140, score-0.072]
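As a reference point, the exact update Qk+1 = T∗Qk can be written directly for a small finite MDP. The sketch below is illustrative only: it assumes the array layout of the FiniteMDP sketch above and exact knowledge of P and R, which VPI replaces with a sample-based, sparse approximation.

```python
import numpy as np

# Exact value-iteration sketch: Q_{k+1} = T* Q_k on a finite MDP.
# P[a] is the n_states x n_states transition matrix and R[a] the expected-reward
# vector for action a; both are assumed inputs for illustration.
def bellman_optimality_operator(Q, P, R, gamma):
    # (T*Q)(x, a) = r(x, a) + gamma * sum_{x'} P(x' | x, a) * max_{a'} Q(x', a')
    V = Q.max(axis=1)                                    # greedy value per state
    return np.stack([R[a] + gamma * P[a] @ V for a in range(len(P))], axis=1)

def exact_value_iteration(P, R, gamma, K):
    Q = np.zeros((P[0].shape[0], len(P)))                # Q_0 = 0
    for _ in range(K):
        Q = bellman_optimality_operator(Q, P, R, gamma)  # Q_{k+1} = T* Q_k
    return Q
```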
43 In this paper we would like to represent Qk+1 as a linear function of some atoms in a dictionary B = {g1, g2, ...}. [sent-144, score-0.425]
44 From a statistical viewpoint, the smallest representation among all those that have the same function approximation error is desirable, as it leads to a smaller estimation error. [sent-151, score-0.114]
45 Nevertheless, it is possible to find a “reasonable” suboptimal sparse approximation using algorithms such as OMP, which is the focus of this paper. [sent-153, score-0.065]
46 Define the new atom to be added to the representation as g^(i) ∈ argmax_{g∈B} ⟨r^(i−1), g⟩, i.e., [sent-160, score-0.103]
47 choose an element of the dictionary that has the maximum correlation with the residual. [sent-162, score-0.243]
48 Here ⟨·, ·⟩ is the inner product of a Hilbert space H(X × A) to which T∗Qk and the atoms of the dictionary belong. [sent-163, score-0.44]
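A rough sketch of this greedy step, under the assumption that the atoms are stored as columns of a matrix evaluated at the sample points and that the standard OMP recipe (absolute correlation plus a least-squares refit on the selected atoms) is used; this is not claimed to be the paper's exact procedure.

```python
import numpy as np

# Orthogonal matching pursuit sketch: pick the atom most correlated with the
# residual, then re-fit by least squares (the orthogonal projection step).
# D is an n x p matrix of dictionary atoms evaluated at n sample points and y
# holds the targets (here, samples of T*Q_k); both are assumed inputs.
def omp(D, y, n_iterations):
    residual, selected, coef = y.copy(), [], np.zeros(0)
    for _ in range(n_iterations):
        correlations = D.T @ residual                    # <r^(i-1), g> for every atom g
        j = int(np.argmax(np.abs(correlations)))         # most correlated atom
        if j not in selected:
            selected.append(j)
        coef, *_ = np.linalg.lstsq(D[:, selected], y, rcond=None)
        residual = y - D[:, selected] @ coef             # new residual r^(i)
    return selected, coef
```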
49 To quantify the approximation error at the i-th iteration, we use the L1(B; ‖·‖)-norm of the target function of the OMP algorithm, which is T∗Qk in our case (with the norm being the one induced by the inner product used in the OMP procedure). [sent-175, score-0.121]
50 Recall that this class consists of functions that admit an expansion of the form Σ_{g∈B} c_g g with (c_g) an absolutely summable sequence. [sent-176, score-0.162]
51 This depends on how expressive the dictionary B is. [sent-183, score-0.243]
52 The empirical Bellman optimality operator T̂∗ : Sn → R^n is defined as (T̂∗Q)(Xi, Ai) ≜ Ri + γ max_{a′} Q(X′i, a′) for 1 ≤ i ≤ n. [sent-201, score-0.089]
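In code, the empirical operator simply turns a batch of observed transitions into one-step targets; the transition container format below is an assumption made for illustration.

```python
import numpy as np

# Empirical Bellman optimality operator sketch: map an action-value function Q
# to the targets R_i + gamma * max_a Q(X'_i, a) over sampled transitions
# (X_i, A_i, R_i, X'_i). `Q` is any callable Q(state, action) -> float.
def empirical_bellman_targets(transitions, Q, gamma, actions):
    return np.array([r + gamma * max(Q(x_next, b) for b in actions)
                     for (x, a, r, x_next) in transitions])
```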
53 This can be a dictionary of wavelets, proto-value functions, etc. [sent-210, score-0.243]
54 The size of this dictionary can be countably infinite. [sent-211, score-0.263]
55 It also receives an integer m, which specifies how many atoms of B0 should be used by the algorithm. [sent-212, score-0.182]
56 Finally, VPI gets a set of m′ link functions σi : B(X × A, Qmax) → B(X × A, Qmax) for some m′ that is smaller than m/K. [sent-219, score-0.093]
57 Algorithm 1 Value Pursuit Iteration(B0, m, {σi}_{i=1}^{m′}, ν, K). Input: Initial dictionary B0, number of dictionary atoms used m, link functions {σi}_{i=1}^{m′}, state-action distribution ν, number of iterations K. [sent-221, score-0.721]
58 This means that we use the empirical inner product ⟨Q1, Q2⟩_{D_n^{(k)}} ≜ (1/n) Σ_{i=1}^{n} |Q1(Xi, Ai) · Q2(Xi, Ai)| for (Xi, Ai) ∈ D_n^{(k)}, and the empirical orthogonal projection. [sent-238, score-0.078]
59 Here we use a complexity regularization technique that penalizes more complex estimates Q̂^(i) (those that have more atoms in their representation). [sent-242, score-0.182]
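One way to read this step: among the candidate estimates Q̂^(i) produced after i OMP iterations, keep the one minimizing empirical error plus a penalty that grows with the number of atoms used. The penalty form below (c·i·log n / n) is a placeholder assumption, not the penalty used in the paper's analysis.

```python
import numpy as np

# Complexity-regularized model selection sketch over the OMP iterates.
# empirical_errors[i] is the empirical squared error of the estimate that uses
# i atoms; the penalty c * i * log(n) / n is an illustrative placeholder.
def select_by_complexity_regularization(empirical_errors, n, c=1.0):
    scores = [err + c * i * np.log(n) / n for i, err in enumerate(empirical_errors)]
    return int(np.argmin(scores))                        # index of the chosen estimate
```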
60 Finally, we use link functions {σi}_{i=1}^{m′} to generate m′ new atoms, which are vector-valued Qmax-bounded measurable functions from X × A to R^{|A|}, to be added to the learned dictionary Bk. [sent-246, score-0.434]
61 The link functions extract “interesting” aspects of Qk+1 , potentially by considering the current dictionary Bk Bk . [sent-247, score-0.32]
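Putting the pieces together, an illustrative outer loop could look as follows. It is a sketch under stated assumptions, not the paper's exact Algorithm 1: `fit_by_omp` stands in for the OMP-plus-model-selection steps sketched above, `sample_transitions` draws a batch from ν, and the only link function used is truncation of the new estimate itself (the one addition the theory requires).

```python
# Illustrative VPI-style outer loop (a sketch, not the paper's exact Algorithm 1).
def vpi_sketch(dictionary, K, gamma, actions, sample_transitions, fit_by_omp, q_max):
    Q = lambda x, a: 0.0                                     # Q_0 = 0
    for k in range(K):
        batch = list(sample_transitions())                   # fresh transitions from nu
        # Empirical Bellman targets: R_i + gamma * max_a' Q(X'_i, a')
        targets = [r + gamma * max(Q(x_next, b) for b in actions)
                   for (x, a, r, x_next) in batch]
        Q = fit_by_omp(dictionary, batch, targets)           # sparse fit of T*Q_k
        # Grow the dictionary with (at least) the truncated new estimate itself.
        truncated = lambda x, a, Q=Q: max(-q_max, min(q_max, Q(x, a)))
        dictionary.append(truncated)
    return Q
```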
62 VPI is quite flexible in how the new atoms are generated and how large m can be. [sent-248, score-0.182]
63 The theory allows m to be of the order of n^a (a > 0), so one may add many potentially useful atoms without much deterioration in performance. [sent-249, score-0.204]
64 Regarding the choice of the link functions, the theory requires that at least Qk+1 itself is being added to the dictionary, but it leaves other possibilities open. [sent-250, score-0.064]
65 Or one might add atoms localized in parts of the state-action space with high residual errors – a heuristic which has been used previously in basis function construction. [sent-254, score-0.197]
66 In the next section, we study the theoretical properties of the greedy policy w. [sent-256, score-0.139]
67 When the number of atoms is larger than the number of samples (i > n), one may use the Moore–Penrose pseudoinverse to perform the orthogonal projection. [sent-268, score-0.211]
68 … functions, we observe that FQI is the special case of VPI in which all atoms in the dictionary are used in the estimation. [sent-269, score-0.425]
69 Moreover, extending the dictionary decreases the function approximation error with negligible effect on the model selection error. [sent-271, score-0.34]
70 4 Theoretical Analysis. In this section, we first study how the function approximation error propagates in VPI (Section 4.1). [sent-273, score-0.076]
71 4.1 Propagation of Function Approximation Error. In this section, we present tools to upper bound the function approximation error at each iteration. [sent-278, score-0.095]
72 (II) Furthermore, for an optimal policy π∗ and an integer m ≥ 0, let ν(P^{π∗})^m ∈ M(X × A) denote the future state-action distribution obtained after m steps of following π∗, with Radon–Nikodym derivative d(ν(P^{π∗})^m)/dν. [sent-289, score-0.13]
73 The constant cν (m) shows how much we deviate from ν whenever we follow an optimal policy π ∗ . [sent-298, score-0.13]
74 It is notable that if ν happens to be the stationary distribution of the optimal policy π ∗ (e. [sent-299, score-0.149]
75 We now provide the following result that upper bounds the error caused by using Qk (which is the newly added atom to the dictionary) to approximate T ∗ Qk . [sent-302, score-0.117]
76 Let (Qi)_{i=0}^{k} ⊂ B(X × A, Qmax) be a Qmax-bounded sequence of measurable action-value functions. [sent-305, score-0.096]
77 If there was no error at earlier iterations (i.e. [sent-308, score-0.07]
78 Let (Qk)_{k=0}^{k−1} be a sequence of state-action value functions and define εi ≜ T∗Qi − Qi+1 (0 ≤ i ≤ k). [sent-318, score-0.08]
79 Then, inf_{Q′∈F^{|A|}} ‖Q′ − T∗Qk‖_ν ≤ inf_{Q′∈F^{|A|}} ‖Q′ − (T∗)^{k+1} Q0‖_ν + Σ_{i=0}^{k−1} γ^{k−i} ‖εi‖_ν. [sent-320, score-0.08]
80 This allows us to provide a tighter upper bound for the function approximation error compared to the so-called inherent Bellman error sup_{Q∈F^{|A|}} inf_{Q′∈F^{|A|}} ‖Q′ − T∗Q‖_ν introduced by Munos and Szepesvári [17], whenever the errors at previous iterations are small. [sent-322, score-0.231]
81 This performance loss indicates the regret of following the policy πK instead of an optimal policy when the initial state-action is distributed according to ρ. [sent-325, score-0.237]
82 We define the following concentrability coefficients similar to Farahmand et al. [sent-326, score-0.087]
83 For integers m1, m2 ≥ 1, a policy π, and a sequence of policies π1, ..., πk, define the concentrability coefficients [sent-333, score-0.149]
84 c_{VI1,ρ,ν}(m1, m2; π) ≜ ( E[ | d(ρ(P^{π∗})^{m1} (P^{π})^{m2}) / dν (X, A) |² ] )^{1/2} and c_{VI2,ρ,ν}(m1; π1, ..., πk) ≜ ( E[ | d(ρ(P^{πk})^{m1} P^{πk−1} P^{πk−2} ··· P^{π1}) / dν (X, A) |² ] )^{1/2}. [sent-336, score-0.087]
85 • The number of atoms m used from the dictionary B0 is m = n^a for some finite a > 0. [sent-358, score-0.447]
86 The number of link functions m′ used at each iteration is at most m/K. [sent-359, score-0.116]
87 • At iteration k, each of the link functions {σi}_{i=1}^{m′} maps β_{Qmax} Qk+1 and the dictionary Bk Bk to an element of the space of vector-valued Qmax-bounded measurable functions X × A → R^{|A|}. [sent-360, score-0.439]
88 The conditions that the number of atoms m and the number of link functions be polynomial in n are indeed very mild. [sent-370, score-0.259]
89 Furthermore, define the discounted sum of errors as E(s) ≜ Σ_{k=0}^{K−1} a_k^{2s} b_k (for s ∈ [0, 1]). [sent-394, score-0.145]
90 The value of bk is a deterministic upper bound on the error ‖Qk+1 − T∗Qk‖_ν of each iteration of VPI. [sent-397, score-0.25]
91 We would like bk to be close to zero, because the second part of the theorem implies that ‖Q∗ − Q^{πK}‖_{1,ρ} would be small too. [sent-398, score-0.12]
92 The first term inside min{·, ·} describes the behaviour of the function approximation error when we only use the predefined dictionary B0,m to approximate T ∗ Qk (see Theorem 2). [sent-402, score-0.358]
93 The second term describes the behaviour of the function approximation error when we only consider Qk as the approximant of T ∗ Qk (see Lemma 1). [sent-403, score-0.141]
94 The error caused by this approximation depends on the error made in earlier iterations. [sent-404, score-0.125]
95 The current analysis only considers the atom Qk from the learned dictionary, but VPI may actually use other atoms to represent T ∗ Qk . [sent-405, score-0.243]
96 Just as an example, if B0 is complete in L2(ν), then by letting n, m → ∞ both the estimation error and the function approximation error go to zero, and the method is consistent and converges to the optimal value function. [sent-409, score-0.156]
97 5 Conclusion This work introduced VPI, an approximate value iteration algorithm that aims to find a close to optimal policy using a dictionary of atoms (or features). [sent-410, score-0.618]
98 This allows VPI to find a sparse representation of the value function in large, and potentially overcomplete, dictionaries. [sent-412, score-0.084]
99 The error bound shows the effect of the number of samples as well as the function approximation properties of the predefined dictionary, and the effect of learned atoms. [sent-414, score-0.091]
100 A more complete theory describing the effect of adding atoms to the dictionary remains to be established. [sent-416, score-0.425]
wordName wordTfidf (topN-words)
[('qk', 0.638), ('vpi', 0.463), ('qmax', 0.244), ('dictionary', 0.243), ('atoms', 0.182), ('omp', 0.178), ('dn', 0.139), ('bk', 0.12), ('policy', 0.107), ('fqi', 0.102), ('farahmand', 0.089), ('concentrability', 0.087), ('ai', 0.071), ('pursuit', 0.069), ('bellman', 0.068), ('cg', 0.062), ('barron', 0.059), ('measurable', 0.048), ('avi', 0.047), ('ronald', 0.046), ('atom', 0.046), ('absolutely', 0.046), ('link', 0.045), ('operator', 0.043), ('csaba', 0.043), ('approximation', 0.043), ('parr', 0.042), ('szepesv', 0.042), ('ri', 0.042), ('reinforcement', 0.041), ('inf', 0.04), ('behaviour', 0.039), ('iteration', 0.039), ('representation', 0.038), ('ghavamzadeh', 0.036), ('xi', 0.034), ('munos', 0.033), ('error', 0.033), ('functions', 0.032), ('greedy', 0.032), ('qi', 0.031), ('norm', 0.03), ('dictionaries', 0.03), ('gi', 0.029), ('argmaxg', 0.029), ('bqmax', 0.029), ('exhaustion', 0.029), ('lucky', 0.029), ('optimality', 0.029), ('orthogonal', 0.029), ('mohammad', 0.027), ('transitions', 0.026), ('approximant', 0.026), ('approximator', 0.026), ('cvi', 0.026), ('rl', 0.025), ('ak', 0.025), ('johns', 0.025), ('value', 0.024), ('transition', 0.024), ('christopher', 0.024), ('sequence', 0.024), ('quoted', 0.024), ('optimal', 0.023), ('bm', 0.022), ('summable', 0.022), ('mahadevan', 0.022), ('shie', 0.022), ('na', 0.022), ('sparse', 0.022), ('selection', 0.021), ('kolter', 0.021), ('iterations', 0.021), ('prede', 0.021), ('matching', 0.021), ('countably', 0.02), ('nitions', 0.02), ('editors', 0.02), ('upper', 0.019), ('stationary', 0.019), ('mdp', 0.019), ('rmax', 0.019), ('added', 0.019), ('nds', 0.019), ('planning', 0.019), ('jeff', 0.018), ('culotta', 0.018), ('policies', 0.018), ('span', 0.017), ('empirical', 0.017), ('gets', 0.016), ('earlier', 0.016), ('belongs', 0.016), ('nonparametric', 0.016), ('deterministic', 0.015), ('learned', 0.015), ('zemel', 0.015), ('inner', 0.015), ('basis', 0.015), ('propagation', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 358 nips-2012-Value Pursuit Iteration
Author: Amir M. Farahmand, Doina Precup
Abstract: Value Pursuit Iteration (VPI) is an approximate value iteration algorithm that finds a close to optimal policy for reinforcement learning problems with large state spaces. VPI has two main features: First, it is a nonparametric algorithm that finds a good sparse approximation of the optimal value function given a dictionary of features. The algorithm is almost insensitive to the number of irrelevant features. Second, after each iteration of VPI, the algorithm adds a set of functions based on the currently learned value function to the dictionary. This increases the representation power of the dictionary in a way that is directly relevant to the goal of having a good approximation of the optimal value function. We theoretically study VPI and provide a finite-sample error upper bound for it. 1
2 0.21179424 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
Author: Bruno Scherrer, Boris Lesner
Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one can compute stationary policies that are 2γ/(1−γ)²-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to 2γ/(1−γ)-optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1
3 0.13364272 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection
Author: Shiva P. Kasiviswanathan, Huahua Wang, Arindam Banerjee, Prem Melville
Abstract: Given their pervasive use, social media, such as Twitter, have become a leading source of breaking news. A key task in the automated identification of such news is the detection of novel documents from a voluminous stream of text documents in a scalable manner. Motivated by this challenge, we introduce the problem of online ℓ1-dictionary learning where unlike traditional dictionary learning, which uses squared loss, the ℓ1-penalty is used for measuring the reconstruction error. We present an efficient online algorithm for this problem based on alternating directions method of multipliers, and establish a sublinear regret bound for this algorithm. Empirical results on news-stream and Twitter data, shows that this online ℓ1-dictionary learning algorithm for novel document detection gives more than an order of magnitude speedup over the previously known batch algorithm, without any significant loss in quality of results. 1
4 0.10915852 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks
Author: Jaldert Rombouts, Pieter Roelfsema, Sander M. Bohte
Abstract: A key function of brains is undoubtedly the abstraction and maintenance of information from the environment for later use. Neurons in association cortex play an important role in this process: by learning these neurons become tuned to relevant features and represent the information that is required later as a persistent elevation of their activity [1]. It is however not well known how such neurons acquire these task-relevant working memories. Here we introduce a biologically plausible learning scheme grounded in Reinforcement Learning (RL) theory [2] that explains how neurons become selective for relevant information by trial and error learning. The model has memory units which learn useful internal state representations to solve working memory tasks by transforming partially observable Markov decision problems (POMDP) into MDPs. We propose that synaptic plasticity is guided by a combination of attentional feedback signals from the action selection stage to earlier processing levels and a globally released neuromodulatory signal. Feedback signals interact with feedforward signals to form synaptic tags at those connections that are responsible for the stimulus-response mapping. The neuromodulatory signal interacts with tagged synapses to determine the sign and strength of plasticity. The learning scheme is generic because it can train networks in different tasks, simply by varying inputs and rewards. It explains how neurons in association cortex learn to 1) temporarily store task-relevant information in non-linear stimulus-response mapping tasks [1, 3, 4] and 2) learn to optimally integrate probabilistic evidence for perceptual decision making [5, 6]. 1
5 0.097940974 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation
Author: Figen Oztoprak, Jorge Nocedal, Steven Rennie, Peder A. Olsen
Abstract: We propose two classes of second-order optimization methods for solving the sparse inverse covariance estimation problem. The first approach, which we call the Newton-LASSO method, minimizes a piecewise quadratic model of the objective function at every iteration to generate a step. We employ the fast iterative shrinkage thresholding algorithm (FISTA) to solve this subproblem. The second approach, which we call the Orthant-Based Newton method, is a two-phase algorithm that first identifies an orthant face and then minimizes a smooth quadratic approximation of the objective function using the conjugate gradient method. These methods exploit the structure of the Hessian to efficiently compute the search direction and to avoid explicitly storing the Hessian. We also propose a limited memory BFGS variant of the orthant-based Newton method. Numerical results, including comparisons with the method implemented in the QUIC software [1], suggest that all the techniques described in this paper constitute useful tools for the solution of the sparse inverse covariance estimation problem. 1
6 0.08992292 38 nips-2012-Algorithms for Learning Markov Field Policies
7 0.081815906 251 nips-2012-On Lifting the Gibbs Sampling Algorithm
8 0.080800325 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
9 0.072360121 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
10 0.067371033 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries
11 0.065845013 30 nips-2012-Accuracy at the Top
12 0.064241491 348 nips-2012-Tractable Objectives for Robust Policy Optimization
13 0.064090267 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models
14 0.061064031 364 nips-2012-Weighted Likelihood Policy Search with Model Selection
15 0.060102787 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
16 0.059392292 282 nips-2012-Proximal Newton-type methods for convex optimization
17 0.058851663 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
18 0.058582477 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method
19 0.058577452 344 nips-2012-Timely Object Recognition
20 0.05843588 160 nips-2012-Imitation Learning by Coaching
topicId topicWeight
[(0, 0.137), (1, -0.134), (2, 0.009), (3, -0.049), (4, -0.003), (5, 0.042), (6, -0.008), (7, -0.009), (8, 0.074), (9, 0.004), (10, 0.018), (11, 0.021), (12, -0.024), (13, -0.0), (14, 0.015), (15, -0.023), (16, 0.021), (17, -0.018), (18, 0.028), (19, -0.031), (20, 0.002), (21, 0.01), (22, 0.067), (23, -0.023), (24, -0.034), (25, 0.011), (26, -0.032), (27, 0.073), (28, -0.03), (29, 0.039), (30, -0.052), (31, 0.029), (32, -0.004), (33, -0.036), (34, 0.061), (35, -0.03), (36, -0.053), (37, 0.05), (38, -0.091), (39, -0.051), (40, -0.062), (41, -0.011), (42, -0.182), (43, 0.116), (44, -0.167), (45, -0.078), (46, -0.069), (47, 0.027), (48, -0.113), (49, 0.082)]
simIndex simValue paperId paperTitle
same-paper 1 0.90982306 358 nips-2012-Value Pursuit Iteration
Author: Amir M. Farahmand, Doina Precup
Abstract: Value Pursuit Iteration (VPI) is an approximate value iteration algorithm that finds a close to optimal policy for reinforcement learning problems with large state spaces. VPI has two main features: First, it is a nonparametric algorithm that finds a good sparse approximation of the optimal value function given a dictionary of features. The algorithm is almost insensitive to the number of irrelevant features. Second, after each iteration of VPI, the algorithm adds a set of functions based on the currently learned value function to the dictionary. This increases the representation power of the dictionary in a way that is directly relevant to the goal of having a good approximation of the optimal value function. We theoretically study VPI and provide a finite-sample error upper bound for it. 1
2 0.6022113 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection
Author: Shiva P. Kasiviswanathan, Huahua Wang, Arindam Banerjee, Prem Melville
Abstract: Given their pervasive use, social media, such as Twitter, have become a leading source of breaking news. A key task in the automated identification of such news is the detection of novel documents from a voluminous stream of text documents in a scalable manner. Motivated by this challenge, we introduce the problem of online ℓ1-dictionary learning where unlike traditional dictionary learning, which uses squared loss, the ℓ1-penalty is used for measuring the reconstruction error. We present an efficient online algorithm for this problem based on alternating directions method of multipliers, and establish a sublinear regret bound for this algorithm. Empirical results on news-stream and Twitter data, shows that this online ℓ1-dictionary learning algorithm for novel document detection gives more than an order of magnitude speedup over the previously known batch algorithm, without any significant loss in quality of results. 1
3 0.56948757 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
Author: Bruno Scherrer, Boris Lesner
Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one can compute stationary policies that are 2γ/(1−γ)²-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to 2γ/(1−γ)-optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1
4 0.4701131 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method
Author: Nikhil Bhat, Vivek Farias, Ciamac C. Moallemi
Abstract: This paper presents a novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful approximation and sample complexity guarantees. In particular, we establish both theoretically and computationally that our proposal can serve as a viable alternative to state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. We accomplish this by developing a kernel-based mathematical program for ADP. Via a computational study on a controlled queueing network, we show that our procedure is competitive with parametric ADP approaches. 1
5 0.4527595 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization
Author: Doina Precup, Joelle Pineau, Andre S. Barreto
Abstract: Kernel-based stochastic factorization (KBSF) is an algorithm for solving reinforcement learning tasks with continuous state spaces which builds a Markov decision process (MDP) based on a set of sample transitions. What sets KBSF apart from other kernel-based approaches is the fact that the size of its MDP is independent of the number of transitions, which makes it possible to control the trade-off between the quality of the resulting approximation and the associated computational cost. However, KBSF’s memory usage grows linearly with the number of transitions, precluding its application in scenarios where a large amount of data must be processed. In this paper we show that it is possible to construct KBSF’s MDP in a fully incremental way, thus freeing the space complexity of this algorithm from its dependence on the number of sample transitions. The incremental version of KBSF is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve continuous MDPs in both off-line and on-line regimes. We present theoretical results showing that KBSF can approximate the value function that would be computed by conventional kernel-based learning with arbitrary precision. We empirically demonstrate the effectiveness of the proposed algorithm in the challenging threepole balancing task, in which the ability to process a large number of transitions is crucial for success. 1
6 0.4221513 364 nips-2012-Weighted Likelihood Policy Search with Model Selection
7 0.41465187 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
8 0.41433927 38 nips-2012-Algorithms for Learning Markov Field Policies
9 0.40841517 348 nips-2012-Tractable Objectives for Robust Policy Optimization
10 0.40550035 160 nips-2012-Imitation Learning by Coaching
11 0.39893132 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning
12 0.39760301 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds
13 0.3869279 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
14 0.38679865 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes
15 0.38182926 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
16 0.38119748 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks
17 0.3760426 251 nips-2012-On Lifting the Gibbs Sampling Algorithm
18 0.37292778 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation
19 0.37237045 30 nips-2012-Accuracy at the Top
20 0.36592478 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models
topicId topicWeight
[(0, 0.026), (7, 0.246), (21, 0.026), (36, 0.022), (38, 0.135), (42, 0.044), (53, 0.013), (54, 0.036), (55, 0.014), (74, 0.047), (76, 0.103), (80, 0.126), (92, 0.041)]
simIndex simValue paperId paperTitle
1 0.81930089 170 nips-2012-Large Scale Distributed Deep Networks
Author: Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, Andrew Y. Ng
Abstract: Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, and achieves state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly- sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm. 1
2 0.77926731 194 nips-2012-Learning to Discover Social Circles in Ego Networks
Author: Jure Leskovec, Julian J. Mcauley
Abstract: Our personal social networks are big and cluttered, and currently there is no good way to organize them. Social networking sites allow users to manually categorize their friends into social circles (e.g. ‘circles’ on Google+, and ‘lists’ on Facebook and Twitter), however they are laborious to construct and must be updated whenever a user’s network grows. We define a novel machine learning task of identifying users’ social circles. We pose the problem as a node clustering problem on a user’s ego-network, a network of connections between her friends. We develop a model for detecting circles that combines network structure as well as user profile information. For each circle we learn its members and the circle-specific user profile similarity metric. Modeling node membership to multiple circles allows us to detect overlapping as well as hierarchically nested circles. Experiments show that our model accurately identifies circles on a diverse set of data from Facebook, Google+, and Twitter for all of which we obtain hand-labeled ground-truth. 1
3 0.76326221 293 nips-2012-Relax and Randomize : From Value to Algorithms
Author: Sasha Rakhlin, Ohad Shamir, Karthik Sridharan
Abstract: We show a principled way of deriving online learning algorithms from a minimax analysis. Various upper bounds on the minimax value, previously thought to be non-constructive, are shown to yield algorithms. This allows us to seamlessly recover known methods and to derive new ones, also capturing such “unorthodox” methods as Follow the Perturbed Leader and the R2 forecaster. Understanding the inherent complexity of the learning problem thus leads to the development of algorithms. To illustrate our approach, we present several new algorithms, including a family of randomized methods that use the idea of a “random playout”. New versions of the Follow-the-Perturbed-Leader algorithms are presented, as well as methods based on the Littlestone’s dimension, efficient methods for matrix completion with trace norm, and algorithms for the problems of transductive learning and prediction with static experts. 1
same-paper 4 0.75209546 358 nips-2012-Value Pursuit Iteration
Author: Amir M. Farahmand, Doina Precup
Abstract: Value Pursuit Iteration (VPI) is an approximate value iteration algorithm that finds a close to optimal policy for reinforcement learning problems with large state spaces. VPI has two main features: First, it is a nonparametric algorithm that finds a good sparse approximation of the optimal value function given a dictionary of features. The algorithm is almost insensitive to the number of irrelevant features. Second, after each iteration of VPI, the algorithm adds a set of functions based on the currently learned value function to the dictionary. This increases the representation power of the dictionary in a way that is directly relevant to the goal of having a good approximation of the optimal value function. We theoretically study VPI and provide a finite-sample error upper bound for it. 1
5 0.66727549 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
Author: Sanjeev Arora, Rong Ge, Ankur Moitra, Sushant Sachdeva
Abstract: We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form y = Ax + η where A is an unknown n × n matrix and x is a random variable whose components are independent and have a fourth moment strictly less than that of a standard Gaussian random variable and η is an n-dimensional Gaussian random variable with unknown covariance Σ: We give an algorithm that provably recovers A and Σ up to an additive ε and whose running time and sample complexity are polynomial in n and 1/ε. To accomplish this, we introduce a novel “quasi-whitening” step that may be useful in other contexts in which the covariance of Gaussian noise is not known in advance. We also give a general framework for finding all local optima of a function (given an oracle for approximately finding just one) and this is a crucial step in our algorithm, one that has been overlooked in previous attempts, and allows us to control the accumulation of error when we find the columns of A one by one via local search. 1
6 0.6630069 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
7 0.66145605 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
8 0.6596691 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
9 0.65919834 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions
10 0.65863401 292 nips-2012-Regularized Off-Policy TD-Learning
11 0.65839189 200 nips-2012-Local Supervised Learning through Space Partitioning
12 0.65753001 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
13 0.65747178 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
14 0.65714043 65 nips-2012-Cardinality Restricted Boltzmann Machines
15 0.65604621 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization
16 0.65518379 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
17 0.65456247 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
18 0.65431303 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs
19 0.65381271 227 nips-2012-Multiclass Learning with Simplex Coding
20 0.65330052 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes