nips nips2013 nips2013-79 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Adhiraj Somani, Nan Ye, David Hsu, Wee Sun Lee
Abstract: POMDPs provide a principled framework for planning under uncertainty, but are computationally intractable, due to the “curse of dimensionality” and the “curse of history”. This paper presents an online POMDP algorithm that alleviates these difficulties by focusing the search on a set of randomly sampled scenarios. A Determinized Sparse Partially Observable Tree (DESPOT) compactly captures the execution of all policies on these scenarios. Our Regularized DESPOT (R-DESPOT) algorithm searches the DESPOT for a policy, while optimally balancing the size of the policy and its estimated value obtained under the sampled scenarios. We give an output-sensitive performance bound for all policies derived from a DESPOT, and show that R-DESPOT works well if a small optimal policy exists. We also give an anytime algorithm that approximates R-DESPOT. Experiments show strong results, compared with two of the fastest online POMDP algorithms. Source code along with experimental settings are available at http://bigbird.comp. nus.edu.sg/pmwiki/farm/appl/. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Our Regularized DESPOT (R-DESPOT) algorithm searches the DESPOT for a policy, while optimally balancing the size of the policy and its estimated value obtained under the sampled scenarios. [sent-8, score-0.462]
2 We give an output-sensitive performance bound for all policies derived from a DESPOT, and show that R-DESPOT works well if a small optimal policy exists. [sent-9, score-0.494]
3 1 Introduction Partially observable Markov decision processes (POMDPs) provide a principled general framework for planning in partially observable stochastic environments. [sent-17, score-0.353]
4 Finally, the number of action-observation histories that must be considered for POMDP planning grows exponentially with the planning horizon. [sent-23, score-0.414]
5 To address these difficulties, online POMDP planning (see [17] for a survey) chooses one action at a time and interleaves planning and plan execution. [sent-25, score-0.697]
6 It plans the immediate next action for the current belief only and reasons in the neighborhood of the current belief, rather than over the entire belief space. [sent-27, score-0.568]
7 POMCP, which is based on Monte Carlo tree search, tries to break the two curses by sampling states from the current belief and sampling histories with a black-box simulator. [sent-30, score-0.356]
8 It enjoys the same strengths as POMCP—breaking the two curses through sampling—but avoids POMCP’s extremely poor worst-case behavior by evaluating policies on a small number of sampled scenarios [13]. [sent-40, score-0.289]
9 In each planning step, the algorithm searches for a good policy derived from a Determinized Sparse Partially Observable Tree (DESPOT) for the current belief, and executes the policy for one step. [sent-41, score-1.043]
10 It is structurally similar to a standard belief tree, but contains only the belief nodes reachable under the K scenarios. (Footnote 1: composition of D − 1 exponential functions.) [sent-43, score-0.597]
11 While a belief tree of height D contains O(|A|^D |Z|^D) nodes, where |A| and |Z| are the sizes of the action set and the observation set, respectively, a corresponding DESPOT contains only O(|A|^D K) nodes, leading to a dramatic improvement in computational efficiency when K is small. [sent-46, score-0.57]
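To make this size gap concrete, a quick back-of-the-envelope computation (the values chosen for |A|, |Z|, D and K below are purely illustrative, not taken from the paper's experiments):

```python
# Rough node counts for a full belief tree versus a DESPOT of the same height.
# The numbers for |A|, |Z|, D and K are illustrative only.
A, Z, D, K = 5, 10, 10, 500

belief_tree_nodes = sum((A * Z) ** d for d in range(D + 1))    # O(|A|^D |Z|^D)
despot_nodes_bound = K * sum(A ** d for d in range(D + 1))     # O(|A|^D K) upper bound

print(f"full belief tree : ~{belief_tree_nodes:.2e} nodes")
print(f"DESPOT, K={K}    : ~{despot_nodes_bound:.2e} nodes")
```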
12 One main result of this work is an output-sensitive bound, showing that a small number of sampled scenarios is sufficient to give a good estimate of the true value of any policy π, provided that the size of π is small (Section 3). [sent-47, score-0.628]
13 (Figure 1: a belief tree of height D = 2 (gray) and a corresponding DESPOT (black) obtained with 2 sampled scenarios; every tree node represents a belief.) [sent-48, score-0.537]
14 Our Regularized DESPOT (R-DESPOT) algorithm interprets this lower bound as a regularized utility function, which it uses to optimally balance the size of a policy and its estimated performance under the sampled scenarios. [sent-51, score-0.516]
15 We show that R-DESPOT computes a near-optimal policy whenever a small optimal policy exists (Section 4). [sent-52, score-0.734]
16 2 Related Work There are two main approaches to POMDP planning: offline policy computation and online search. [sent-55, score-0.441]
17 In offline planning, the agent computes beforehand a policy contingent upon all possible future scenarios and executes the computed policy based on the observations received. [sent-56, score-1.051]
18 In contrast, online planning interleaves planning and plan execution. [sent-60, score-0.51]
19 The agent searches for a single best action for the current belief only, executes the action, and updates the belief. [sent-61, score-0.574]
20 A recent survey [17] lists three main categories of online planning algorithms: heuristic search, branch-and-bound pruning, and Monte Carlo sampling. [sent-63, score-0.304]
21 However, AR-DESPOT balances the size of a policy and its estimated performance during the online search, resulting in improved performance for suitable planning tasks. [sent-65, score-0.648]
22 In contrast, DESPOT algorithms represent the belief as a set of particles, just as POMCP [18] does, and do not perform belief update during the online search. [sent-70, score-0.486]
23 Online search and offline policy computation are complementary and can be combined, e. [sent-71, score-0.421]
24 , by using approximate or partial policies computed offline as the default policies at the bottom of the search tree for online planning (e. [sent-73, score-0.663]
25 At time t, it updates the belief bt according to Bayes’ rule by incorporating information from the action taken at time t − 1 and the resulting observation: bt = τ (bt−1 , at−1 , zt ). [sent-81, score-0.428]
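Since the algorithms discussed later represent the belief as a set of particles, the update τ(b, a, z) is in practice approximated with a particle filter. The following sketch only illustrates that idea; the generative model `step(s, a)` and the observation likelihood `obs_prob(z, s, a)` are assumed helper functions, not names from the paper.

```python
import random

def particle_belief_update(particles, a, z, step, obs_prob, n_particles=None):
    """Approximate tau(b, a, z): propagate each particle through action a,
    weight it by the likelihood of the received observation z, then resample."""
    n_particles = n_particles or len(particles)
    propagated, weights = [], []
    for s in particles:
        s_next = step(s, a)            # sample s' from the transition model
        w = obs_prob(z, s_next, a)     # weight by the observation likelihood
        if w > 0.0:
            propagated.append(s_next)
            weights.append(w)
    if not propagated:                 # particle depletion; the caller must recover
        return []
    return random.choices(propagated, weights=weights, k=n_particles)
```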
26 A policy π : B → A specifies the action a ∈ A at belief b ∈ B. [sent-82, score-0.729]
27 The value of a policy π at a belief b is the expected total discounted reward obtained by following π with initial belief b: Vπ(b) = E[ ∑_{t=0}^{∞} γ^t R(s_t, π(b_t)) | b_0 = b ], for some discount factor γ ∈ [0, 1). [sent-83, score-0.911]
28 One way of online POMDP planning is to construct a belief tree (Figure 1), with the current belief b0 as the initial belief at the root of the tree, and perform lookahead search on the tree for a policy π that maximizes Vπ(b0). [sent-84, score-1.69]
29 A node branches into |A| action edges, and each action edge branches further into |Z| observation edges. [sent-86, score-0.616]
30 At each leaf node, we simulate a default policy to obtain a lower bound on its value. [sent-89, score-0.513]
31 The results are an approximately optimal policy π̂, represented as a policy tree, and the corresponding value Vπ̂(b0). [sent-91, score-0.734]
32 A policy tree retains only the chosen action branches, but all observation branches from the belief tree (see footnote 2). [sent-92, score-0.98]
33 The size of such a policy is the number of tree nodes. [sent-93, score-0.493]
34 At each time step, we search for a policy π̂, as described above. [sent-98, score-0.421]
35 The agent executes the first action a of π̂ and receives a new observation z. [sent-99, score-0.377]
36 2 DESPOT While a standard belief tree captures the execution of all policies under all possible scenarios, a DESPOT captures the execution of all policies under a set of sampled scenarios (Figure 1). [sent-103, score-0.726]
37 It contains all the action branches, but only the observation branches under the sampled scenarios. [sent-104, score-0.34]
38 We define DESPOT constructively by applying a deterministic simulative model to all possible action sequences under K scenarios sampled from an initial belief b0 . [sent-105, score-0.612]
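One way to read this constructive definition: a scenario fixes a start state together with a stream of random numbers, so simulating any action sequence under it is deterministic, and scenarios that share an action-observation path are grouped at the same belief node. The sketch below illustrates that construction under assumptions; the node layout and the deterministic step function `simulate(state, action, phi, depth)`, returning a next state, an observation and a reward, are placeholders rather than the authors' implementation.

```python
import collections

class DespotNode:
    """A DESPOT belief node: the scenarios that reach it and its children."""
    def __init__(self, scenarios, depth):
        self.scenarios = scenarios       # list of (state, phi) pairs
        self.depth = depth
        self.children = {}               # (action, observation) -> DespotNode

def expand(node, actions, simulate):
    """Add all action branches, but only the observation branches that actually
    occur under the node's scenarios (this is what keeps the tree sparse)."""
    for a in actions:
        groups = collections.defaultdict(list)
        for state, phi in node.scenarios:
            # The random number phi[depth] fixes the outcome, so the same
            # scenario always produces the same successor and observation.
            next_state, obs, _reward = simulate(state, a, phi, node.depth)
            groups[obs].append((next_state, phi))
        for obs, scenarios in groups.items():
            node.children[(a, obs)] = DespotNode(scenarios, node.depth + 1)

def build_despot(sample_b0, K, D, actions, simulate, rng):
    """Build a DESPOT of height D from K scenarios sampled from belief b0."""
    scenarios = [(sample_b0(rng), [rng.random() for _ in range(D)]) for _ in range(K)]
    root = DespotNode(scenarios, depth=0)
    frontier = [root]
    for _ in range(D):
        next_frontier = []
        for node in frontier:
            expand(node, actions, simulate)
            next_frontier.extend(node.children.values())
        frontier = next_frontier
    return root
```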
39 The resulting scenario is inserted into the set Φbt for the belief node bt reached at the end of the subpath (a1, z1, a2, z2, . . .). [sent-138, score-0.338]
40 Intuitively, a DESPOT is a standard belief tree with some observation branches removed. [sent-147, score-0.457]
41 While a belief tree of height D has O(|A|^D |Z|^D) nodes, a corresponding DESPOT has only O(|A|^D K) nodes, because of reduced observation branching under the sampled scenarios. [sent-148, score-0.473]
42 To evaluate a policy π under sampled scenarios, define V̂π,φ as the total discounted reward of the trajectory obtained by simulating π under a scenario φ. [sent-150, score-0.593]
43 We then apply the usual belief tree search from the previous subsection to a DESPOT to find a policy having good performance under the sampled scenarios. [sent-152, score-0.812]
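The resulting estimate is simply the average of the per-scenario returns, V̂π = (1/K) Σφ V̂π,φ. A minimal sketch, reusing the assumed `simulate` helper from the construction sketch above and treating the policy as a function of the action-observation history:

```python
def estimate_policy_value(policy, scenarios, simulate, gamma, D):
    """V_hat(pi) = (1/K) * sum over scenarios of the discounted return obtained
    by running pi for D steps under that scenario."""
    total = 0.0
    for state, phi in scenarios:
        ret, history = 0.0, ()
        for depth in range(D):
            a = policy(history)                                # history -> action
            state, obs, reward = simulate(state, a, phi, depth)
            ret += (gamma ** depth) * reward
            history += ((a, obs),)
        total += ret
    return total / len(scenarios)
```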
44 The idea of using sampled scenarios for planning is exploited in hindsight optimization (HO) as well [3, 22]. [sent-154, score-0.407]
45 In contrast, DESPOT captures all K scenarios in a single tree with O(|A|^D K) nodes and allows us to reason with all scenarios simultaneously. [sent-156, score-0.452]
46 2 A policy tree can be represented more compactly by labeling each node by the action edge that follows and then removing the action edge. [sent-158, score-0.904]
47 4 Regularized DESPOT To search a DESPOT for a near-optimal policy, B-DESPOT chooses a best action at every internal node of the DESPOT, according to the scenarios it encounters. [sent-160, score-0.481]
48 This, however, may cause overfitting: the chosen policy optimizes for the sampled scenarios, but does not perform well in general, as many scenarios are not sampled. [sent-161, score-0.567]
49 To reduce overfitting, our R-DESPOT algorithm leverages the idea of regularization, which balances the estimated performance of a policy under the sampled scenarios and the policy size. [sent-162, score-0.934]
50 If the subtree at a DESPOT node is too large, then the performance of a policy for this subtree may not be estimated reliably with K scenarios. [sent-163, score-0.544]
51 Instead of searching the subtree for a policy, R-DESPOT terminates the search and uses a simple default policy from this node onwards. [sent-164, score-0.631]
52 The first one provides an output-sensitive lower bound on the performance of any arbitrary policy derived from a DESPOT. [sent-166, score-0.429]
53 It implies that despite its sparsity, a DESPOT contains sufficient information for approximate policy evaluation, and the accuracy depends on the size of the policy. [sent-167, score-0.367]
54 The second result shows that by optimizing this bound, we can find a policy with small size and high value. [sent-168, score-0.367]
55 Formally, a policy tree derived from a DESPOT contains the same root as the DESPOT, but only one action branch at each internal node. [sent-171, score-0.757]
56 Let Πb0 ,D,K denote the class of all policy trees derived from DESPOTs that have height D and are constructed from K sampled scenarios for belief b0 . [sent-172, score-0.84]
57 Like a DESPOT, a policy tree π ∈ Πb0 ,D,K may not contain all observation branches. [sent-173, score-0.538]
58 If the execution of π encounters an observation branch not present in π, we simply follow the default policy from then on. [sent-174, score-0.554]
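Executing such a derived policy online therefore amounts to walking down the policy tree and falling back to the default policy as soon as an unsampled observation branch is reached. A sketch under assumed names (a policy-tree node with an `action` field and `children` keyed by observation; `env_step` is a hypothetical environment interface):

```python
def execute_policy_tree(root, default_policy, env_step, env_state, gamma, D):
    """Follow the policy tree; switch permanently to the default policy once an
    observation branch that is missing from the tree is encountered."""
    node, total_return, history = root, 0.0, ()
    for t in range(D):
        a = node.action if node is not None else default_policy(history)
        env_state, obs, reward = env_step(env_state, a)
        total_return += (gamma ** t) * reward
        history += ((a, obs),)
        if node is not None:
            node = node.children.get(obs)   # None when the branch was not sampled
    return total_return
```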
59 We now bound the error on the estimated value of a policy derived from a DESPOT. [sent-176, score-0.429]
60 The second term on the right hand side (RHS) of (2) captures the additive error in estimating the value of policy tree π, and depends on the size of π. [sent-178, score-0.493]
61 Now a natural idea is to search for a near-optimal policy π by maximizing the RHS of (2), which guarantees the performance of π by accounting for both the estimated performance and the size of π. [sent-184, score-0.421]
62 Theorem 2 Let π ∗ be an optimal policy at a belief b0 . [sent-185, score-0.573]
63 Let π be a policy derived from a DESPOT that has height D and is constructed from K randomly sampled scenarios for belief b0 . [sent-186, score-0.84]
64 Theorem 2 implies that if a small optimal policy tree π ∗ exists, then we can find a near-optimal policy with high probability by maximizing (3). [sent-188, score-0.86]
65 Note that π ∗ is a globally optimal policy at b0 . [sent-189, score-0.367]
66 Second, R-DESPOT performs bottom-up dynamic programming on T and derives a policy tree that maximizes (3). [sent-195, score-0.569]
67 In the recursive definition of the RWDU, the sum is taken over the scenarios φ ∈ Φb; ab is the chosen action of π at the node b, CHπ(b) is the set of child nodes of b in π, and sφ is the start state associated with the scenario φ. [sent-199, score-0.422]
68 We now describe the dynamic programming procedure that searches for an optimal policy in T . [sent-200, score-0.455]
69 For any belief node b in T, let ν∗(b) be the maximum RWDU of b under any policy tree π derived from T. [sent-201, score-0.828]
70 If b is a leaf node of T, ν∗(b) = (|Φb|/K) γ^∆(b) V̂π0(b) − λ, for some default policy π0. [sent-203, score-0.58]
71 The first maximization in (4) chooses between executing the default policy or expanding the subtree at b. [sent-205, score-0.509]
72 The value of an optimal policy for the DESPOT T rooted at the belief b0 is then ν ∗ (b0 ) and can be computed with bottom-up dynamic programming in time linear in the size of T . [sent-207, score-0.625]
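A sketch of that bottom-up computation is given below. It follows the leaf formula above and the described first maximization in (4), choosing at each node between running the default policy and expanding with the best action; since equation (4) itself is not reproduced on this page, the exact weighting should be treated as an assumption to verify against the paper. `default_value(node)` (an estimate of V̂π0 at the node) and `reward(state, action)` are assumed helpers, and nodes follow the layout of the construction sketch above.

```python
def nu_star(node, K, gamma, lam, default_value, reward, actions):
    """Bottom-up dynamic programming for the maximum regularized weighted
    discounted utility nu*(b); annotates each node with its chosen action
    (None means: run the default policy from this node onwards)."""
    weight = len(node.scenarios) / K
    # Option 1: stop here and execute the default policy (costs one policy node).
    stay = weight * (gamma ** node.depth) * default_value(node) - lam
    node.best_action = None
    if not node.children:
        return stay
    # Option 2: expand b with its best action (also costs one policy node).
    best_value, best_action = float("-inf"), None
    for a in actions:
        immediate = (gamma ** node.depth) / K * sum(
            reward(state, a) for state, _phi in node.scenarios)
        kids = [c for (act, _obs), c in node.children.items() if act == a]
        value = immediate + sum(
            nu_star(c, K, gamma, lam, default_value, reward, actions) for c in kids)
        if value > best_value:
            best_value, best_action = value, a
    if best_value - lam > stay:
        node.best_action = best_action
        return best_value - lam
    return stay
```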
73 5 Anytime Regularized DESPOT To further improve online planning performance for large-scale POMDPs, we introduce AR-DESPOT, an anytime approximation of R-DESPOT. [sent-208, score-0.355]
74 AR-DESPOT applies heuristic search and branch-and-bound pruning to uncover the more promising parts of a DESPOT and then searches the partially constructed DESPOT for a policy that maximizes the regularized utility in Theorem 2. [sent-209, score-0.594]
75 Initially, T contains only the root node with associated belief b0 and a set Φb0 of scenarios sampled according to b0. [sent-217, score-0.545]
76 For every belief node b in T, we maintain an upper bound U(b) and a lower bound L(b) on V̂π∗(b), which is the value of the optimal policy π∗ for b under the set of scenarios Φb. [sent-219, score-0.911]
77 (Fragment of the algorithm listings.) 4: Compute an optimal policy π∗ for T us[. . .]. RUNTRIAL(b, T): 1: if ∆(b) > D then 2: return b; 3: if b is a leaf node then 4: Expand b one level deeper, and insert all new nodes into T as children of b. [sent-225, score-0.59]
78 We then expand the leaf node b one level deeper by adding new belief nodes for every action and every observation as children of b. [sent-238, score-0.592]
79 In (5), Zb,a is the set of observations encountered when action a is taken at b under all scenarios in Φb. [sent-241, score-0.297]
80 The upper bound for a particular scenario φ ∈ Φb is the maximum value achieved by any arbitrary policy under φ. [sent-249, score-0.468]
81 Given φ, we have a deterministic planning problem and solve it by dynamic programming on a trellis of D time slices. [sent-250, score-0.296]
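In code, the per-scenario bound can be computed by a short backward recursion over the remaining depth, since fixing φ makes every transition deterministic; the node-level bound U(b) can then be taken as an average of these values over Φb. The sketch assumes hashable states, the same hypothetical `simulate` helper as above, and a known maximum one-step reward `r_max` for bounding everything beyond the horizon.

```python
import functools

def scenario_upper_bound(state0, phi, simulate, actions, gamma, D, r_max):
    """Upper bound for one scenario: with the randomness fixed by phi, planning is
    deterministic, so the best achievable discounted return over D steps can be
    found by dynamic programming over the remaining time slices."""

    @functools.lru_cache(maxsize=None)
    def best(state, depth):
        if depth == D:
            return r_max / (1.0 - gamma)     # bound on everything past the horizon
        values = []
        for a in actions:
            next_state, _obs, reward = simulate(state, a, phi, depth)
            values.append(reward + gamma * best(next_state, depth + 1))
        return max(values)

    return best(state0, 0)
```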
82 3 Initial Lower Bounds and Default Policies To construct the lower bound at a node b, we may simulate any policy for N steps under the scenarios in Φb and compute the average total discounted reward, all in O(|Φb |N ) time. [sent-256, score-0.702]
83 One possibility is to use a fixed-action policy for this purpose. [sent-257, score-0.367]
84 A better one is to handcraft a policy that chooses an action based on the history of actions and observations, a technique used in [18]. [sent-258, score-0.605]
85 We thus construct a policy using the belief b: π(b) = f (Λ(b)), where Λ(b) is the mode of the probability distribution b and f : S → A is a mapping that specifies the action at the state s ∈ S. [sent-260, score-0.765]
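A minimal sketch of this mode-based construction (assuming hashable states; `f` is the user-supplied state-to-action mapping mentioned above, e.g. a policy precomputed for the underlying MDP):

```python
import collections

def make_mode_policy(f):
    """Default policy pi(b) = f(mode(b)): act as if the most frequent state in the
    particle set representing b were the true state."""
    def policy(particles):
        mode_state, _count = collections.Counter(particles).most_common(1)[0]
        return f(mode_state)
    return policy
```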
86 The modified policy always tags when the agent is in the same position as the target, providing better performance. [sent-287, score-0.507]
87 For AR-DESPOT, we use a simple particle set default policy, which moves the agent towards the mode of the target in the particle set. [sent-288, score-0.391]
88 Theorem 1 suggests that AR-DESPOT may still perform well when the observation space is large, if a good small policy exists. [sent-292, score-0.412]
89 The behavior of the agent and opponent is identical to that in Tag, except that in LaserTag the agent knows its location before the game starts, whereas in Tag this happens only after the first observation is seen. [sent-295, score-0.325]
90 For AR-DESPOT, we use a default policy derived from the particle set as follows: a new state is created with the positions of the robot and the rocks unchanged, and each rock is labeled as good or bad depending on whichever condition is more prevalent in the particle set. [sent-382, score-0.741]
91 The optimal policy for the resulting state is used as the default policy. [sent-383, score-0.475]
92 The optimal policy for all states is computed before the algorithm begins, using dynamic programming with the same horizon length as the maximum depth of the search tree. [sent-384, score-0.473]
93 As in [18], we use a particle filter to represent the belief to examine the behavior of the algorithms in very large state spaces. [sent-387, score-0.316]
94 For the lower bound, we use a history-based policy that chases a random ghost, if visible, when pocman is under the effect of a powerpill, and avoids ghosts and doubling-back when it is not. [sent-396, score-0.474]
95 Our R-DESPOT algorithm and its anytime approximation, AR-DESPOT, search a DESPOT for an approximately optimal policy, while balancing the size of the policy and the accuracy of its value estimate. [sent-399, score-0.495]
96 SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. [sent-479, score-0.413]
97 On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. [sent-486, score-0.296]
98 Approximate planning for factored POMDPs using belief state simplification. [sent-493, score-0.449]
99 PEGASUS: A policy search method for large MDPs and POMDPs. [sent-500, score-0.421]
100 AEMS: An anytime online search algorithm for approximate policy refinement in large POMDPs. [sent-529, score-0.569]
wordName wordTfidf (topN-words)
[('despot', 0.635), ('policy', 0.367), ('planning', 0.207), ('pomcp', 0.207), ('belief', 0.206), ('pomdp', 0.202), ('action', 0.156), ('scenarios', 0.141), ('agent', 0.14), ('pomdps', 0.134), ('tree', 0.126), ('node', 0.099), ('pocman', 0.083), ('tag', 0.08), ('branches', 0.08), ('online', 0.074), ('anytime', 0.074), ('particle', 0.074), ('default', 0.072), ('sarsop', 0.069), ('policies', 0.065), ('discounted', 0.063), ('ine', 0.061), ('sampled', 0.059), ('observable', 0.057), ('determinized', 0.055), ('lasertag', 0.055), ('search', 0.054), ('reward', 0.047), ('rmax', 0.046), ('observation', 0.045), ('nodes', 0.044), ('ch', 0.043), ('leaf', 0.042), ('rial', 0.041), ('weu', 0.041), ('root', 0.04), ('laser', 0.04), ('subtree', 0.039), ('insert', 0.038), ('branch', 0.038), ('height', 0.037), ('trellis', 0.037), ('state', 0.036), ('searches', 0.036), ('executes', 0.036), ('scenario', 0.035), ('upper', 0.034), ('pellet', 0.034), ('ghost', 0.034), ('regularized', 0.033), ('bt', 0.033), ('bound', 0.032), ('execution', 0.032), ('partially', 0.032), ('rock', 0.032), ('lookahead', 0.032), ('chooses', 0.031), ('moves', 0.031), ('derived', 0.03), ('excess', 0.03), ('uncertainty', 0.029), ('backup', 0.029), ('rs', 0.029), ('dynamic', 0.028), ('ab', 0.028), ('curse', 0.028), ('ardespot', 0.028), ('despots', 0.028), ('espot', 0.028), ('handcraft', 0.028), ('rocks', 0.028), ('rocksample', 0.028), ('rwdu', 0.028), ('simulative', 0.028), ('uild', 0.028), ('robot', 0.028), ('rhs', 0.027), ('kd', 0.026), ('utility', 0.025), ('particles', 0.025), ('curses', 0.024), ('ghosts', 0.024), ('child', 0.024), ('maximizes', 0.024), ('programming', 0.024), ('ln', 0.024), ('actions', 0.023), ('heuristic', 0.023), ('givan', 0.022), ('interleaves', 0.022), ('intelligence', 0.022), ('trajectory', 0.022), ('distance', 0.022), ('initial', 0.022), ('arti', 0.022), ('uct', 0.021), ('eating', 0.021), ('path', 0.021), ('cial', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
Author: Adhiraj Somani, Nan Ye, David Hsu, Wee Sun Lee
Abstract: POMDPs provide a principled framework for planning under uncertainty, but are computationally intractable, due to the “curse of dimensionality” and the “curse of history”. This paper presents an online POMDP algorithm that alleviates these difficulties by focusing the search on a set of randomly sampled scenarios. A Determinized Sparse Partially Observable Tree (DESPOT) compactly captures the execution of all policies on these scenarios. Our Regularized DESPOT (R-DESPOT) algorithm searches the DESPOT for a policy, while optimally balancing the size of the policy and its estimated value obtained under the sampled scenarios. We give an output-sensitive performance bound for all policies derived from a DESPOT, and show that R-DESPOT works well if a small optimal policy exists. We also give an anytime algorithm that approximates R-DESPOT. Experiments show strong results, compared with two of the fastest online POMDP algorithms. Source code along with experimental settings are available at http://bigbird.comp. nus.edu.sg/pmwiki/farm/appl/. 1
2 0.31039557 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
Author: Paul Wagner
Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1
3 0.29305822 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta
Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1
4 0.27494735 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs
Author: Liam C. MacDermed, Charles Isbell
Abstract: We present four major results towards solving decentralized partially observable Markov decision problems (DecPOMDPs) culminating in an algorithm that outperforms all existing algorithms on all but one standard infinite-horizon benchmark problems. (1) We give an integer program that solves collaborative Bayesian games (CBGs). The program is notable because its linear relaxation is very often integral. (2) We show that a DecPOMDP with bounded belief can be converted to a POMDP (albeit with actions exponential in the number of beliefs). These actions correspond to strategies of a CBG. (3) We present a method to transform any DecPOMDP into a DecPOMDP with bounded beliefs (the number of beliefs is a free parameter) using optimal (not lossless) belief compression. (4) We show that the combination of these results opens the door for new classes of DecPOMDP algorithms based on previous POMDP algorithms. We choose one such algorithm, point-based valued iteration, and modify it to produce the first tractable value iteration method for DecPOMDPs that outperforms existing algorithms. 1
5 0.23839213 348 nips-2013-Variational Policy Search via Trajectory Optimization
Author: Sergey Levine, Vladlen Koltun
Abstract: In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and highdimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algorithm can outperform prior methods on two challenging locomotion tasks. 1
6 0.22566116 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
7 0.21031451 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
8 0.20139745 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
9 0.19974381 241 nips-2013-Optimizing Instructional Policies
10 0.1768343 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
11 0.17449042 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
12 0.1708028 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
13 0.16911621 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
14 0.16795233 257 nips-2013-Projected Natural Actor-Critic
15 0.15611775 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
16 0.15256977 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
17 0.13834633 347 nips-2013-Variational Planning for Graph-based MDPs
18 0.13578795 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
19 0.12381994 165 nips-2013-Learning from Limited Demonstrations
20 0.12363046 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
topicId topicWeight
[(0, 0.212), (1, -0.366), (2, -0.202), (3, 0.158), (4, 0.027), (5, 0.02), (6, -0.015), (7, 0.023), (8, -0.018), (9, 0.073), (10, 0.024), (11, -0.028), (12, 0.118), (13, 0.019), (14, -0.022), (15, 0.069), (16, 0.028), (17, 0.062), (18, 0.054), (19, -0.064), (20, -0.023), (21, 0.038), (22, -0.03), (23, 0.001), (24, 0.01), (25, 0.005), (26, 0.006), (27, 0.027), (28, 0.07), (29, -0.001), (30, 0.029), (31, -0.065), (32, 0.009), (33, -0.078), (34, 0.117), (35, 0.032), (36, 0.096), (37, 0.002), (38, 0.02), (39, -0.074), (40, 0.001), (41, 0.022), (42, 0.028), (43, -0.022), (44, -0.028), (45, -0.02), (46, 0.001), (47, -0.028), (48, -0.001), (49, 0.078)]
simIndex simValue paperId paperTitle
same-paper 1 0.97479683 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
Author: Adhiraj Somani, Nan Ye, David Hsu, Wee Sun Lee
Abstract: POMDPs provide a principled framework for planning under uncertainty, but are computationally intractable, due to the “curse of dimensionality” and the “curse of history”. This paper presents an online POMDP algorithm that alleviates these difficulties by focusing the search on a set of randomly sampled scenarios. A Determinized Sparse Partially Observable Tree (DESPOT) compactly captures the execution of all policies on these scenarios. Our Regularized DESPOT (R-DESPOT) algorithm searches the DESPOT for a policy, while optimally balancing the size of the policy and its estimated value obtained under the sampled scenarios. We give an output-sensitive performance bound for all policies derived from a DESPOT, and show that R-DESPOT works well if a small optimal policy exists. We also give an anytime algorithm that approximates R-DESPOT. Experiments show strong results, compared with two of the fastest online POMDP algorithms. Source code along with experimental settings are available at http://bigbird.comp. nus.edu.sg/pmwiki/farm/appl/. 1
2 0.82343638 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
Author: Aswin Raghavan, Roni Khardon, Alan Fern, Prasad Tadepalli
Abstract: This paper addresses the scalability of symbolic planning under uncertainty with factored states and actions. Our first contribution is a symbolic implementation of Modified Policy Iteration (MPI) for factored actions that views policy evaluation as policy-constrained value iteration (VI). Unfortunately, a na¨ve approach ı to enforce policy constraints can lead to large memory requirements, sometimes making symbolic MPI worse than VI. We address this through our second and main contribution, symbolic Opportunistic Policy Iteration (OPI), which is a novel convergent algorithm lying between VI and MPI, that applies policy constraints if it does not increase the size of the value function representation, and otherwise performs VI backups. We also give a memory bounded version of this algorithm allowing a space-time tradeoff. Empirical results show significantly improved scalability over state-of-the-art symbolic planners. 1
3 0.75240803 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs
Author: Liam C. MacDermed, Charles Isbell
Abstract: We present four major results towards solving decentralized partially observable Markov decision problems (DecPOMDPs) culminating in an algorithm that outperforms all existing algorithms on all but one standard infinite-horizon benchmark problems. (1) We give an integer program that solves collaborative Bayesian games (CBGs). The program is notable because its linear relaxation is very often integral. (2) We show that a DecPOMDP with bounded belief can be converted to a POMDP (albeit with actions exponential in the number of beliefs). These actions correspond to strategies of a CBG. (3) We present a method to transform any DecPOMDP into a DecPOMDP with bounded beliefs (the number of beliefs is a free parameter) using optimal (not lossless) belief compression. (4) We show that the combination of these results opens the door for new classes of DecPOMDP algorithms based on previous POMDP algorithms. We choose one such algorithm, point-based valued iteration, and modify it to produce the first tractable value iteration method for DecPOMDPs that outperforms existing algorithms. 1
4 0.74764955 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta
Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1
5 0.73802078 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris
Author: Victor Gabillon, Mohammad Ghavamzadeh, Bruno Scherrer
Abstract: Tetris is a video game that has been widely used as a benchmark for various optimization techniques including approximate dynamic programming (ADP) algorithms. A look at the literature of this game shows that while ADP algorithms that have been (almost) entirely based on approximating the value function (value function based) have performed poorly in Tetris, the methods that search directly in the space of policies by learning the policy parameters using an optimization black box, such as the cross entropy (CE) method, have achieved the best reported results. This makes us conjecture that Tetris is a game in which good policies are easier to represent, and thus, learn than their corresponding value functions. So, in order to obtain a good performance with ADP, we should use ADP algorithms that search in a policy space, instead of the more traditional ones that search in a value function space. In this paper, we put our conjecture to test by applying such an ADP algorithm, called classification-based modified policy iteration (CBMPI), to the game of Tetris. Our experimental results show that for the first time an ADP algorithm, namely CBMPI, obtains the best results reported in the literature for Tetris in both small 10 × 10 and large 10 × 20 boards. Although the CBMPI’s results are similar to those of the CE method in the large board, CBMPI uses considerably fewer (almost 1/6) samples (calls to the generative model) than CE. 1
6 0.73530149 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
7 0.72573471 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
8 0.72353387 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
9 0.68556279 241 nips-2013-Optimizing Instructional Policies
10 0.66492575 165 nips-2013-Learning from Limited Demonstrations
11 0.65407425 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
12 0.64483446 347 nips-2013-Variational Planning for Graph-based MDPs
13 0.62467998 257 nips-2013-Projected Natural Actor-Critic
14 0.61859387 348 nips-2013-Variational Policy Search via Trajectory Optimization
15 0.59104156 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
16 0.57196939 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
17 0.56113249 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
18 0.54011643 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
19 0.53926295 71 nips-2013-Convergence of Monte Carlo Tree Search in Simultaneous Move Games
20 0.53784269 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
topicId topicWeight
[(2, 0.049), (16, 0.035), (33, 0.065), (34, 0.138), (36, 0.012), (41, 0.038), (49, 0.049), (56, 0.128), (70, 0.032), (85, 0.065), (88, 0.226), (89, 0.023), (93, 0.039), (95, 0.014), (99, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.78669155 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
Author: Adhiraj Somani, Nan Ye, David Hsu, Wee Sun Lee
Abstract: POMDPs provide a principled framework for planning under uncertainty, but are computationally intractable, due to the “curse of dimensionality” and the “curse of history”. This paper presents an online POMDP algorithm that alleviates these difficulties by focusing the search on a set of randomly sampled scenarios. A Determinized Sparse Partially Observable Tree (DESPOT) compactly captures the execution of all policies on these scenarios. Our Regularized DESPOT (R-DESPOT) algorithm searches the DESPOT for a policy, while optimally balancing the size of the policy and its estimated value obtained under the sampled scenarios. We give an output-sensitive performance bound for all policies derived from a DESPOT, and show that R-DESPOT works well if a small optimal policy exists. We also give an anytime algorithm that approximates R-DESPOT. Experiments show strong results, compared with two of the fastest online POMDP algorithms. Source code along with experimental settings are available at http://bigbird.comp. nus.edu.sg/pmwiki/farm/appl/. 1
2 0.73786259 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
Author: Aijun Bai, Feng Wu, Xiaoping Chen
Abstract: Monte-Carlo tree search (MCTS) has been drawing great interest in recent years for planning and learning under uncertainty. One of the key challenges is the trade-off between exploration and exploitation. To address this, we present a novel approach for MCTS using Bayesian mixture modeling and inference based Thompson sampling and apply it to the problem of online planning in MDPs. Our algorithm, named Dirichlet-NormalGamma MCTS (DNG-MCTS), models the uncertainty of the accumulated reward for actions in the search tree as a mixture of Normal distributions. We perform inferences on the mixture in Bayesian settings by choosing conjugate priors in the form of combinations of Dirichlet and NormalGamma distributions and select the best action at each decision node using Thompson sampling. Experimental results confirm that our algorithm advances the state-of-the-art UCT approach with better values on several benchmark problems. 1
3 0.701617 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse
Author: John Duchi, Michael Jordan, Brendan McMahan
Abstract: We study stochastic optimization problems when the data is sparse, which is in a sense dual to current perspectives on high-dimensional statistical learning and optimization. We highlight both the difficulties—in terms of increased sample complexity that sparse data necessitates—and the potential benefits, in terms of allowing parallelism and asynchrony in the design of algorithms. Concretely, we derive matching upper and lower bounds on the minimax rate for optimization and learning with sparse data, and we exhibit algorithms achieving these rates. We also show how leveraging sparsity leads to (still minimax optimal) parallel and asynchronous algorithms, providing experimental evidence complementing our theoretical results on several medium to large-scale learning tasks. 1 Introduction and problem setting In this paper, we investigate stochastic optimization problems in which the data is sparse. Formally, let {F (·; ξ), ξ ∈ Ξ} be a collection of real-valued convex functions, each of whose domains contains the convex set X ⊂ Rd . For a probability distribution P on Ξ, we consider the following optimization problem: minimize f (x) := E[F (x; ξ)] = x∈X F (x; ξ)dP (ξ). (1) Ξ By data sparsity, we mean the samples ξ are sparse: assuming that samples ξ lie in Rd , and defining the support supp(x) of a vector x to the set of indices of its non-zero components, we assume supp F (x; ξ) ⊂ supp ξ. (2) The sparsity condition (2) means that F (x; ξ) does not “depend” on the values of xj for indices j such that ξj = 0.1 This type of data sparsity is prevalent in statistical optimization problems and machine learning applications; in spite of its prevalence, study of such problems has been limited. As a motivating example, consider a text classification problem: data ξ ∈ Rd represents words appearing in a document, and we wish to minimize a logistic loss F (x; ξ) = log(1 + exp( ξ, x )) on the data (we encode the label implicitly with the sign of ξ). Such generalized linear models satisfy the sparsity condition (2), and while instances are of very high dimension, in any given instance, very few entries of ξ are non-zero [8]. From a modelling perspective, it thus makes sense to allow a dense predictor x: any non-zero entry of ξ is potentially relevant and important. In a sense, this is dual to the standard approaches to high-dimensional problems; one usually assumes that the data ξ may be dense, but there are only a few relevant features, and thus a parsimonious model x is desirous [2]. So 1 Formally, if πξ denotes the coordinate projection zeroing all indices j of its argument where ξj = 0, then F (πξ (x); ξ) = F (x; ξ) for all x, ξ. This follows from the first-order conditions for convexity [6]. 1 while such sparse data problems are prevalent—natural language processing, information retrieval, and other large data settings all have significant data sparsity—they do not appear to have attracted as much study as their high-dimensional “duals” of dense data and sparse predictors. In this paper, we investigate algorithms and their inherent limitations for solving problem (1) under natural conditions on the data generating distribution. Recent work in the optimization and machine learning communities has shown that data sparsity can be leveraged to develop parallel (and even asynchronous [12]) optimization algorithms [13, 14], but this work does not consider the statistical effects of data sparsity. In another line of research, Duchi et al. 
[4] and McMahan and Streeter [9] develop “adaptive” stochastic gradient algorithms to address problems in sparse data regimes (2). These algorithms exhibit excellent practical performance and have theoretical guarantees on their convergence, but it is not clear if they are optimal—in that no algorithm can attain better statistical performance—or whether they can leverage parallel computing as in the papers [12, 14]. In this paper, we take a two-pronged approach. First, we investigate the fundamental limits of optimization and learning algorithms in sparse data regimes. In doing so, we derive lower bounds on the optimization error of any algorithm for problems of the form (1) with sparsity condition (2). These results have two main implications. They show that in some scenarios, learning with sparse data is quite difficult, as essentially each coordinate j ∈ [d] can be relevant and must be optimized for. In spite of this seemingly negative result, we are also able to show that the A DAG RAD algorithms of [4, 9] are optimal, and we show examples in which their dependence on the dimension d can be made exponentially better than standard gradient methods. As the second facet of our two-pronged approach, we study how sparsity may be leveraged in parallel computing frameworks to give substantially faster algorithms that still achieve optimal sample complexity in terms of the number of samples ξ used. We develop two new algorithms, asynchronous dual averaging (A SYNC DA) and asynchronous A DAG RAD (A SYNC A DAG RAD), which allow asynchronous parallel solution of the problem (1) for general convex f and X . Combining insights of Niu et al.’s H OGWILD ! [12] with a new analysis, we prove our algorithms achieve linear speedup in the number of processors while maintaining optimal statistical guarantees. We also give experiments on text-classification and web-advertising tasks to illustrate the benefits of the new algorithms. 2 Minimax rates for sparse optimization We begin our study of sparse optimization problems by establishing their fundamental statistical and optimization-theoretic properties. To do this, we derive bounds on the minimax convergence rate of any algorithm for such problems. Formally, let x denote any estimator for a minimizer of the objective (1). We define the optimality gap N for the estimator x based on N samples ξ 1 , . . . , ξ N from the distribution P as N (x, F, X , P ) := f (x) − inf f (x) = EP [F (x; ξ)] − inf EP [F (x; ξ)] . x∈X x∈X This quantity is a random variable, since x is a random variable (it is a function of ξ 1 , . . . , ξ N ). To define the minimax error, we thus take expectations of the quantity N , though we require a bit more than simply E[ N ]. We let P denote a collection of probability distributions, and we consider a collection of loss functions F specified by a collection F of convex losses F : X × ξ → R. We can then define the minimax error for the family of losses F and distributions P as ∗ N (X , P, F) := inf sup sup EP [ x P ∈P F ∈F N (x(ξ 1:N ), F, X , P )], (3) where the infimum is taken over all possible estimators x (an estimator is an optimization scheme, or a measurable mapping x : ΞN → X ) . 2.1 Minimax lower bounds Let us now give a more precise characterization of the (natural) set of sparse optimization problems we consider to provide the lower bound. For the next proposition, we let P consist of distributions supported on Ξ = {−1, 0, 1}d , and we let pj := P (ξj = 0) be the marginal probability of appearance of feature j ∈ {1, . . . , d}. 
For our class of functions, we set F to consist of functions F satisfying the sparsity condition (2) and with the additional constraint that for g ∈ ∂x F (x; ξ), we have that the jth coordinate |gj | ≤ Mj for a constant Mj < ∞. We obtain 2 Proposition 1. Let the conditions of the preceding paragraph hold. Let R be a constant such that X ⊃ [−R, R]d . Then √ d pj 1 ∗ . Mj min pj , √ N (X , P, F) ≥ R 8 j=1 N log 3 We provide the proof of Proposition 1 in the supplement A.1 in the full version of the paper, providing a few remarks here. We begin by giving a corollary to Proposition 1 that follows when the data ξ obeys a type of power law: let p0 ∈ [0, 1], and assume that P (ξj = 0) = p0 j −α . We have Corollary 2. Let α ≥ 0. Let the conditions of Proposition 1 hold with Mj ≡ M for all j, and assume the power law condition P (ξj = 0) = p0 j −α on coordinate appearance probabilities. Then (1) If d > (p0 N )1/α , ∗ N (X , P, F) ≥ 2−α 1−α p0 p0 (p0 N ) 2α − 1 + d1−α − (p0 N ) α N 1−α 2 MR 8 2−α (2) If d ≤ (p0 N )1/α , ∗ N (X , P, F) ≥ MR 8 p0 N α 1 1 d1− 2 − 1 − α/2 1 − α/2 . . Expanding Corollary 2 slightly, for simplicity assume the number of samples is large enough that d ≤ (p0 N )1/α . Then we find that the lower bound on optimization error is of order p0 1− α p0 p0 d 2 when α < 2, M R log d when α → 2, and M R when α > 2. (4) N N N These results beg the question of tightness: are they improvable? As we see presently, they are not. MR 2.2 Algorithms for attaining the minimax rate To show that the lower bounds of Proposition 1 and its subsequent specializations are sharp, we review a few stochastic gradient algorithms. We begin with stochastic gradient descent (SGD): SGD repeatedly samples ξ ∼ P , computes g ∈ ∂x F (x; ξ), then performs the update x ← ΠX (x − ηg), where η is a stepsize parameter and ΠX denotes Euclidean projection onto X . Standard analyses of stochastic gradient descent [10] show that after N samples ξ i , the SGD estimator x(N ) satisfies R2 M ( d j=1 1 pj ) 2 √ , (5) N where R2 denotes the 2 -radius of X . Dual averaging, due to Nesterov [11] (sometimes called “follow the regularized leader” [5]) is a more recent algorithm. In dual averaging, one again samples g ∈ ∂x F (x; ξ), but instead of updating the parameter vector x one updates a dual vector z by z ← z + g, then computes 1 x ← argmin z, x + ψ(x) , η x∈X E[f (x(N ))] − inf f (x) ≤ O(1) x∈X 2 1 where ψ(x) is a strongly convex function defined over X (often one takes ψ(x) = 2 x 2 ). As we discuss presently, the dual averaging algorithm is somewhat more natural in asynchronous and parallel computing environments, and it enjoys the same type of convergence guarantees (5) as SGD. The A DAG RAD algorithm [4, 9] is an extension of the preceding stochastic gradient methods. It maintains a diagonal matrix S, where upon receiving a new sample ξ, A DAG RAD performs the following: it computes g ∈ ∂x F (x; ξ), then updates 2 Sj ← Sj + gj for j ∈ [d]. The dual averaging variant of A DAG RAD updates the usual dual vector z ← z + g; the update to x is based on S and a stepsize η and computes x ← argmin z, x + x ∈X 3 1 1 x ,S2x 2η . After N samples ξ, the averaged parameter x(N ) returned by A DAG RAD satisfies R∞ M E[f (x(N ))] − inf f (x) ≤ O(1) √ x∈X N d √ pj , (6) j=1 where R∞ denotes the ∞ -radius of X (cf. [4, Section 1.3 and Theorem 5]). By inspection, the A DAG RAD rate (6) matches the lower bound in Proposition 1 and is thus optimal. 
It is interesting to note, though, that in the power law setting of Corollary 2 (recall the error order (4)), a calculation √ shows that the multiplier for the SGD guarantee (5) becomes R∞ d max{d(1−α)/2 , 1}, while A DA G RAD attains rate at worst R∞ max{d1−α/2 , log d}. For α > 1, the A DAG RAD rate is no worse, √ and for α ≥ 2, is more than d/ log d better—an exponential improvement in the dimension. 3 Parallel and asynchronous optimization with sparsity As we note in the introduction, recent works [12, 14] have suggested that sparsity can yield benefits in our ability to parallelize stochastic gradient-type algorithms. Given the optimality of A DAG RADtype algorithms, it is natural to focus on their parallelization in the hope that we can leverage their ability to “adapt” to sparsity in the data. To provide the setting for our further algorithms, we first revisit Niu et al.’s H OGWILD ! [12]. H OGWILD ! is an asynchronous (parallelized) stochastic gradient algorithm for optimization over product-space domains, meaning that X in problem (1) decomposes as X = X1 × · · · × Xd , where Xj ⊂ R. Fix a stepsize η > 0. A pool of independently running processors then performs the following updates asynchronously to a centralized vector x: 1. Sample ξ ∼ P 2. Read x and compute g ∈ ∂x F (x; ξ) 3. For each j s.t. gj = 0, update xj ← ΠXj (xj − ηgj ). Here ΠXj denotes projection onto the jth coordinate of the domain X . The key of H OGWILD ! is that in step 2, the parameter x is allowed to be inconsistent—it may have received partial gradient updates from many processors—and for appropriate problems, this inconsistency is negligible. Indeed, Niu et al. [12] show linear speedup in optimization time as the number of processors grow; they show this empirically in many scenarios, providing a proof under the somewhat restrictive assumptions that there is at most one non-zero entry in any gradient g and that f has Lipschitz gradients. 3.1 Asynchronous dual averaging A weakness of H OGWILD ! is that it appears only applicable to problems for which the domain X is a product space, and its analysis assumes g 0 = 1 for all gradients g. In effort to alleviate these difficulties, we now develop and present our asynchronous dual averaging algorithm, A SYNC DA. A SYNC DA maintains and upates a centralized dual vector z instead of a parameter x, and a pool of processors perform asynchronous updates to z, where each processor independently iterates: 1. Read z and compute x := argminx∈X 1 z, x + η ψ(x) // Implicitly increment “time” counter t and let x(t) = x 2. Sample ξ ∼ P and let g ∈ ∂x F (x; ξ) // Let g(t) = g. 3. For j ∈ [d] such that gj = 0, update zj ← zj + gj . Because the actual computation of the vector x in A SYNC DA is performed locally on each processor in step 1 of the algorithm, the algorithm can be executed with any proximal function ψ and domain X . The only communication point between any of the processors is the addition operation in step 3. Since addition is commutative and associative, forcing all asynchrony to this point of the algorithm is a natural strategy for avoiding synchronization problems. In our analysis of A SYNC DA, and in our subsequent analysis of the adaptive methods, we require a measurement of time elapsed. With that in mind, we let t denote a time index that exists (roughly) behind-the-scenes. We let x(t) denote the vector x ∈ X computed in the tth step 1 of the A SYNC DA 4 algorithm, that is, whichever is the tth x actually computed by any of the processors. 
This quantity t exists and is recoverable from the algorithm, and it is possible to track the running sum τ =1 x(τ ). Additionally, we state two assumptions encapsulating the conditions underlying our analysis. Assumption A. There is an upper bound m on the delay of any processor. In addition, for each j ∈ [d] there is a constant pj ∈ [0, 1] such that P (ξj = 0) ≤ pj . We also require certain continuity (Lipschitzian) properties of the loss functions; these amount to a second moment constraint on the instantaneous ∂F and a rough measure of gradient sparsity. Assumption B. There exist constants M and (Mj )d such that the following bounds hold for all j=1 2 x ∈ X : E[ ∂x F (x; ξ) 2 ] ≤ M2 and for each j ∈ [d] we have E[|∂xj F (x; ξ)|] ≤ pj Mj . With these definitions, we have the following theorem, which captures the convergence behavior of A SYNC DA under the assumption that X is a Cartesian product, meaning that X = X1 × · · · × Xd , 2 where Xj ⊂ R, and that ψ(x) = 1 x 2 . Note the algorithm itself can still be efficiently parallelized 2 for more general convex X , even if the theorem does not apply. Theorem 3. Let Assumptions A and B and the conditions in the preceding paragraph hold. Then T E t=1 F (x(t); ξ t ) − F (x∗ ; ξ t ) ≤ 1 x∗ 2η d 2 2 η 2 p2 Mj . + T M2 + ηT m j 2 j=1 We now provide a few remarks to explain and simplify the result. Under the more stringent condition 2 d 2 that |∂xj F (x; ξ)| ≤ Mj , Assumption A implies E[ ∂x F (x; ξ) 2 ] ≤ j=1 pj Mj . Thus, for the d 2 remainder of this section we take M2 = j=1 pj Mj , which upper bounds the Lipschitz continuity constant of the objective function f . We then obtain the following corollary. √ T 1 Corollary 4. Define x(T ) = T t=1 x(t), and set η = x∗ 2 /M T . Then E[f (x(T )) − f (x∗ )] ≤ M x∗ √ T 2 +m x∗ 2 √ 2M T d 2 p2 M j . j j=1 Corollary 4 is nearly immediate: since ξ t is independent of x(t), we have E[F (x(t); ξ t ) | x(t)] = f (x(t)); applying Jensen’s inequality to f (x(T )) and performing an algebraic manipulation give the result. If the data is suitably sparse, meaning that pj ≤ 1/m, the bound in Corollary 4 simplifies to 3 M x∗ √ E[f (x(T )) − f (x )] ≤ 2 T ∗ 2 3 = 2 d j=1 2 pj M j x ∗ √ T 2 , (7) which is the convergence rate of stochastic gradient descent even in centralized settings (5). The √ convergence guarantee (7) shows that after T timesteps, the error scales as 1/ T ; however, if we have k processors, updates occur roughly k times as quickly, as they are asynchronous, and in time scaling as N/k, we can evaluate N gradient samples: a linear speedup. 3.2 Asynchronous AdaGrad We now turn to extending A DAG RAD to asynchronous settings, developing A SYNC A DAG RAD (asynchronous A DAG RAD). As in the A SYNC DA algorithm, A SYNC A DAG RAD maintains a shared dual vector z (the sum of gradients) and the shared matrix S, which is the diagonal sum of squares of gradient entries (recall Section 2.2). The matrix S is initialized as diag(δ 2 ), where δj ≥ 0 is an initial value. Each processor asynchronously performs the following iterations: 1 1 1. Read S and z and set G = S 2 . Compute x := argminx∈X { z, x + 2η x, Gx } increment “time” counter t and let x(t) = x, S(t) = S 2. Sample ξ ∼ P and let g ∈ ∂F (x; ξ) 2 3. For j ∈ [d] such that gj = 0, update Sj ← Sj + gj and zj ← zj + gj . 5 // Implicitly As in the description of A SYNC DA, we note that x(t) is the vector x ∈ X computed in the tth “step” of the algorithm (step 1), and similarly associate ξ t with x(t). 
To analyze A SYNC A DAG RAD, we make a somewhat stronger assumption on the sparsity properties of the losses F than Assumption B. 2 Assumption C. There exist constants (Mj )d such that E[(∂xj F (x; ξ))2 | ξj = 0] ≤ Mj for all j=1 x ∈ X. 2 Indeed, taking M2 = j pj Mj shows that Assumption C implies Assumption B with specific constants. We then have the following convergence result. Theorem 5. In addition to the conditions of Theorem 3, let Assumption C hold. Assume that for all 2 j we have δ 2 ≥ Mj m and X ⊂ [−R∞ , R∞ ]d . Then T t=1 E F (x(t); ξ t ) − F (x∗ ; ξ t ) d ≤ min j=1 T 1 2 R E η ∞ 2 δ + gj (t) 2 1 2 T + ηE gj (t) t=1 2 1 2 (1 + pj m), Mj R∞ pj T . t=1 It is possible to relax the condition on the initial constant diagonal term; we defer this to the full version of the paper. It is natural to ask in which situations the bound provided by Theorem 5 is optimal. We note that, as in the case with Theorem 3, we may obtain a convergence rate for f (x(T )) − f (x∗ ) using convexity, T 1 where x(T ) = T t=1 x(t). By Jensen’s inequality, we have for any δ that T E 2 δ + gj (t) 2 1 2 t=1 T ≤ 2 2 E[gj (t) ] δ + t=1 1 2 ≤ 2 δ 2 + T pj Mj . For interpretation, let us now make a few assumptions on the probabilities pj . If we assume that pj ≤ c/m for a universal (numerical) constant c, then Theorem 5 guarantees that d log(T )/T + pj √ (8) , pj , T j=1 √ which is the convergence rate of A DAG RAD except for a small factor of min{ log T /T, pj } in addition to the usual pj /T rate. In particular, optimizing by choosing η = R∞ , and assuming 1 pj T log T , we have convergence guarantee √ d pj E[f (x(T )) − f (x∗ )] ≤ O(1)R∞ Mj min √ , pj , T j=1 E[f (x(T )) − f (x∗ )] ≤ O(1) 1 2 R +η η ∞ Mj min which is minimax optimal by Proposition 1. In fact, however, the bounds of Theorem 5 are somewhat stronger: they provide bounds using the expectation of the squared gradients gj (t) rather than the maximal value Mj , though the bounds are perhaps clearer in the form (8). We note also that our analysis applies to more adversarial settings than stochastic optimization (e.g., to online convex optimization [5]). Specifically, an adversary may choose an arbitrary sequence of functions subject to the random data sparsity constraint (2), and our results provide an expected regret bound, which is strictly stronger than the stochastic convergence guarantees provided (and guarantees high-probability convergence in stochastic settings [3]). Moreover, our comments in Section 2 about the relative optimality of A DAG RAD versus standard gradient methods apply. When the data is sparse, we indeed should use asynchronous algorithms, but using adaptive methods yields even more improvement than simple gradient-based methods. 4 Experiments In this section, we give experimental validation of our theoretical results on A SYNC A DAG RAD and A SYNC DA, giving results on two datasets selected for their high-dimensional sparsity.2 2 In our experiments, A SYNC DA and H OGWILD ! had effectively identical performance. 6 8 0.07 6 5 4 0.024 Test error Training loss Speedup 0.025 0.065 7 0.06 0.055 0.05 0.045 0.04 0.023 0.022 0.021 0.02 0.035 3 0.019 2 1 2 4 0.03 A-A DAG RAD A SYNC DA Number of workers 6 8 10 12 14 0.018 0.025 0.02 16 2 4 6 8 10 12 14 Number of workers 0.017 16 2 4 6 8 10 12 14 Number of workers 16 Figure 1. Experiments with URL data. Left: speedup relative to one processor. Middle: training dataset loss versus number of processors. Right: test set error rate versus number of processors. 
In the figure legends, A-AdaGrad abbreviates AsyncAdaGrad and A-DA abbreviates AsyncDA.

Figure 2. Relative accuracy (relative log-loss versus number of passes) for various stepsize choices on a click-through-rate prediction dataset. Left: fixed stepsizes, training data, L2 = 0. Middle: fixed stepsizes, test data, L2 = 0. Right: impact of L2 regularization on test error (A-AdaGrad and A-DA, each with L2 = 0 and L2 = 80). Legend stepsizes: A-AdaGrad η ∈ {0.002, 0.004, 0.008, 0.016}; A-DA η ∈ {0.8, 1.6, 3.2}.

4.1 Malicious URL detection

For our first set of experiments, we consider the speedup attainable by applying AsyncAdaGrad and AsyncDA, investigating the performance of each algorithm on a malicious-URL prediction task [7]. The dataset consists of an anonymized collection of URLs labeled as malicious (e.g., spam, phishing) or benign over a span of 120 days. The data consists of $2.4 \cdot 10^6$ examples with $d = 3.2 \cdot 10^6$ (sparse) features. We perform several experiments, randomly dividing the dataset into $1.2 \cdot 10^6$ training and test samples for each experiment.

In Figure 1 we compare the performance of AsyncAdaGrad and AsyncDA after a single pass through the training dataset. (For each algorithm, we choose the stepsize η for optimal training-set performance.) We perform the experiments on a single machine running Ubuntu Linux with six cores (with two-way hyperthreading) and 32 GB of RAM. From the left-most plot in Figure 1, we see that up to six processors, both AsyncDA and AsyncAdaGrad enjoy the expected linear speedup, and from 6 to 12 processors they continue to enjoy a speedup that is linear in the number of processors, though at a lesser slope (this is the effect of hyperthreading). For more than 12 processors, there is no further benefit to parallelism on this machine.

The two right plots in Figure 1 show the performance of the different methods (with standard errors) versus the number of worker threads used. Both are essentially flat; increasing the amount of parallelism changes neither the average training loss nor the test error rate for either method. It is clear, however, that for this dataset the adaptive AsyncAdaGrad algorithm provides substantial performance benefits over AsyncDA.

4.2 Click-through-rate prediction experiments

We also experiment on a proprietary dataset consisting of search ad impressions. Each example corresponds to showing a search-engine user a particular text ad in response to a query string. From this, we construct a very sparse feature vector based on the text of the ad displayed and the query string (no user-specific data is used). The target label is 1 if the user clicked the ad and −1 otherwise.
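The paper does not spell out its featurization, so the sketch below shows one common way such a sparse text-based feature vector could be built; the hashing trick, the token prefixes, and the dimension $2^{24}$ are illustrative assumptions, not the authors' pipeline.

```python
import hashlib

def hashed_features(ad_text: str, query: str, dim: int = 2 ** 24) -> dict[int, float]:
    """Map (ad text, query string) to a sparse {index: value} vector (illustration only)."""
    features: dict[int, float] = {}
    # Prefix tokens by field so ad words and query words hash to different coordinates.
    tokens = [f"ad:{w}" for w in ad_text.lower().split()] + \
             [f"q:{w}" for w in query.lower().split()]
    for tok in tokens:
        idx = int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim
        features[idx] = features.get(idx, 0.0) + 1.0
    return features

# Example usage: only a handful of the 2**24 coordinates are nonzero per impression.
x = hashed_features("cheap flights to paris", "paris flight deals")
print(len(x), "nonzero features out of", 2 ** 24)
```

Each impression then activates only a few dozen coordinates, which is precisely the per-coordinate sparsity ($p_j$ small for most $j$) that the analysis and the asynchronous updates exploit.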
Figure 3. (A) Optimized stepsize for each number of passes: relative test-set log-loss for AsyncDA and AsyncAdaGrad, choosing the best stepsize (within a factor of about 1.4×) individually for each number of passes. (B) Effective speedup for AsyncAdaGrad. (C) Optimal stepsize scaling: the best stepsize η, expressed as a scaling factor on the stepsize used for one pass. (D) Impact of training-data ordering: five runs with different random seeds for each algorithm (with ℓ2 penalty 80).

We fit logistic regression models using both AsyncDA and AsyncAdaGrad. We run extensive experiments on a moderate-sized dataset (about $10^7$ examples, split between training and testing), which allows a thorough investigation of the impact of the stepsize η, the number of training passes,³ and ℓ2-regularization on accuracy. For these experiments we used 32 threads on 16-core machines for each run, as AsyncAdaGrad and AsyncDA achieve similar speedups from parallelization.

On this dataset, AsyncAdaGrad typically achieves an effective additional speedup over AsyncDA of 4× or more. That is, to reach a given level of accuracy, AsyncDA generally needs four times as many effective passes over the dataset. We measure accuracy with log-loss (the logistic loss) averaged over five runs using different random seeds (which control the order in which the algorithms sample examples during training). We report relative values in Figures 2 and 3, that is, the ratio of the mean loss for a given datapoint to the lowest (best) mean loss obtained. Our results are not particularly sensitive to the choice of relative log-loss as the metric of interest; we also considered AUC (the area under the ROC curve) and observed similar results.

Figure 2 shows relative log-loss as a function of the number of training passes for various stepsizes. Without regularization, AsyncAdaGrad is prone to overfitting: it achieves significantly higher accuracy on the training data (Figure 2, left), but unless the stepsize is tuned carefully to the number of passes, it will overfit (Figure 2, middle). Fortunately, the addition of ℓ2 regularization largely solves this problem. Indeed, Figure 2 (right) shows that while adding an ℓ2 penalty of 80 has very little impact on AsyncDA, it effectively prevents the overfitting of AsyncAdaGrad.⁴

Fixing the ℓ2 regularization multiplier to 80, we varied the stepsize η over a multiplicative grid with resolution √2 for each number of passes and for each algorithm. Figure 3 reports the results obtained by selecting the best stepsize in terms of test-set log-loss for each number of passes. Figure 3(A) shows the relative log-loss of the best stepsize for each algorithm; Figure 3(B) shows the relative time AsyncDA requires with respect to AsyncAdaGrad to achieve a given loss. Specifically, Figure 3(B) shows the ratio of the number of passes the two algorithms require to achieve a fixed loss, which gives a broader estimate of the speedup obtained by using AsyncAdaGrad; speedups range from 3.6× to 12×. Figure 3(C) shows the optimal stepsize as a function of the best setting for one pass. The optimal stepsize decreases moderately for AsyncAdaGrad, but is somewhat noisy for AsyncDA.

It is interesting to note that AsyncAdaGrad's accuracy is largely independent of the ordering of the training data, while AsyncDA shows significant variability. This can be seen both in the error bars on Figure 3(A) and explicitly in Figure 3(D), where we plot one line for each of the five random seeds used.
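The relative log-loss reported in Figures 2 and 3 can be computed from per-seed results along the following lines; the method names, array shapes, and synthetic numbers are placeholders standing in for the paper's measurements.

```python
import numpy as np

# losses[method][i, j] = test log-loss for seed i and configuration j
# (e.g., a (number of passes, stepsize) pair); values below are synthetic placeholders.
rng = np.random.default_rng(0)
losses = {
    "A-AdaGrad": 0.120 + 0.005 * rng.random((5, 8)),
    "A-DA":      0.125 + 0.005 * rng.random((5, 8)),
}

# Average over the five seeds, find the single best mean loss, and report ratios to it.
mean_loss = {m: v.mean(axis=0) for m, v in losses.items()}
best = min(v.min() for v in mean_loss.values())
relative = {m: v / best for m, v in mean_loss.items()}

for method, ratios in relative.items():
    print(method, np.round(ratios, 4))
```

Reporting the ratio to the best mean loss (rather than raw log-loss) makes runs comparable across configurations, which is why the y-axes in Figures 2 and 3 hover slightly above 1.0.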
Thus, while on the one hand AsyncDA requires somewhat less tuning of the stepsize and ℓ2 parameter, on the other hand tuning AsyncAdaGrad is much easier because of its predictable response.

³ Here "number of passes" more precisely means the expected number of times each example in the dataset is trained on. That is, each worker thread randomly selects a training example from the dataset for each update, and we continue making updates until (dataset size) × (number of passes) updates have been processed.

⁴ For both algorithms, this is accomplished by adding the term $80\,\eta\,\|x\|_2^2$ to the function ψ. We can achieve slightly better results for AsyncAdaGrad by varying the ℓ2 penalty with the number of passes.

References

[1] P. Auer and C. Gentile. Adaptive and self-confident online learning algorithms. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, 2000.
[2] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.
[3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, September 2004.
[4] J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[5] E. Hazan. The convex optimization approach to regret minimization. In Optimization for Machine Learning, chapter 10. MIT Press, 2012.
[6] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer, New York, 1996.
[7] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker. Identifying malicious URLs: an application of large-scale online learning. In Proceedings of the 26th International Conference on Machine Learning, 2009.
[8] C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[9] B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the Twenty-Third Annual Conference on Computational Learning Theory, 2010.
[10] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[11] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):261–283, 2009.
[12] F. Niu, B. Recht, C. Ré, and S. Wright. Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, 2011.
[13] P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. arXiv:1212.0873 [math.OC], 2012. URL http://arxiv.org/abs/1212.0873.
[14] M. Takáč, A. Bijral, P. Richtárik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In Proceedings of the 30th International Conference on Machine Learning, 2013.
4 0.65611666 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
Author: Xiaoxiao Guo, Satinder Singh, Richard L. Lewis
Abstract: We consider how to transfer knowledge from previous tasks (MDPs) to a current task in long-lived and bounded agents that must solve a sequence of tasks over a finite lifetime. A novel aspect of our transfer approach is that we reuse reward functions. While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent’s behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent’s performance relative to other approaches, including an approach that transfers policies. 1
5 0.65564823 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
Author: Le Song, Bo Dai
Abstract: Kernel embedding of distributions has led to many recent advances in machine learning. However, latent and low rank structures prevalent in real world distributions have rarely been taken into account in this setting. Furthermore, no prior work in kernel embedding literature has addressed the issue of robust embedding when the latent and low rank information are misspecified. In this paper, we propose a hierarchical low rank decomposition of kernels embeddings which can exploit such low rank structures in data while being robust to model misspecification. We also illustrate with empirical evidence that the estimated low rank embeddings lead to improved performance in density estimation. 1
6 0.65220934 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs
7 0.64721048 347 nips-2013-Variational Planning for Graph-based MDPs
8 0.64676499 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
9 0.64312035 357 nips-2013-k-Prototype Learning for 3D Rigid Structures
10 0.64085257 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
11 0.64059776 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
12 0.64011055 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries
13 0.63993692 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
14 0.63924706 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
15 0.63917291 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation
16 0.63841337 249 nips-2013-Polar Operators for Structured Sparse Estimation
17 0.63827127 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
18 0.63801634 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
19 0.63761806 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
20 0.63746154 25 nips-2013-Adaptive Anonymity via $b$-Matching