nips nips2012 nips2012-241 knowledge-graph by maker-knowledge-mining

241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization


Source: pdf

Author: Brendan Mcmahan, Matthew Streeter

Abstract: Some of the most compelling applications of online convex optimization, including online prediction and classification, are unconstrained: the natural feasible set is R^n. Existing algorithms fail to achieve sub-linear regret in this setting unless constraints on the comparator point x̊ are known in advance. We present algorithms that, without such prior knowledge, offer near-optimal regret bounds with respect to any choice of x̊. In particular, regret with respect to x̊ = 0 is constant. We then prove lower bounds showing that our guarantees are near-optimal in this setting.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Some of the most compelling applications of online convex optimization, including online prediction and classification, are unconstrained: the natural feasible set is R^n. [sent-6, score-0.342]

2 Existing algorithms fail to achieve sub-linear regret in this setting unless constraints on the comparator point x̊ are known in advance. [sent-7, score-0.561]

3 We present algorithms that, without such prior knowledge, offer near-optimal regret bounds with respect to any choice of x̊. [sent-8, score-0.441]

4 In particular, regret with respect to x̊ = 0 is constant. [sent-9, score-0.359]

5 1 Introduction: Over the past several years, online convex optimization has emerged as a fundamental tool for solving problems in machine learning (see, e.g., …). [sent-11, score-0.195]

6 The reduction from general online convex optimization to online linear optimization means that simple and efficient (in memory and time) algorithms can be used to tackle large-scale machine learning problems. [sent-14, score-0.334]

7 The key theoretical techniques behind essentially all the algorithms in this field are the use of a fixed or increasing strongly convex regularizer (for gradient descent algorithms, this is equivalent to a fixed or decreasing learning rate sequence). [sent-15, score-0.273]

8 This approach produces regret bounds of the form O(R√(T log((1 + R)T))), where R = ‖x̊‖2 is the L2 norm of an arbitrary comparator. [sent-19, score-0.425]

9 A consequence of this is that we can guarantee at most constant regret with respect to the origin, x̊ = 0. [sent-21, score-0.424]

10 This technique can be applied to any online convex optimization problem where a fixed feasible set is not an essential component of the problem. [sent-22, score-0.239]

11 For appropriately chosen σ and ℓ, this becomes a problem of online convex optimization against functions ft(x) = ℓ(σ(at · x), yt). [sent-25, score-0.248]

12 …more likely than large ones, but this is rarely best encoded as a feasible set F, which says: "all xt ∈ F are equally likely, and all other xt are ruled out." [sent-28, score-0.3]

13 While algorithms of this form have proved very effective at solving these problems, theoretical guarantees usually require fixing a feasible set of radius R, or at least an intelligent guess of the norm of an optimal comparator x̊. [sent-30, score-0.34]

14 In the classic experts setting (see, e.g., [3]), there are n experts, and on each round t the player selects an expert (say i) and obtains reward gt,i from a bounded interval (say [−1, 1]). [sent-33, score-0.458]

15 Typically, one uses an algorithm that proposes a probability distribution pt on experts, so the expected reward is pt · gt . [sent-34, score-0.662]

16 Our algorithms apply to an unconstrained version of this problem: there are still n experts with payouts in [−1, 1], but rather than selecting an individual expert, the player can place a "bet" of xt,i on each expert i, and then receives reward Σ_i xt,i gt,i = xt · gt. [sent-35, score-0.93]

17 The bets are unconstrained (betting a negative value corresponds to betting against the expert). [sent-36, score-0.173]

18 In this setting, a natural goal is the following: place bets so as to achieve as much reward as possible, subject to the constraint that total losses are bounded by a constant (which can be set equal to some starting budget which is to be invested). [sent-37, score-0.348]

19 Our algorithms can satisfy constraints of this form because regret with respect to x̊ = 0 (which equals total loss) is bounded by a constant. [sent-38, score-0.359]
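
To make the unconstrained betting game above concrete, here is a minimal Python sketch contrasting the classic experts play (a probability distribution) with an unconstrained bet vector; the uniform distribution and the simple additive betting rule are illustrative placeholders, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 200
payouts = rng.uniform(-1.0, 1.0, size=(T, n))   # g_{t,i} in [-1, 1]

# Classic experts problem: play a distribution p_t over experts; expected reward is p_t . g_t.
p = np.full(n, 1.0 / n)                          # placeholder strategy: uniform weights
experts_reward = float(sum(p @ g for g in payouts))

# Unconstrained betting: play any vector x_t; reward is x_t . g_t,
# and x_{t,i} < 0 corresponds to betting against expert i.
x, betting_reward = np.zeros(n), 0.0
for g in payouts:
    betting_reward += float(x @ g)
    x = x + 0.05 * g                             # placeholder update (plain gradient step)

print(experts_reward, betting_reward)
```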

20 It is useful to contrast our results in this setting to previous applications of online convex optimization to portfolio management, for example [6] and [2]. [sent-39, score-0.233]

21 Therefore, we consider online linear optimization where the goal is to maximize cumulative reward given adversarially selected linear reward functions ft (x) = gt · x. [sent-49, score-1.077]

22 On each round t = 1, . . . , T, the algorithm selects a point xt ∈ R^n, receives reward ft(xt) = gt · xt, and observes gt. [sent-53, score-1.303]

23 For simplicity, we assume gt,i ∈ [−1, 1], that is, ‖gt‖∞ ≤ 1. [sent-54, score-0.349]

24 If the real problem is against convex loss functions ℓt(x), they can be converted to our framework by taking gt = −∇ℓt(xt) (see the pseudo-code for REWARD-DOUBLING), using the standard reduction from online convex optimization to online linear optimization [13]. [sent-55, score-0.755]
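
The reduction in the previous sentence is mechanical; a minimal sketch, assuming a simple quadratic loss and a fixed-rate linear learner purely for illustration (the paper's algorithms plug in where the fixed-rate update is used):

```python
import numpy as np

# Convex losses l_t(x) = 0.5 * (x - y_t)^2 in one dimension (illustrative choice).
y = np.array([0.3, -0.8, 0.5, 0.1])
loss_grad = lambda t, x: x - y[t]        # l_t'(x)

# Any online linear-reward learner can be driven by g_t = -l_t'(x_t).
eta, g_sum, x_t = 0.5, 0.0, 0.0          # placeholder learner: x_{t+1} = eta * g_{1:t}
total_loss = 0.0
for t in range(len(y)):
    total_loss += 0.5 * (x_t - y[t]) ** 2
    g_t = -loss_grad(t, x_t)             # linearize the convex loss at x_t
    g_sum += g_t
    x_t = eta * g_sum

print(total_loss)
```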

25 We study the reward of our algorithms, and their regret against a fixed comparator x̊: Reward ≡ Σ_{t=1}^T gt · xt and Regret(x̊) ≡ g1:T · x̊ − Σ_{t=1}^T gt · xt. [sent-57, score-1.773]

26 Comparison of Regret Bounds: The primary contribution of this paper is to establish matching upper and lower bounds for unconstrained online convex optimization problems, using algorithms that require no prior information about the comparator point x̊. [sent-58, score-0.577]

27 To obtain this guarantee, we show that it is sufficient (and necessary) that reward is Ω(exp(|g1:T|/√T)) (see Theorem 1). [sent-60, score-0.258]

28 Table 1 compares the bounds for REWARD-DOUBLING (this paper) to those of two previous algorithms: online gradient descent [13] and projected exponentiated gradient descent [8, 12]. [sent-62, score-0.58]

29 Our bounds are not directly comparable to the bounds cited above: an O(log T) regret bound on log-wealth implies wealth at least O(OPT/T), whereas we guarantee wealth like OPT′ − O(√T). [sent-63, score-0.746]

30 Table 1, top half (assuming ‖gt‖2 ≤ 1): Gradient Descent with η = R/√T has regret R√T for x̊ = 0, R√T for ‖x̊‖2 ≤ R, and ‖x̊‖2 · T for arbitrary x̊; REWARD-DOUBLING has constant regret for x̊ = 0, R√(T log(n(1 + R)T)) for ‖x̊‖2 ≤ R, and ‖x̊‖2 √(T log(n(1 + ‖x̊‖2)T)) for arbitrary x̊. [sent-65, score-0.794]

31 Table 1, bottom half (assuming ‖gt‖∞ ≤ 1): Exponentiated Gradient has regret R√(T log n) for x̊ = 0 and for ‖x̊‖1 ≤ R, and ‖x̊‖1 · T for arbitrary x̊; REWARD-DOUBLING has constant regret for x̊ = 0, R√(T log(n(1 + R)T)) for ‖x̊‖1 ≤ R, and ‖x̊‖1 √(T log(n(1 + ‖x̊‖1)T)) for arbitrary x̊. Table 1: Worst-case regret bounds for various algorithms (up to constant factors). [sent-67, score-0.641]

32 For each algorithm, we consider a fixed choice of parameter settings and then look at how regret changes as we vary the comparator point x̊. [sent-71, score-0.561]

33 Gradient descent is minimax-optimal [1] when the comparator point is contained in a hypersphere whose radius is known in advance (‖x̊‖2 ≤ R) and gradients are sparse (‖gt‖2 ≤ 1, top table). [sent-72, score-0.764]

34 Exponentiated gradient descent excels when gradients are dense (‖gt‖∞ ≤ 1, bottom table) but the comparator point is sparse (‖x̊‖1 ≤ R for R known in advance). [sent-73, score-0.789]

35 When x̊ = 0, REWARD-DOUBLING offers constant regret, compared to Ω(√T) for the other algorithms. [sent-76, score-0.383]

36 When x̊ can be arbitrary, only REWARD-DOUBLING offers sub-linear regret (and in fact its regret bound is optimal, as shown in Theorem 8). [sent-77, score-0.756]

37 Related Work Our work is related, at least in spirit, to the use of a momentum term in stochastic gradient descent for back propagation in neural networks [7, 11, 9]. [sent-80, score-0.191]

38 In Follow-The-Regularized-Leader terms, the exponentiated gradient descent algorithm with unnormalized weights of Kivinen and Warmuth [8] plays xt+1 = arg min_{x ∈ R^n_+} g1:t · x + (1/η)(x log x − x), which has the closed-form solution xt+1 = exp(−η g1:t). [sent-82, score-0.339]

39 Like our algorithm, this algorithm moves away from the origin exponentially fast, but unlike our algorithm it can incur arbitrarily large regret with respect to x̊ = 0. [sent-83, score-0.442]
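
A one-line sketch of the unnormalized exponentiated-gradient play just described (the sign convention follows the arg-min form above):

```python
import numpy as np

def eg_unnormalized_play(g_cum, eta):
    """Closed-form FTRL play with the unnormalized entropic regularizer: x_{t+1} = exp(-eta * g_{1:t})."""
    return np.exp(-eta * np.asarray(g_cum))

# With no evidence (g_{1:t} = 0) the play is the all-ones vector, never the origin,
# which is one way to see why regret against the comparator 0 can be large.
print(eg_unnormalized_play(np.zeros(3), 0.1))
```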

40 Hazan and Kale [5] give regret bounds in terms of the variance of the gt . [sent-85, score-0.774]

41 Letting G = |g1:T| and H = Σ_{t=1}^T gt², they prove regret bounds of the form O(√V), where V = H − G²/T. [sent-86, score-0.774]

42 However, they consider the case of a known feasible set, and their algorithm (gradient descent with a constant learning rate) cannot obtain bounds of the form we prove. [sent-88, score-0.243]

43 2 Reward and Regret In this section we present a general result that converts lower bounds on reward into upper bounds on regret, for one-dimensional online linear optimization. [sent-89, score-0.522]

44 In the unconstrained setting, this result will be sufficient to provide guarantees for general n-dimensional online convex optimization. [sent-90, score-0.293]

45 Consider an algorithm for one-dimensional online linear optimization that, when run on a sequence of gradients g1 , g2 , . [sent-92, score-0.272]

46 . . . , gT, with gt ∈ [−1, 1] for all t, guarantees Reward ≥ κ exp(γ|g1:T|) − ε, (1) where γ, κ > 0 and ε ≥ 0 are constants. [sent-95, score-0.41]

47 Then, against any comparator x̊ ∈ [−R, R], we have Regret(x̊) ≤ (R/γ)(log(R/(κγ)) − 1) + ε, (2) letting 0 log 0 = 0 when R = 0. [sent-96, score-0.333]

48 Further, any algorithm with the regret guarantee of Eq. [sent-97, score-0.421]

49 The duality between reward and regret can also be seen as a consequence of the fact that exp(x) and y log y − y are convex conjugates. [sent-101, score-0.721]

50 This bound holds for all R, and so for some small R the log term becomes negative; however, for real algorithms the ε term will ensure the regret bound remains positive. [sent-103, score-0.483]
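
A small helper evaluating the regret bound of Eq. (2) as reconstructed above; the exact constants should be checked against the paper, so treat this as a sketch:

```python
import math

def regret_bound(R, gamma, kappa, eps):
    """Regret bound implied by a reward guarantee of kappa * exp(gamma * |g_{1:T}|) - eps,
    against any comparator in [-R, R]; uses the convention 0 * log 0 = 0."""
    if R == 0.0:
        return eps                      # constant regret at the origin
    return (R / gamma) * (math.log(R / (kappa * gamma)) - 1.0) + eps

print(regret_bound(0.0, 0.5, 1.0, 1.0))
print(regret_bound(10.0, 0.5, 1.0, 1.0))
```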

51 3 Gradient Descent with Increasing Learning Rates In this section we show that allowing the learning rate of gradient descent to sometimes increase leads to novel theoretical guarantees. [sent-105, score-0.216]

52 To build intuition, consider online linear optimization in one dimension, with gradients g1 , g2 , . [sent-106, score-0.212]

53 In this setting, the reward of unconstrained gradient descent has a simple closed form: Lemma 2. [sent-110, score-0.515]

54 Consider unconstrained gradient descent in one dimension, with learning rate η. [sent-111, score-0.288]

55 On round t, this algorithm plays the point xt = η g1:t−1. [sent-112, score-0.281]

56 Letting G = |g1:T| and H = Σ_{t=1}^T gt², the cumulative reward of the algorithm is exactly Reward = (η/2)(G² − H). [sent-113, score-0.648]

57 Perhaps surprisingly, this result implies that the reward is totally independent of the order of the linear functions selected by the adversary. [sent-115, score-0.275]
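
The closed form in Lemma 2 is easy to verify numerically; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.uniform(-1.0, 1.0, size=1000)
eta = 0.1

# Unconstrained 1-d gradient descent plays x_t = eta * g_{1:t-1}.
prefix = np.concatenate(([0.0], np.cumsum(g)[:-1]))     # g_{1:t-1} for t = 1..T
reward = float(np.sum(eta * prefix * g))

G, H = abs(float(g.sum())), float(np.sum(g ** 2))
print(reward, 0.5 * eta * (G ** 2 - H))                  # the two numbers agree (Lemma 2)
```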

58 Examining the expression in Lemma 2, we see that the optimal choice of learning rate η depends fundamentally on two quantities: the absolute value of the sum of gradients (G), and the sum of the squared gradients (H). [sent-116, score-0.197]

59 One of the motivations for this work is the observation that the state-of-the-art online gradient descent algorithms adjust their learning rates based only on the observed value of H (or its upper bound T ); for example [4, 10]. [sent-119, score-0.373]

60 We would like to increase reward by also accounting for G. [sent-120, score-0.278]

61 We suppose for the moment that an upper bound H̄ on H = Σ_{t=1}^T gt² is known in advance. [sent-125, score-0.409]

62 In the first epoch, we run gradient descent with a small initial learning rate η = η1 . [sent-126, score-0.196]

63 Whenever the total reward accumulated in the current epoch reaches ηH̄, we double η and start a new epoch (returning to the origin and forgetting all previous gradients except the most recent one). [sent-127, score-0.514]

64 Theorem 3 (fragment): on any sequence g1, . . . , gT, all in [−1, 1], with H = Σ_{t=1}^T gt² ≤ H̄, REWARD-DOUBLING-1D obtains reward satisfying Reward = Σ_{t=1}^T xt gt ≥ (η1 H̄/4) exp(a |g1:T|/√H̄) − η1 H̄, (3) for a = log(2)/√3. [sent-132, score-0.755]

65 Algorithm 1, REWARD-DOUBLING-1D. Parameters: initial learning rate η1, upper bound H̄ ≥ Σ_{t=1}^T gt². [sent-133, score-0.815]
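
A minimal sketch of the epoch-doubling idea behind REWARD-DOUBLING-1D, following the description above; details such as how a new epoch is seeded with the most recent gradient are taken from the text and may differ slightly from the paper's pseudo-code.

```python
def reward_doubling_1d(gradients, eta1, H_bar):
    """Epoch-doubling gradient descent in one dimension (sketch)."""
    eta, Q, g_sum, total = eta1, 0.0, 0.0, 0.0   # Q: reward earned in the current epoch
    for g in gradients:
        x = eta * g_sum                          # gradient-descent play within the epoch
        total += x * g
        Q += x * g
        g_sum += g
        if Q >= eta * H_bar:                     # epoch target reached: double the rate
            eta *= 2.0
            Q, g_sum = 0.0, g                    # restart from the origin, keep only the last gradient
    return total

# Usage sketch: reward on a short gradient sequence.
print(reward_doubling_1d([0.5, 1.0, 1.0, -0.5, 1.0], eta1=0.01, H_bar=10.0))
```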

66 For i = 1, . . . , k, let ti denote the round on which Qi is initialized to 0, with t1 ≡ 1, and define tk+1 ≡ T. [sent-165, score-0.183]

67 By construction, Qi is the total reward of a gradient descent algorithm that is active on rounds ti through ti+1 inclusive, and that uses learning rate ηi (note that on round ti , this algorithm gets 0 reward and we initialize Qi to 0 on that round). [sent-166, score-1.057]

68 At the end of round ti+1 − 1, we must have had Qi < ηi H̄ (otherwise epoch i + 1 would have begun earlier). [sent-171, score-0.163]

69 We can now apply Theorem 1 to the reward (given by Eq. [sent-176, score-0.258]

70 When the feasible set is also fixed in advance, online gradient descent with a fixed learning rate obtains a regret bound of O(R√T). [sent-179, score-0.736]

71 By choosing η1 sufficiently small, we guarantee constant regret against the origin, x̊ = 0 (equivalently, constant total loss). [sent-181, score-0.448]

72 Further, for any feasible set of radius R, we still have worst-case regret of at most O(R√(T log((1 + R)T))), which is only modestly worse than that of gradient descent with the optimal R known in advance. [sent-182, score-0.597]

73 The need for an upper bound H̄ can be removed using a standard guess-and-doubling approach, at the cost of a constant factor increase in regret (see the appendix for the proof). [sent-183, score-0.488]

74 On each era i, the algorithm runs REWARD-DOUBLING-1D with an upper bound of H̄_i = 2^(i−1) and initial learning rate η1^i = 2^(−2i). [sent-186, score-0.142]

75 An era ends when H̄_i is no longer an upper bound on the sum of squared gradients seen during that era. [sent-187, score-0.163]

76 Letting c = √2/(√2 − 1), this algorithm has regret at most Regret ≤ cR√((H + 1) log(…)). [sent-188, score-0.428]

77 Extension to n dimensions: To extend our results to general online convex optimization, it is sufficient to run a separate copy of REWARD-DOUBLING-1D-GUESS for each coordinate, as is done in REWARD-DOUBLING (Algorithm 2). [sent-190, score-0.187]

78 The key to the analysis of this algorithm is that overall regret is simply the sum of regret on n one-dimensional subproblems which can be analyzed independently. [sent-191, score-0.739]
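
A minimal sketch of this per-coordinate decomposition: the n-dimensional algorithm simply queries one independent 1-d learner per coordinate (here the epoch-doubling learner sketched earlier stands in for REWARD-DOUBLING-1D-GUESS, which additionally guesses the bound H̄):

```python
import numpy as np

class RewardDoubling1D:
    """Stateful 1-d epoch-doubling learner (sketch, as described above)."""
    def __init__(self, eta1, H_bar):
        self.eta, self.H_bar = eta1, H_bar
        self.Q, self.g_sum = 0.0, 0.0
    def play(self):
        return self.eta * self.g_sum
    def observe(self, g):
        x = self.play()
        self.Q += x * g
        self.g_sum += g
        if self.Q >= self.eta * self.H_bar:
            self.eta *= 2.0
            self.Q, self.g_sum = 0.0, g

# n-dimensional play is just the vector of per-coordinate plays.
n = 3
learners = [RewardDoubling1D(eta1=1e-3, H_bar=100.0) for _ in range(n)]
g_t = np.array([0.2, -0.5, 1.0])
x_t = np.array([l.play() for l in learners])      # the point played this round
for learner, g in zip(learners, g_t):
    learner.observe(float(g))
```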

79 Theorem 4 (fragment): for functions f1, . . . , fT from R^n to R, REWARD-DOUBLING with εi = ε/n has regret bounded by Regret(x̊) ≤ ε + c Σ_{i=1}^n |x̊_i| √((H_i + 1) log(…)) ≤ ε + c ‖x̊‖2 √((H + n) log(…)), for c = √2/(√2 − 1), where H_i = Σ_{t=1}^T gt,i² and H = Σ_{t=1}^T ‖gt‖2²; the log arguments involve |x̊_i|(2H_i + 2)^(5/2) and ‖x̊‖2 (2H + 2)^(5/2). [sent-196, score-0.804]

80 For any coordinate i, define Regret_i = Σ_{t=1}^T x̊_i gt,i − Σ_{t=1}^T xt,i gt,i, and observe that Σ_{i=1}^n Regret_i = Σ_{t=1}^T x̊ · gt − Σ_{t=1}^T xt · gt. [sent-199, score-0.38]

81 Furthermore, Regret_i is simply the regret of REWARD-DOUBLING-1D-GUESS on the gradient sequence g1,i, g2,i, . . . [sent-201, score-0.475]

82 …(Eq. (7)) holds for all T and R, which is better (by constant factors) than Theorem 4 when gt ∈ {−1, 1} (which implies T = H). [sent-214, score-0.39]

83 Because reward changes by gt xt on round t, it suffices to guarantee that for any g ∈ [−1, 1], N (g1:t , t) + gxt+1 ≥ N (g1:t + g, t + 1) (8) where xt+1 is the point the algorithm plays on round t + 1, and we assume N (0, 1) = 0. [sent-217, score-1.03]

84 This suggests that if we want to maintain reward at least N(g1:t, t) = (1/t)(exp(|g1:t|/√t) − 1), we should set xt+1 ≈ sign(g1:t) · t^(−3/2) exp(|g1:t|/√t). [sent-219, score-0.284]

85 Fix a sequence of reward functions ft (x) = gt x with gt ∈ [−1, 1], and let Gt = |g1:t |. [sent-222, score-1.048]

86 We consider SMOOTH-REWARD-DOUBLING, which plays 0 on round 1 and whenever Gt = 0; otherwise, it plays xt+1 = η sign(g1:t) B(Gt, t + 5), (9) with η > 0 a learning-rate parameter and B(G, t) = t^(−3/2) exp(G/√t). (10) [sent-223, score-0.189]

87 Then, at the end of each round t, this algorithm has Reward(t) ≥ η ((1/(t + 5)) exp(Gt/√(t + 5)) − 1). [sent-224, score-0.148]
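
A minimal sketch of the SMOOTH-REWARD-DOUBLING play rule in Eqs. (9)-(10) as reconstructed above (the t + 5 shift and the form of B follow that reconstruction, so treat the constants as assumptions):

```python
import math

def smooth_reward_doubling_play(g_history, eta):
    """Return x_{t+1} = eta * sign(g_{1:t}) * B(G_t, t + 5), playing 0 on round 1 or when G_t = 0."""
    t = len(g_history)                   # rounds observed so far
    s = sum(g_history)                   # g_{1:t}
    if t == 0 or s == 0.0:
        return 0.0
    G = abs(s)
    B = (t + 5) ** (-1.5) * math.exp(G / math.sqrt(t + 5))
    return eta * math.copysign(1.0, s) * B

print(smooth_reward_doubling_play([], 0.1))            # round 1: play 0
print(smooth_reward_doubling_play([1.0, 1.0], 0.1))    # grows exponentially with |g_{1:t}|
```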

88 Consider the problem of unconstrained online linear optimization in one dimension, and an online algorithm that guarantees origin-regret at most ε. [sent-235, score-0.397]

89 Then, for any fixed comparator x̊ and any integer T0, there exists a gradient sequence {gt} ∈ [−1, 1]^T of length T ≥ T0 for which the algorithm's regret satisfies Regret(x̊) ≥ … [sent-236, score-0.677]

90 Let Q be the algorithm's reward when each gt is drawn independently and uniformly from {−1, 1}. [sent-240, score-0.607]

91 On this sequence, regret is at least G x̊ − Q ≥ R√(kT) − R√T = Ω(R√(kT)). [sent-248, score-0.359]

92 Consider the problem of unconstrained online linear optimization in R^n, and consider an online algorithm that guarantees origin-regret at most ε. [sent-250, score-0.397]

93 For any radius R and any T0, there exists a gradient sequence {gt} ∈ ([−1, 1]^n)^T of length T ≥ T0, and a comparator x̊ with ‖x̊‖1 = R, for which the algorithm's regret satisfies a lower bound of the form Regret(x̊) ≥ Ω(Σ_{i=1}^n |x̊_i| √(T log(|x̊_i|√T/…))). [sent-251, score-0.87]

94 For each coordinate i, Theorem 7 implies that there exists a T ≥ T0 and a sequence of gradients gt,i such that Σ_{t=1}^T x̊_i gt,i − Σ_{t=1}^T xt,i gt,i ≥ Ω(|x̊_i| √(T log(…))). [sent-254, score-0.16]

95 Summing this inequality across all n coordinates then gives the regret bound stated in the theorem. [sent-257, score-0.413]

96 The following theorem presents a stronger negative result for Follow-the-Regularized-Leader algorithms with a fixed regularizer: for any such algorithm that guarantees origin-regret at most ε_T after T rounds, worst-case regret with respect to any point outside [−ε_T, ε_T] grows linearly with T. [sent-258, score-0.457]

97 Consider a Follow-The-Regularized-Leader algorithm that sets xt = arg min_x (g1:t−1 · x + ψT(x)), where ψT is a convex, non-negative function with ψT(0) = 0. [sent-260, score-0.149]

98 Then, for any x̊ with |x̊| > ε_T, there exists a sequence of T gradients such that the algorithm's regret with respect to x̊ is at least (T − 1)(|x̊| − ε_T). [sent-262, score-0.471]

99 It should be possible to apply our techniques to problems that do have constrained feasible sets; for example, it is natural to consider the unconstrained experts problem on the positive orthant. [sent-265, score-0.177]

100 Optimal strategies and minimax lower bounds for online convex games. [sent-270, score-0.232]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('eward', 0.459), ('oubling', 0.459), ('regret', 0.359), ('gt', 0.349), ('reward', 0.258), ('comparator', 0.202), ('xt', 0.128), ('online', 0.11), ('round', 0.101), ('unconstrained', 0.092), ('descent', 0.088), ('qi', 0.087), ('ti', 0.082), ('gradient', 0.077), ('exponentiated', 0.074), ('gti', 0.073), ('regreti', 0.073), ('uess', 0.073), ('gradients', 0.073), ('bounds', 0.066), ('epoch', 0.062), ('convex', 0.056), ('mooth', 0.055), ('wealth', 0.053), ('ft', 0.053), ('hi', 0.052), ('satyen', 0.049), ('bets', 0.049), ('mcmahan', 0.049), ('log', 0.048), ('rn', 0.045), ('hazan', 0.045), ('pr', 0.044), ('feasible', 0.044), ('elad', 0.044), ('theorem', 0.042), ('expert', 0.042), ('experts', 0.041), ('guarantee', 0.041), ('origin', 0.041), ('qk', 0.04), ('sequence', 0.039), ('portfolio', 0.038), ('rounds', 0.038), ('bound', 0.038), ('letting', 0.035), ('guarantees', 0.035), ('betting', 0.032), ('plays', 0.031), ('rate', 0.031), ('coordinate', 0.031), ('guess', 0.03), ('era', 0.03), ('optimization', 0.029), ('radius', 0.029), ('kt', 0.029), ('investment', 0.028), ('brendan', 0.027), ('colt', 0.026), ('exp', 0.026), ('momentum', 0.026), ('management', 0.025), ('kivinen', 0.025), ('appendix', 0.025), ('kale', 0.024), ('constant', 0.024), ('advance', 0.023), ('play', 0.023), ('prediction', 0.022), ('upper', 0.022), ('gs', 0.022), ('algorithm', 0.021), ('regularizer', 0.021), ('copy', 0.021), ('obtains', 0.02), ('fundamentally', 0.02), ('player', 0.02), ('lemma', 0.02), ('cumulative', 0.02), ('increase', 0.02), ('rates', 0.02), ('proof', 0.02), ('fix', 0.019), ('doesn', 0.018), ('receive', 0.018), ('adjust', 0.018), ('double', 0.018), ('implies', 0.017), ('selects', 0.017), ('losses', 0.017), ('matthew', 0.017), ('pt', 0.017), ('sign', 0.016), ('inequality', 0.016), ('matt', 0.016), ('gxt', 0.016), ('invested', 0.016), ('streeter', 0.016), ('loss', 0.016), ('offer', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

Author: Brendan Mcmahan, Matthew Streeter

Abstract: Some of the most compelling applications of online convex optimization, including online prediction and classification, are unconstrained: the natural feasible set is R^n. Existing algorithms fail to achieve sub-linear regret in this setting unless constraints on the comparator point x̊ are known in advance. We present algorithms that, without such prior knowledge, offer near-optimal regret bounds with respect to any choice of x̊. In particular, regret with respect to x̊ = 0 is constant. We then prove lower bounds showing that our guarantees are near-optimal in this setting. 1

2 0.20439611 295 nips-2012-Risk-Aversion in Multi-armed Bandits

Author: Amir Sani, Alessandro Lazaric, Rémi Munos

Abstract: Stochastic multi–armed bandits solve the Exploration–Exploitation dilemma and ultimately maximize the expected reward. Nonetheless, in many practical problems, maximizing the expected reward is not the most desirable objective. In this paper, we introduce a novel setting based on the principle of risk–aversion where the objective is to compete against the arm with the best risk–return trade–off. This setting proves to be more difficult than the standard multi-arm bandit setting due in part to an exploration risk which introduces a regret associated to the variability of an algorithm. Using variance as a measure of risk, we define two algorithms, investigate their theoretical guarantees, and report preliminary empirical results. 1

3 0.16828088 324 nips-2012-Stochastic Gradient Descent with Only One Projection

Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi

Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., the positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with high probability, the proposed algorithms achieve an O(1/√T) convergence rate for general convex optimization, and an O(ln T/T) rate for strongly convex optimization under mild conditions about the domain and the objective function. 1

4 0.16198003 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

Author: Ronald Ortner, Daniil Ryabko

Abstract: We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Besides the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Hölder continuity of rewards and transition probabilities. 1

5 0.15704088 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

Author: Jaedeug Choi, Kee-eung Kim

Abstract: We present a nonparametric Bayesian approach to inverse reinforcement learning (IRL) for multiple reward functions. Most previous IRL algorithms assume that the behaviour data is obtained from an agent who is optimizing a single reward function, but this assumption is hard to guarantee in practice. Our approach is based on integrating the Dirichlet process mixture model into Bayesian IRL. We provide an efficient Metropolis-Hastings sampling algorithm utilizing the gradient of the posterior to estimate the underlying reward functions, and demonstrate that our approach outperforms previous ones via experiments on a number of problem domains. 1

6 0.14720088 293 nips-2012-Relax and Randomize : From Value to Algorithms

7 0.13742816 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

8 0.13292243 60 nips-2012-Bayesian nonparametric models for ranked data

9 0.13177027 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

10 0.12252782 102 nips-2012-Distributed Non-Stochastic Experts

11 0.11753377 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

12 0.11601961 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

13 0.11571461 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

14 0.10545138 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection

15 0.10293599 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

16 0.10227117 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions

17 0.10180202 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

18 0.092907302 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

19 0.08722914 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

20 0.083149388 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.174), (1, -0.168), (2, 0.139), (3, 0.108), (4, 0.07), (5, 0.018), (6, -0.019), (7, -0.01), (8, -0.007), (9, 0.029), (10, 0.118), (11, 0.103), (12, -0.146), (13, -0.029), (14, -0.096), (15, 0.103), (16, -0.104), (17, 0.017), (18, -0.057), (19, -0.074), (20, 0.074), (21, -0.138), (22, 0.012), (23, -0.031), (24, -0.003), (25, 0.003), (26, -0.034), (27, -0.063), (28, 0.003), (29, -0.012), (30, -0.073), (31, -0.105), (32, 0.13), (33, -0.053), (34, -0.038), (35, -0.002), (36, 0.12), (37, -0.014), (38, 0.013), (39, -0.089), (40, -0.005), (41, -0.036), (42, 0.03), (43, -0.04), (44, -0.085), (45, 0.125), (46, -0.08), (47, -0.16), (48, 0.011), (49, -0.041)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93640006 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

Author: Brendan Mcmahan, Matthew Streeter

Abstract: Some of the most compelling applications of online convex optimization, including online prediction and classification, are unconstrained: the natural feasible set is R^n. Existing algorithms fail to achieve sub-linear regret in this setting unless constraints on the comparator point x̊ are known in advance. We present algorithms that, without such prior knowledge, offer near-optimal regret bounds with respect to any choice of x̊. In particular, regret with respect to x̊ = 0 is constant. We then prove lower bounds showing that our guarantees are near-optimal in this setting. 1

2 0.70678109 102 nips-2012-Distributed Non-Stochastic Experts

Author: Varun Kanade, Zhenming Liu, Bozidar Radunovic

Abstract: We consider the online distributed non-stochastic experts problem, where the distributed system consists of one coordinator node that is connected to k sites, and the sites are required to communicate with each other via the coordinator. At each time-step t, one of the k site nodes has to pick an expert from the set {1, . . . , n}, and the same site receives information about payoffs of all experts for that round. The goal of the distributed system is to minimize regret at time horizon T, while simultaneously keeping communication to a minimum. The two extreme solutions to this problem are: (i) Full communication: this essentially simulates the non-distributed setting to obtain the optimal O(√(log(n) T)) regret bound at the cost of T communication. (ii) No communication: each site runs an independent copy – the regret is O(√(log(n) kT)) and the communication is 0. This paper shows the difficulty of simultaneously achieving regret asymptotically better than √(kT) and communication better than T. We give a novel algorithm that for an oblivious adversary achieves a non-trivial trade-off: regret O(k^(5(1+ε)/6) √T) and communication O(T/k^ε), for any value of ε ∈ (0, 1/5). We also consider a variant of the model, where the coordinator picks the expert. In this model, we show that the label-efficient forecaster of Cesa-Bianchi et al. (2005) already gives a strategy that is near optimal in the regret vs. communication trade-off. 1

3 0.61263782 295 nips-2012-Risk-Aversion in Multi-armed Bandits

Author: Amir Sani, Alessandro Lazaric, Rémi Munos

Abstract: Stochastic multi–armed bandits solve the Exploration–Exploitation dilemma and ultimately maximize the expected reward. Nonetheless, in many practical problems, maximizing the expected reward is not the most desirable objective. In this paper, we introduce a novel setting based on the principle of risk–aversion where the objective is to compete against the arm with the best risk–return trade–off. This setting proves to be more difficult than the standard multi-arm bandit setting due in part to an exploration risk which introduces a regret associated to the variability of an algorithm. Using variance as a measure of risk, we define two algorithms, investigate their theoretical guarantees, and report preliminary empirical results. 1

4 0.59576321 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

Author: Ronald Ortner, Daniil Ryabko

Abstract: We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Besides the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Hölder continuity of rewards and transition probabilities. 1

5 0.55457044 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

Author: Morteza Ibrahimi, Adel Javanmard, Benjamin V. Roy

Abstract: We study the problem of adaptive control of a high dimensional linear quadratic (LQ) system. Previous work established the asymptotic convergence to an optimal controller for various adaptive control schemes. More recently, for the average cost LQ problem, a regret bound of O(√T) was shown, apart from logarithmic factors. However, this bound scales exponentially with p, the dimension of the state space. In this work we consider the case where the matrices describing the dynamics of the LQ system are sparse and their dimensions are large. We present an adaptive control scheme that achieves a regret bound of O(p√T), apart from logarithmic factors. In particular, our algorithm has an average cost of (1 + ε) times the optimum cost after T = polylog(p) · O(1/ε²). This is in comparison to previous work on the dense dynamics, where the algorithm requires time that scales exponentially with dimension in order to achieve regret of ε times the optimal cost. We believe that our result has prominent applications in the emerging area of computational advertising, in particular targeted online advertising and advertising in social networks. 1

6 0.53699851 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

7 0.5289762 61 nips-2012-Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

8 0.52626169 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

9 0.52587736 293 nips-2012-Relax and Randomize : From Value to Algorithms

10 0.50472653 324 nips-2012-Stochastic Gradient Descent with Only One Projection

11 0.50243878 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection

12 0.46504048 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions

13 0.45075819 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

14 0.45056674 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

15 0.44116759 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

16 0.38704073 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

17 0.37860239 161 nips-2012-Interpreting prediction markets: a stochastic approach

18 0.37618798 292 nips-2012-Regularized Off-Policy TD-Learning

19 0.36939716 43 nips-2012-Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning

20 0.36764827 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.024), (17, 0.013), (21, 0.05), (38, 0.165), (42, 0.06), (54, 0.043), (55, 0.011), (57, 0.199), (68, 0.022), (74, 0.038), (76, 0.09), (80, 0.107), (92, 0.074)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81638664 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

Author: Brendan Mcmahan, Matthew Streeter

Abstract: Some of the most compelling applications of online convex optimization, including online prediction and classification, are unconstrained: the natural feasible set is R^n. Existing algorithms fail to achieve sub-linear regret in this setting unless constraints on the comparator point x̊ are known in advance. We present algorithms that, without such prior knowledge, offer near-optimal regret bounds with respect to any choice of x̊. In particular, regret with respect to x̊ = 0 is constant. We then prove lower bounds showing that our guarantees are near-optimal in this setting. 1

2 0.78950053 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Author: Bruno Scherrer, Boris Lesner

Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error ε at each iteration, it is well-known that one can compute stationary policies that are 2γε/(1 − γ)²-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to 2γε/(1 − γ)-optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of "computing near-optimal non-stationary policies" is much simpler than that of "computing near-optimal stationary policies". 1

3 0.74306732 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

Author: Nicolò Cesa-bianchi, Pierre Gaillard, Gabor Lugosi, Gilles Stoltz

Abstract: Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. This is done using either a carefully designed projection or by a weight sharing technique. Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other related regrets. Our analysis also captures and extends the generalized weight sharing technique of Bousquet and Warmuth, and can be refined in several ways, including improvements for small losses and adaptive tuning of parameters. 1

4 0.73902655 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

Author: Jake Bouvrie, Jean-jeacques Slotine

Abstract: To learn reliable rules that can generalize to novel situations, the brain must be capable of imposing some form of regularization. Here we suggest, through theoretical and computational arguments, that the combination of noise with synchronization provides a plausible mechanism for regularization in the nervous system. The functional role of regularization is considered in a general context in which coupled computational systems receive inputs corrupted by correlated noise. Noise on the inputs is shown to impose regularization, and when synchronization upstream induces time-varying correlations across noise variables, the degree of regularization can be calibrated over time. The resulting qualitative behavior matches experimental data from visual cortex. 1

5 0.73655891 227 nips-2012-Multiclass Learning with Simplex Coding

Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-jeacques Slotine

Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent to the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1

6 0.73623323 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

7 0.73323882 292 nips-2012-Regularized Off-Policy TD-Learning

8 0.73249441 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

9 0.73130172 65 nips-2012-Cardinality Restricted Boltzmann Machines

10 0.73118222 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

11 0.73076862 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

12 0.72789598 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

13 0.727135 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

14 0.72606546 358 nips-2012-Value Pursuit Iteration

15 0.72569072 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

16 0.72523731 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization

17 0.72476488 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

18 0.72374809 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference

19 0.72262818 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

20 0.72259736 293 nips-2012-Relax and Randomize : From Value to Algorithms