nips nips2011 nips2011-185 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Elad Hazan, Satyen Kale
Abstract: We present an efficient algorithm for the problem of online multiclass prediction with bandit feedback in the fully adversarial setting. We measure its regret with respect to the log-loss defined in [AR09], which is parameterized by a scalar α. We prove that the regret of NEWTRON is O(log T) when α is a constant that does not vary with the horizon T, and at most O(T^{2/3}) if α is allowed to increase to infinity with T. For α = O(log T), the regret is bounded by O(√T), thus solving the open problem of [KSST08, AR09]. Our algorithm is based on a novel application of the online Newton method [HAK07]. We test our algorithm and show it to perform well in experiments, even when α is a small constant. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present an efficient algorithm for the problem of online multiclass prediction with bandit feedback in the fully adversarial setting. [sent-6, score-0.381]
2 We measure its regret with respect to the log-loss defined in [AR09], which is parameterized by a scalar α. [sent-7, score-0.267]
3 We prove that the regret of NEWTRON is O(log T) when α is a constant that does not vary with the horizon T, and at most O(T^{2/3}) if α is allowed to increase to infinity with T. [sent-8, score-0.267]
4 For α = O(log T), the regret is bounded by O(√T), thus solving the open problem of [KSST08, AR09]. [sent-9, score-0.287]
5 Our algorithm is based on a novel application of the online Newton method [HAK07]. [sent-10, score-0.068]
6 We only obtain limited feedback about the true label of the input (e.g. [sent-14, score-0.078]
7 in recommender systems, we only get feedback on the recommended items). [sent-16, score-0.065]
8 Several such problems can be cast as online, bandit versions of multiclass prediction problems. [sent-17, score-0.214]
9 In each round, the learner receives an input x in some high-dimensional feature space (the “context”), produces an action in response, and obtains an associated reward. [sent-19, score-0.064]
10 The goal is to minimize regret with respect to a reference class of policies specifying actions for each context. [sent-20, score-0.25]
11 In this paper, we consider the special case of multiclass prediction, which is a fundamental problem in this area introduced by Kakade et al [KSST08]. [sent-21, score-0.115]
12 Here, a learner obtains a feature vector, which is associated with an unknown label y which can take one of k values. [sent-22, score-0.102]
13 Then the learner produces a prediction of the label, ŷ. [sent-23, score-0.071]
14 The goal is to design an efficient algorithm that minimizes regret with respect to a natural reference class of policies: linear predictors. [sent-25, score-0.266]
15 Kakade et al. [KSST08] gave an efficient algorithm, dubbed BANDITRON. [sent-26, score-0.067]
16 Their algorithm attains regret of O(T^{2/3}) for a natural multiclass hinge loss, and they ask the question whether a better regret bound is possible. [sent-27, score-0.664]
17 While the EXP4 algorithm [ACBFS03], applied to this setting, has an O(√(T log T)) regret bound, it is highly inefficient, requiring O(T^{n/2}) time per iteration, where n is the dimension of the feature space. [sent-28, score-0.305]
18 Ideally, one would like to match or improve the O(√(T log T)) regret bound of the EXP4 algorithm with an efficient algorithm (for a suitable loss function). [sent-29, score-0.405]
19 In COLT 2009, Abernethy and Rakhlin [AR09] formulated the open question precisely as minimizing regret for a suitable loss function in the fully adversarial setting. (For the basic bandit classification problem see [DHK07, RTB07, DH06, FKM05, AK08, MB04, AHR08].) [sent-31, score-0.461]
20 In this paper we address this question and design a novel algorithm for the fully adversarial setting, with its expected regret measured with respect to the log-loss function defined in [AR09], which is parameterized by a scalar α. [sent-36, score-0.358]
21 When α is a constant independent of T , we get a much stronger guarantee than required by the open problem: the regret is bounded by O(log T ). [sent-37, score-0.304]
22 In fact, the regret is bounded by O(√T) even for α = Θ(log T). [sent-38, score-0.267]
23 Our regret bound for larger values of α increases smoothly to a maximum of O(T^{2/3}), matching that of BANDITRON in the worst case. [sent-39, score-0.288]
24 The algorithm is efficient to implement, and it is based on the online Newton method introduced in [HAK07]; hence we call the new algorithm NEWTRON. [sent-40, score-0.084]
25 We implement the algorithm (and a faster variant, PNEWTRON) and test it on the same data sets used by Kakade et al. [KSST08]. [sent-41, score-0.058]
26 In Rn, let 1 and 0 denote the all-1s and all-0s vectors respectively, and let I denote the identity matrix in Rn×n. [sent-48, score-0.054]
27 For a matrix W, we denote by ‖W‖ the Frobenius norm of W, which is also the usual ℓ2 norm of the vector form of W, and so the notation is consistent. [sent-64, score-0.09]
28 For two square symmetric matrices W, V of like order, denote by W ⪰ V the fact that W − V is positive semidefinite. [sent-69, score-0.055]
29 A useful fact of the Kronecker product is the following: if W, V are symmetric matrices such that W ⪰ V, and if U is a positive semidefinite symmetric matrix, then W ⊗ U ⪰ V ⊗ U. [sent-72, score-0.06]
30 For t = 1, . . . , T, we are presented a feature vector xt ∈ X, where X ⊆ Rn and ‖x‖ ≤ R for all x ∈ X. [sent-79, score-0.156]
31 We are required to produce a prediction, ŷt ∈ [k], as the label of xt. [sent-82, score-0.423]
32 In response, we obtain only 1 bit of information: whether ŷt = yt or not. [sent-83, score-0.458]
33 In particular, when ŷt ≠ yt, the identity of yt remains unknown (although one label, ŷt, is ruled out). [sent-84, score-0.916]
34 The learner’s hypothesis class is parameterized by matrices W ∈ Rk×n with ‖W‖ ≤ D, for some specified constant D. [sent-85, score-0.054]
35 Denoting the rows of W by W1, . . . , Wk, the prediction associated with W for xt is ŷt = arg max_{i∈[k]} Wi · xt. [sent-90, score-0.592]
36 While ideally we would like to minimize the 0–1 loss suffered by the learner, for computational reasons it is preferable to consider convex loss functions. [sent-91, score-0.108]
37 A natural choice, used in Kakade et al. [KSST08], is the multiclass hinge loss: ℓ(W, (xt, yt)) = max_{i≠yt} [1 − W_{yt} · xt + Wi · xt]_+. [sent-92, score-0.604]
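As a concrete illustration, here is a minimal numerical sketch of the argmax prediction rule and the multiclass hinge loss above. The function names `predict` and `multiclass_hinge` are ours, not from the paper; treat this as an illustrative sketch rather than the paper's implementation.

```python
import numpy as np

def predict(W, x):
    # Prediction rule: y_hat = argmax_i W_i . x
    return int(np.argmax(W @ x))

def multiclass_hinge(W, x, y):
    # Multiclass hinge loss: max_{i != y} [1 - W_y.x + W_i.x]_+
    scores = W @ x
    margins = 1.0 - scores[y] + scores
    margins[y] = -np.inf            # exclude i == y from the max
    return max(0.0, float(np.max(margins)))

# Tiny example: k = 3 labels, n = 2 features.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
x = np.array([2.0, 0.0])
print(predict(W, x))              # class 0 wins: scores are (2, 0, -2)
print(multiclass_hinge(W, x, 0))  # margin 1 - 2 + 0 = -1, so loss 0
```

When the true label is not the predicted one (e.g. y = 1 above), the loss grows with the margin of the wrong prediction over the right one.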
38 Regret := Σ_{t=1}^T ℓ(Wt, (xt, yt)) − min_{W*∈K} Σ_{t=1}^T ℓ(W*, (xt, yt)). [sent-96, score-0.458]
39 A different loss function was proposed in an open problem by Abernethy and Rakhlin in COLT 2009 [AR09]. [sent-97, score-0.066]
40 We choose a constant α which parameterizes the loss function. [sent-99, score-0.063]
41 Suppose we make our prediction ŷt by sampling from p. [sent-102, score-0.261]
42 A natural loss function for this scheme is the log-loss, defined as follows: ℓ(W, (x, y)) = −(1/α) log(p_y) = −(1/α) log( exp(αW_y · x) / Σ_j exp(αW_j · x) ) = −W_y · x + (1/α) log Σ_j exp(αW_j · x). [sent-103, score-0.124]
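This α-parameterized log-loss is a scaled softmax cross-entropy, and can be sketched in a few lines (`log_loss` is our name; the example weights are illustrative). For a correct prediction the loss shrinks toward zero as α grows, as described in the text:

```python
import numpy as np

def log_loss(W, x, y, alpha):
    # l(W,(x,y)) = -(1/alpha) log p_y, with p = softmax(alpha * W x)
    z = alpha * (W @ x)
    z = z - z.max()                  # stabilize the softmax numerically
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[y]) / alpha

W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
x = np.array([1.0, 0.0])
# The prediction for x is class 0, and it is "correct" here.
print(log_loss(W, x, 0, alpha=1.0))   # log(1 + e^{-1}) ~ 0.3133
print(log_loss(W, x, 0, alpha=10.0))  # much smaller
```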
43 As α becomes large, this log-loss function has the property that when the prediction given by W for x is correct, it is very close to zero, and when the prediction is incorrect, it is roughly proportional to the margin of the incorrect prediction over the correct one. [sent-105, score-0.112]
44 The algorithm and its analysis depend upon the gradient and Hessian of the loss function w.r.t. W. [sent-106, score-0.062]
45 The following lemma derives these quantities (proof in full version). [sent-110, score-0.071]
46 Then we have ∇ℓ(W, (x, y)) = (p − e_y) ⊗ x and ∇²ℓ(W, (x, y)) = α(diag(p) − pp^⊤) ⊗ xx^⊤. [sent-114, score-0.092]
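The gradient formula from the lemma, ∇ℓ(W,(x,y)) = (p − e_y) ⊗ x, can be sanity-checked numerically with a finite difference. A small sketch (our function names, arbitrary random data and α):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def log_loss(W, x, y, alpha):
    return -np.log(softmax(alpha * (W @ x))[y]) / alpha

def grad_log_loss(W, x, y, alpha):
    # Lemma: grad l(W,(x,y)) = (p - e_y) (outer) x, as a k-by-n matrix.
    p = softmax(alpha * (W @ x))
    e_y = np.zeros(W.shape[0]); e_y[y] = 1.0
    return np.outer(p - e_y, x)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4)); x = rng.normal(size=4)
y, alpha = 1, 2.0
G = grad_log_loss(W, x, y, alpha)

# One-sided finite-difference check of a single gradient entry.
eps = 1e-6
Wp = W.copy(); Wp[2, 3] += eps
fd = (log_loss(Wp, x, y, alpha) - log_loss(W, x, y, alpha)) / eps
print(abs(fd - G[2, 3]))  # tiny (finite-difference error only)
```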
47 4: Let pt = P(Wt, xt), and set p̂t = (1 − γ) · pt + (γ/k) · 1. [sent-129, score-0.582]
48 7: if ŷt = yt then 8: Define ℓ̃t := ((1 − pt(yt))/p̂t(yt)) · ((1/k)1 − e_{yt}) ⊗ xt and κt := p̂t(yt). [sent-135, score-0.899]
49 9: else 10: Define ℓ̃t := (pt(ŷt)/p̂t(ŷt)) · (e_{ŷt} − (1/k)1) ⊗ xt and κt := 1. [sent-136, score-0.583]
50 11: end if 12: Define the cost function ft(W) := ℓ̃t · (W − Wt) + (1/2)κtβ(ℓ̃t · (W − Wt))². [sent-137, score-0.124]
51 13: Compute Wt+1 := arg min_{W∈K} Σ_{τ=1}^t fτ(W) + (1/(2D))‖W‖²  (3). 14: end for [sent-139, score-0.123]
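A sketch of one exploration/estimation round (steps 4-11 of Algorithm 1), as best they can be read from this extraction. The softmax prediction distribution P(W, x) here folds the loss parameter α in at α = 1, and the estimator coefficients should be treated as illustrative rather than authoritative:

```python
import numpy as np

def newtron_round(W, x, y_true, gamma, rng):
    """One exploration/estimation round (sketch of Algorithm 1, steps 4-11)."""
    k = W.shape[0]
    z = W @ x
    p = np.exp(z - z.max()); p = p / p.sum()   # p_t = P(W_t, x_t), alpha = 1
    p_hat = (1 - gamma) * p + gamma / k        # mix in uniform exploration
    y_hat = int(rng.choice(k, p=p_hat))        # sample the prediction
    e = np.eye(k)
    if y_hat == y_true:                        # correct guess
        coef = (1 - p[y_true]) / p_hat[y_true]
        l_tilde = coef * np.outer(np.ones(k) / k - e[y_true], x)
        kappa = p_hat[y_true]
    else:                                      # wrong guess: only y_hat ruled out
        coef = p[y_hat] / p_hat[y_hat]
        l_tilde = coef * np.outer(e[y_hat] - np.ones(k) / k, x)
        kappa = 1.0
    return y_hat, l_tilde, kappa

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
y_hat, l_tilde, kappa = newtron_round(W, x, y_true=0, gamma=0.1, rng=rng)
print(y_hat, l_tilde.shape, kappa)
```

The returned ℓ̃t and κt then feed the quadratic cost ft in step 12.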
52 This algorithm is an online version of the Newton step algorithm in offline optimization. [sent-141, score-0.084]
53 The following lemma specifies the algorithm, specialized to our setting, and gives its regret bound. [sent-142, score-0.321]
54 Then the algorithm that, in round t, plays wt := arg min_{w∈K} Σ_{τ=1}^{t−1} fτ(w) has regret bounded by O((nb/a) log(DraT/b)). [sent-146, score-0.725]
55 3 The NEWTRON algorithm: Our algorithm for bandit multiclass learning, dubbed NEWTRON, is shown as Algorithm 1 above. [sent-147, score-0.239]
56 In each iteration, we randomly choose a label from the distribution given by the current weight matrix on the current example, mixed with the uniform distribution over labels according to an exploration parameter γ. [sent-148, score-0.081]
57 The parameter γ (which is similar to the exploration parameter used in the EXP3 algorithm of [ACBFS03]) is eventually tuned based on the value of the parameter α in the loss function (see Corollary 5). [sent-149, score-0.081]
58 We then use the observed feedback to construct a quadratic loss function (which is strongly convex) that lower bounds the true loss function in expectation (see Lemma 7) and thus allows us to bound the regret. [sent-150, score-0.191]
59 Furthermore, we also choose a parameter κt, which is an adjustment factor for the strong convexity of the quadratic loss function, ensuring that its expectation lower bounds the true loss function. [sent-152, score-0.113]
60 Finally, we compute the new weight matrix using a Follow-The-Regularized-Leader strategy, by minimizing the sum of all quadratic loss functions so far with ℓ2 regularization. [sent-153, score-0.137]
61 As described in [HAK07], this convex program can be solved in quadratic time, plus a projection on K in the norm induced by the Hessian. [sent-154, score-0.055]
62 To simplify notation, define the function ℓt : K → R as ℓt(W) = ℓ(W, (xt, yt)). [sent-156, score-0.229]
63 With this notation, we can state our main theorem giving the regret bound (Theorem 4). [sent-161, score-0.25]
64 Given α, there is a setting of γ so that the regret of NEWTRON is bounded by min{(c/α) · exp(4αRD) · log(T), 6cRDT^{2/3}}, where the constant c = O(k³n) is independent of α. [sent-166, score-0.284]
65 Our main result, as given in Corollary 5, entails logarithmic regret for constant α, but contains a constant which depends exponentially on α. [sent-169, score-0.284]
66 Note that even when α grows with T, as long as α ≤ (1/(8RD)) log(T), the regret can be bounded as O(cRD√T), thus solving the open problem of [KSST08, AR09] for log-loss functions with this range of α. [sent-171, score-0.287]
67 We can say something even stronger: our results provide a “safety net”. No matter what the value of α is, the regret of our algorithm is never worse than O(T^{2/3}), matching the bound of the BANDITRON algorithm (although the latter holds for the multiclass hinge loss). [sent-172, score-0.414]
68 The optimization (3) is essentially running the algorithm from Lemma 3 on K with the cost functions ft(W), with nk additional initial fictitious cost functions (1/(2D))(Eil · W)² for i ∈ [n] and l ∈ [k]. [sent-176, score-0.16]
69 While technically these fictitious cost functions are not necessary to prove our regret bound, we include them since this seems to give better experimental performance and only adds a constant to the regret. [sent-178, score-0.287]
70 We now apply the regret bound of Lemma 3 by estimating the parameters r, a, b. [sent-179, score-0.288]
71 Hence, the regret bound of Lemma 3 implies that for any W* ∈ K, Σ_{t=1}^T ft(Wt) − ft(W*) = O((kn/(νβ)) log T). [sent-181, score-0.569]
72 Note that the bound above excludes the fictitious cost functions since they only add a constant additive term to the regret, which is absorbed by the O(log T) term. [sent-182, score-0.075]
73 Similarly, we have also suppressed additive constants arising from the log(DraT/b) term in the regret bound of Lemma 3. [sent-183, score-0.288]
74 Taking expectation on both sides of the above bound with respect to the randomness in the algorithm, and using the specification (2) of ft(W), we get E[ Σ_t ℓ̃t · (Wt − W*) − (1/2)κtβ(ℓ̃t · (Wt − W*))² ] = O((kn/(νβ)) log T). [sent-184, score-0.215]
75 Σ_{t=1}^T E[ℓt(Wt)] − ℓt(W*) = O( (kn/(νβ)) log T + (γ log(k)/α) · T ). [sent-187, score-0.073]
76 The next lemma shows that in each round, the expected regret of the inner FTAL algorithm with the ft cost functions is larger than the regret of NEWTRON. [sent-194, score-0.711]
77 We show that Et[ℓ̃t] = (pt − e_{yt}) ⊗ xt, which by Lemma 1 equals ∇ℓt(Wt). [sent-199, score-0.299]
78 Next, we show that Et[κt ℓ̃t ℓ̃t^⊤] = Ht ⊗ xt xt^⊤ for some matrix Ht. [sent-200, score-0.336]
79 By upper bounding Ht, we then show (using Lemma 2) that for any Ψ ∈ K we have βHt ⊗ xt xt^⊤ ⪯ ∇²ℓt(Ψ). [sent-203, score-0.312]
80 E[ℓ̃t] = p̂t(yt) · ((1 − pt(yt))/p̂t(yt)) · ((1/k)1 − e_{yt}) ⊗ xt + Σ_{y≠yt} p̂t(y) · (pt(y)/p̂t(y)) · (e_y − (1/k)1) ⊗ xt = (pt − e_{yt}) ⊗ xt.  (7) [sent-208, score-1.451]
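The unbiasedness identity in (7) can be verified exactly by enumerating the k possible sampled predictions. This sketch (our variable names; estimator coefficients as read from the text, so treat them as an approximation) checks Σ_ŷ p̂(ŷ) · ℓ̃(ŷ) = (p − e_y) ⊗ x:

```python
import numpy as np

k, n, gamma = 4, 3, 0.2
rng = np.random.default_rng(2)
x = rng.normal(size=n)
p = rng.random(k); p = p / p.sum()       # stand-in for p_t
p_hat = (1 - gamma) * p + gamma / k      # explored sampling distribution
y = 1                                    # the (hidden) true label
e = np.eye(k)

def l_tilde(y_hat):
    # Gradient estimator built from 1-bit feedback (correct / incorrect).
    if y_hat == y:
        return (1 - p[y]) / p_hat[y] * np.outer(np.ones(k) / k - e[y], x)
    return p[y_hat] / p_hat[y_hat] * np.outer(e[y_hat] - np.ones(k) / k, x)

# Exact expectation over y_hat sampled from p_hat:
expectation = sum(p_hat[yh] * l_tilde(yh) for yh in range(k))
target = np.outer(p - e[y], x)           # (p - e_y) (outer) x, as in eq. (7)
print(np.max(np.abs(expectation - target)))  # essentially zero (machine precision)
```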
81 E[κt ℓ̃t ℓ̃t^⊤] = [ p̂t(yt) · κt · ((1 − pt(yt))/p̂t(yt))² · ((1/k)1 − e_{yt})((1/k)1 − e_{yt})^⊤ + Σ_{y≠yt} p̂t(y) · (pt(y)/p̂t(y))² · (e_y − (1/k)1)(e_y − (1/k)1)^⊤ ] ⊗ xt xt^⊤ =: Ht ⊗ xt xt^⊤,  (10) where Ht is the matrix in the brackets above. [sent-211, score-1.884]
82 The spectral norm (i.e., largest eigenvalue) of Ht is bounded as: ‖Ht‖₂ ≤ ‖(1/k)1 − e_{yt}‖² + (1/(1 − γ)²) Σ_{y≠yt} pt(y) ‖e_y − (1/k)1‖² ≤ 10, for γ ≤ 1/2. [sent-216, score-0.351]
83 Now, for any Ψ ∈ K, by Lemma 2, for the specified value of β, we have βHt ⊗ xt xt^⊤ ⪯ (αδ/2) · Ht ⊗ xt xt^⊤ ⪯ ∇²ℓt(Ψ). [sent-217, score-0.312]
84 Finally, we have (1/2)(W* − Wt)^⊤ [ηHt ⊗ xt xt^⊤](W* − Wt) ≤ (1/2) η ‖Ht ⊗ xt xt^⊤‖₂ ‖W* − Wt‖² ≤ 20ηR²D²,  (13) since ‖W* − Wt‖ ≤ 2D. [sent-219, score-0.624]
85 4 Experiments While the theoretical regret bound for N EWTRON is O(log T ) when α = O(1), the provable constant in O(·) notation is quite large, leading one to question the practical performance of the algorithm. [sent-221, score-0.321]
86 PNEWTRON, however, does not have the same regret guarantees as NEWTRON. [sent-241, score-0.25]
87 To derive PNEWTRON, we can restate NEWTRON equivalently as (see [HAK07]): Wt = arg min_{W∈K} (W − W̃t)^⊤ At (W − W̃t), [sent-242, score-0.055]
88 where W̃t = −At^{−1} bt, for At = (1/D)I + Σ_{τ=1}^{t−1} κτ β ℓ̃τ ℓ̃τ^⊤ and bt = Σ_{τ=1}^{t−1} (1 − κτ β ℓ̃τ · Wτ) ℓ̃τ. [sent-246, score-0.072]
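The unconstrained minimizer W̃t = −At^{−1}bt follows from setting the gradient of the FTRL objective to zero, which gives At W + bt = 0. A small sketch (illustrative names and random data, in vectorized form) verifying that the closed form minimizes the quadratic objective:

```python
import numpy as np

# Unconstrained FTRL step in vectorized form: A W + b = 0  =>  W = -A^{-1} b,
# with A = (1/D) I + sum_tau kappa_tau * beta * l_tau l_tau^T and
# b = sum_tau (1 - kappa_tau * beta * (l_tau . W_tau)) l_tau.
rng = np.random.default_rng(3)
d, D, beta = 6, 2.0, 0.5
ls = [rng.normal(size=d) for _ in range(5)]   # past gradient estimates
Ws = [rng.normal(size=d) for _ in range(5)]   # past iterates (vectorized)
kappas = [1.0] * 5

A = np.eye(d) / D + sum(k_ * beta * np.outer(l, l) for k_, l in zip(kappas, ls))
b = sum((1 - k_ * beta * (l @ w)) * l for k_, l, w in zip(kappas, ls, Ws))
W_new = -np.linalg.solve(A, b)

def objective(W):
    # sum_tau f_tau(W) + (1/(2D)) ||W||^2, with the quadratic f_tau costs
    val = (W @ W) / (2 * D)
    for k_, l, w in zip(kappas, ls, Ws):
        g = l @ (W - w)
        val += g + 0.5 * k_ * beta * g * g
    return val

print(np.allclose(A @ W_new + b, 0))                             # True
print(objective(W_new) <= objective(W_new + 1e-3 * np.ones(d)))  # True
```

Since A is positive definite, the objective is strictly convex and the stationary point is the global minimizer; PNEWTRON then only approximates the projection onto K.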
89 For the SYNSEP data set, PNEWTRON very rapidly converges to the lowest possible error rate due to setting the exploration parameter γ = 0. [sent-279, score-0.05]
90 For the REUTERS4 data set, both BANDITRON and PNEWTRON decrease the error rate at roughly the same pace; however, PNEWTRON still consistently obtains better performance by a few percentage points. [sent-292, score-0.056]
92 Is it possible to obtain similar regret guarantees for a linear time algorithm? [sent-319, score-0.25]
93 Our regret bound has an exponentially large constant, which depends on the loss function’s parameters. [sent-320, score-0.334]
94 Does there exist an algorithm with similar regret guarantees but better constants? [sent-321, score-0.266]
95 Competing in the dark: An efficient algorithm for bandit linear optimization. [sent-330, score-0.125]
96 An efficient bandit algorithm for √T-regret in online multiclass prediction? [sent-340, score-0.25]
97 Robbing the bandit: less regret in online geometric optimization against an adaptive adversary. [sent-347, score-0.302]
98 Online convex optimization in the bandit setting: gradient descent without a gradient. [sent-356, score-0.125]
99 Online geometric optimization in the bandit setting against an adaptive adversary. [sent-377, score-0.109]
100 Closing the gap between bandit and full-information online optimization: High-probability regret bound. [sent-380, score-0.411]
wordName wordTfidf (topN-words)
[('ewtron', 0.685), ('wt', 0.294), ('banditron', 0.261), ('regret', 0.25), ('yt', 0.229), ('pn', 0.158), ('xt', 0.156), ('eyt', 0.143), ('pt', 0.142), ('ht', 0.115), ('bandit', 0.109), ('ft', 0.104), ('euters', 0.098), ('ep', 0.089), ('yn', 0.08), ('multiclass', 0.073), ('lemma', 0.071), ('ctitious', 0.065), ('pnewtron', 0.065), ('online', 0.052), ('ftal', 0.049), ('rkn', 0.049), ('ey', 0.049), ('loss', 0.046), ('abernethy', 0.043), ('feedback', 0.04), ('adversarial', 0.039), ('log', 0.039), ('learner', 0.039), ('bound', 0.038), ('label', 0.038), ('kakade', 0.038), ('bt', 0.036), ('kn', 0.034), ('rakhlin', 0.034), ('vt', 0.034), ('rn', 0.033), ('ditron', 0.033), ('drat', 0.033), ('eil', 0.033), ('synnonsep', 0.033), ('prediction', 0.032), ('wk', 0.029), ('satyen', 0.029), ('ban', 0.029), ('wy', 0.029), ('varsha', 0.029), ('elad', 0.028), ('colt', 0.028), ('hazan', 0.026), ('brendan', 0.026), ('kronecker', 0.026), ('rk', 0.026), ('obtains', 0.025), ('round', 0.025), ('recommender', 0.025), ('dubbed', 0.025), ('wi', 0.024), ('matrix', 0.024), ('diag', 0.023), ('newton', 0.023), ('al', 0.023), ('alexander', 0.023), ('xx', 0.022), ('corollary', 0.022), ('dani', 0.022), ('ambuj', 0.022), ('quadratic', 0.021), ('hinge', 0.021), ('pp', 0.021), ('semide', 0.021), ('open', 0.02), ('matrices', 0.02), ('crammer', 0.02), ('cost', 0.02), ('symmetric', 0.02), ('fully', 0.02), ('sham', 0.02), ('et', 0.019), ('arg', 0.019), ('exploration', 0.019), ('soda', 0.019), ('proof', 0.019), ('norm', 0.018), ('jacob', 0.018), ('israel', 0.018), ('bounded', 0.017), ('bandits', 0.017), ('constant', 0.017), ('chose', 0.017), ('parameterized', 0.017), ('question', 0.016), ('convex', 0.016), ('exp', 0.016), ('rate', 0.016), ('algorithm', 0.016), ('incorrect', 0.016), ('denote', 0.015), ('error', 0.015), ('usual', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
Author: Elad Hazan, Satyen Kale
Abstract: We present an efficient algorithm for the problem of online multiclass prediction with bandit feedback in the fully adversarial setting. We measure its regret with respect to the log-loss defined in [AR09], which is parameterized by a scalar α. We prove that the regret of NEWTRON is O(log T) when α is a constant that does not vary with the horizon T, and at most O(T^{2/3}) if α is allowed to increase to infinity with T. For α = O(log T), the regret is bounded by O(√T), thus solving the open problem of [KSST08, AR09]. Our algorithm is based on a novel application of the online Newton method [HAK07]. We test our algorithm and show it to perform well in experiments, even when α is a small constant. 1
2 0.25689733 80 nips-2011-Efficient Online Learning via Randomized Rounding
Author: Nicolò Cesa-bianchi, Ohad Shamir
Abstract: Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning. 1
3 0.25271094 220 nips-2011-Prediction strategies without loss
Author: Michael Kapralov, Rina Panigrahy
Abstract: Consider a sequence of bits where we are trying to predict the next bit from the previous bits. Assume we are allowed to say ‘predict 0’ or ‘predict 1’, and our payoff is +1 if the prediction is correct and −1 otherwise. We will say that at each point in time the loss of an algorithm is the number of wrong predictions minus the number of right predictions so far. In this paper we are interested in algorithms that have essentially zero (expected) loss over any string at any point in time and yet have small regret with respect to always predicting 0 or always predicting 1. For a sequence of length T our algorithm has regret 14√T and loss 2√2 √T e^{−√T} in expectation for all strings. We show that the tradeoff between loss and regret is optimal up to constant factors. Our techniques extend to the general setting of N experts, where the related problem of trading off regret to the best expert for regret to the ’special’ expert has been studied by Even-Dar et al. (COLT’07). We obtain essentially zero loss with respect to the special expert and optimal loss/regret tradeoff, improving upon the results of Even-Dar et al and settling the main question left open in their paper. The strong loss bounds of the algorithm have some surprising consequences. First, we obtain a parameter free algorithm for the experts problem that has optimal regret bounds with respect to k-shifting optima, i.e. bounds with respect to the optimum that is allowed to change arms multiple times. Moreover, for any window of size n the regret of our algorithm to any expert never exceeds O(√(n(log N + log T))), where N is the number of experts and T is the time horizon, while maintaining the essentially zero loss property. 1
4 0.23865575 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax
Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1
5 0.23281616 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
Author: Dan Garber, Elad Hazan
Abstract: In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside with some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric. 1
6 0.20954147 145 nips-2011-Learning Eigenvectors for Free
7 0.17687637 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
8 0.14245763 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
9 0.14078633 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
10 0.14066131 25 nips-2011-Adaptive Hedge
11 0.13348398 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
12 0.1152861 61 nips-2011-Contextual Gaussian Process Bandit Optimization
13 0.11407811 202 nips-2011-On the Universality of Online Mirror Descent
14 0.10979495 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
15 0.10407932 272 nips-2011-Stochastic convex optimization with bandit feedback
16 0.1004485 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
17 0.10013611 218 nips-2011-Predicting Dynamic Difficulty
18 0.096871249 21 nips-2011-Active Learning with a Drifting Distribution
19 0.09294983 162 nips-2011-Lower Bounds for Passive and Active Learning
20 0.090963863 32 nips-2011-An Empirical Evaluation of Thompson Sampling
topicId topicWeight
[(0, 0.188), (1, -0.319), (2, -0.032), (3, -0.07), (4, 0.293), (5, 0.009), (6, 0.097), (7, 0.022), (8, 0.029), (9, -0.016), (10, 0.143), (11, -0.009), (12, 0.112), (13, 0.012), (14, -0.059), (15, 0.028), (16, 0.012), (17, 0.044), (18, 0.01), (19, -0.035), (20, -0.01), (21, -0.013), (22, 0.059), (23, 0.012), (24, 0.002), (25, -0.012), (26, 0.03), (27, -0.05), (28, -0.059), (29, -0.016), (30, 0.018), (31, 0.065), (32, -0.034), (33, 0.011), (34, -0.015), (35, 0.008), (36, -0.02), (37, 0.019), (38, 0.047), (39, -0.051), (40, -0.007), (41, -0.002), (42, 0.094), (43, -0.048), (44, -0.019), (45, 0.005), (46, -0.016), (47, -0.046), (48, 0.008), (49, 0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.95615226 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
Author: Elad Hazan, Satyen Kale
Abstract: We present an efficient algorithm for the problem of online multiclass prediction with bandit feedback in the fully adversarial setting. We measure its regret with respect to the log-loss defined in [AR09], which is parameterized by a scalar α. We prove that the regret of NEWTRON is O(log T) when α is a constant that does not vary with the horizon T, and at most O(T^{2/3}) if α is allowed to increase to infinity with T. For α = O(log T), the regret is bounded by O(√T), thus solving the open problem of [KSST08, AR09]. Our algorithm is based on a novel application of the online Newton method [HAK07]. We test our algorithm and show it to perform well in experiments, even when α is a small constant. 1
2 0.8482216 80 nips-2011-Efficient Online Learning via Randomized Rounding
Author: Nicolò Cesa-bianchi, Ohad Shamir
Abstract: Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning. 1
3 0.82971734 220 nips-2011-Prediction strategies without loss
Author: Michael Kapralov, Rina Panigrahy
Abstract: Consider a sequence of bits where we are trying to predict the next bit from the previous bits. Assume we are allowed to say ‘predict 0’ or ‘predict 1’, and our payoff is +1 if the prediction is correct and −1 otherwise. We will say that at each point in time the loss of an algorithm is the number of wrong predictions minus the number of right predictions so far. In this paper we are interested in algorithms that have essentially zero (expected) loss over any string at any point in time and yet have small regret with respect to always predicting 0 or always predicting 1. For a sequence of length T our algorithm has regret 14√T and loss 2√2 √T e^{−√T} in expectation for all strings. We show that the tradeoff between loss and regret is optimal up to constant factors. Our techniques extend to the general setting of N experts, where the related problem of trading off regret to the best expert for regret to the ’special’ expert has been studied by Even-Dar et al. (COLT’07). We obtain essentially zero loss with respect to the special expert and optimal loss/regret tradeoff, improving upon the results of Even-Dar et al and settling the main question left open in their paper. The strong loss bounds of the algorithm have some surprising consequences. First, we obtain a parameter free algorithm for the experts problem that has optimal regret bounds with respect to k-shifting optima, i.e. bounds with respect to the optimum that is allowed to change arms multiple times. Moreover, for any window of size n the regret of our algorithm to any expert never exceeds O(√(n(log N + log T))), where N is the number of experts and T is the time horizon, while maintaining the essentially zero loss property. 1
4 0.80495209 145 nips-2011-Learning Eigenvectors for Free
Author: Wouter M. Koolen, Wojciech Kotlowski, Manfred K. Warmuth
Abstract: We extend the classical problem of predicting a sequence of outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of n outcomes is replaced by the set of all dyads, i.e. outer products uu where u is a vector in Rn of unit length. Whereas in the classical case the goal is to learn (i.e. sequentially predict as well as) the best multinomial distribution, in the matrix case we desire to learn the density matrix that best explains the observed sequence of dyads. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the n2 parameters of a density matrix is much harder than learning the n parameters of a multinomial distribution. Completely surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequence of dyads share a common eigensystem, i.e. the worst case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret. 1
5 0.75329649 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
Author: Dan Garber, Elad Hazan
Abstract: In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside with some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric. 1
6 0.73553938 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
7 0.73330003 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
8 0.61130899 272 nips-2011-Stochastic convex optimization with bandit feedback
9 0.59623712 21 nips-2011-Active Learning with a Drifting Distribution
10 0.54181278 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
11 0.53890026 25 nips-2011-Adaptive Hedge
12 0.53063536 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
13 0.50979108 61 nips-2011-Contextual Gaussian Process Bandit Optimization
14 0.48051229 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
15 0.47880596 202 nips-2011-On the Universality of Online Mirror Descent
16 0.47112015 162 nips-2011-Lower Bounds for Passive and Active Learning
17 0.46791801 32 nips-2011-An Empirical Evaluation of Thompson Sampling
18 0.45131293 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
19 0.43068066 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
20 0.41341844 177 nips-2011-Multi-armed bandits on implicit metric spaces
topicId topicWeight
[(0, 0.046), (4, 0.096), (20, 0.032), (26, 0.037), (31, 0.058), (33, 0.014), (43, 0.064), (45, 0.125), (57, 0.024), (65, 0.012), (66, 0.202), (74, 0.062), (79, 0.02), (83, 0.067), (99, 0.045)]
simIndex simValue paperId paperTitle
1 0.82696456 148 nips-2011-Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities
Author: Angela Yao, Juergen Gall, Luc V. Gool, Raquel Urtasun
Abstract: A common approach for handling the complexity and inherent ambiguities of 3D human pose estimation is to use pose priors learned from training data. Existing approaches however, are either too simplistic (linear), too complex to learn, or can only learn latent spaces from “simple data”, i.e., single activities such as walking or running. In this paper, we present an efficient stochastic gradient descent algorithm that is able to learn probabilistic non-linear latent spaces composed of multiple activities. Furthermore, we derive an incremental algorithm for the online setting which can update the latent space without extensive relearning. We demonstrate the effectiveness of our approach on the task of monocular and multi-view tracking and show that our approach outperforms the state-of-the-art. 1
2 0.81988162 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
Author: Andre S. Barreto, Doina Precup, Joelle Pineau
Abstract: Kernel-based reinforcement-learning (KBRL) is a method for learning a decision policy from a set of sample transitions which stands out for its strong theoretical guarantees. However, the size of the approximator grows with the number of transitions, which makes the approach impractical for large problems. In this paper we introduce a novel algorithm to improve the scalability of KBRL. We resort to a special decomposition of a transition matrix, called stochastic factorization, to fix the size of the approximator while at the same time incorporating all the information contained in the data. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster but still converges to a unique solution. We derive a theoretical upper bound for the distance between the value functions computed by KBRL and KBSF. The effectiveness of our method is illustrated with computational experiments on four reinforcement-learning problems, including a difficult task in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures in epileptic rat brains. We empirically demonstrate that the proposed approach is able to compress the information contained in KBRL’s model. Also, on the tasks studied, KBSF outperforms two of the most prominent reinforcement-learning algorithms, namely least-squares policy iteration and fitted Q-iteration. 1
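The core trick in the KBSF abstract above is that a transition matrix built from sample transitions can be factored through a small set of representative states, with each factor row-stochastic, so their product is again a valid (but much smaller) transition model. A minimal numpy sketch of that idea, using a Gaussian kernel and random data (the representative states, kernel bandwidth `tau`, and all dimensions here are illustrative assumptions, not the paper's full algorithm):

```python
import numpy as np

def gaussian_kernel_weights(X, Y, tau=1.0):
    """Row-stochastic matrix of Gaussian kernel weights from points X to points Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * tau ** 2))
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, m, dim = 50, 8, 2              # n sample transitions, m representative states
S  = rng.normal(size=(n, dim))    # sampled states
S2 = rng.normal(size=(n, dim))    # sampled next states
R  = rng.normal(size=(m, dim))    # representative states (assumed given)

D = gaussian_kernel_weights(S2, R)   # n x m, row-stochastic
K = gaussian_kernel_weights(R, S)    # m x n, row-stochastic
P_hat = K @ D                        # m x m reduced transition model

# A product of row-stochastic factors is row-stochastic: a valid small MDP,
# whose size m is fixed regardless of the number of sample transitions n.
assert np.allclose(P_hat.sum(axis=1), 1.0)
```

The point of the design is visible in the shapes: the approximator `P_hat` is m x m, so adding more sample transitions grows only the intermediate factors, not the model the planner works with.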
same-paper 3 0.80054867 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
Author: Elad Hazan, Satyen Kale
Abstract: We present an efficient algorithm for the problem of online multiclass prediction with bandit feedback in the fully adversarial setting. We measure its regret with respect to the log-loss defined in [AR09], which is parameterized by a scalar α. We prove that the regret of NEWTRON is O(log T) when α is a constant that does not vary with horizon T, and at most O(T^{2/3}) if α is allowed to increase to infinity with T. For α = O(log T), the regret is bounded by O(√T), thus solving the open problem of [KSST08, AR09]. Our algorithm is based on a novel application of the online Newton method [HAK07]. We test our algorithm and show it to perform well in experiments, even when α is a small constant. 1
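The online Newton method underlying NEWTRON replaces a plain gradient step with a step preconditioned by an accumulated rank-one Hessian approximation. The sketch below shows only that update, in a simplified full-information multiclass logistic setting; the actual NEWTRON additionally handles bandit feedback via randomized exploration, which is omitted here. All dimensions, the synthetic linear data, and the flattened parameterization are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, T = 3, 4, 300                 # classes, features, rounds
W_true = rng.normal(size=(k, d))    # synthetic "adversary": fixed linear labeler
w = np.zeros(k * d)                 # flattened predictor
A = np.eye(k * d)                   # Hessian-style accumulator, initialized to eps*I

def probs(w, x):
    """Softmax class probabilities of the linear predictor on input x."""
    z = w.reshape(k, d) @ x
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

mistakes = []
for t in range(T):
    x = rng.normal(size=d)
    y = int(np.argmax(W_true @ x))          # true label for this round
    p = probs(w, x)
    mistakes.append(int(np.argmax(p) != y))
    # gradient of the multiclass log-loss at (x, y)
    g = np.kron(p - np.eye(k)[y], x)
    A += np.outer(g, g)                     # rank-one Hessian approximation
    w -= np.linalg.solve(A, g)              # online-Newton-style preconditioned step
```

Because `A` grows with the sum of outer products of past gradients, the effective step size shrinks automatically in directions that have already been explored, which is what yields the fast (logarithmic-style) regret rates for exp-concave losses such as this log-loss.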
4 0.75188661 178 nips-2011-Multiclass Boosting: Theory and Algorithms
Author: Mohammad J. Saberian, Nuno Vasconcelos
Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, which updates one predictor component at a time, and 2) GD-MCBoost, based on gradient descent, which updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1
5 0.74642915 301 nips-2011-Variational Gaussian Process Dynamical Systems
Author: Neil D. Lawrence, Michalis K. Titsias, Andreas Damianou
Abstract: High dimensional time series are endemic in applications of machine learning such as robotics (sensor data), computational biology (gene expression data), vision (video sequences) and graphics (motion capture data). Practical nonlinear probabilistic approaches to this data are required. In this paper we introduce the variational Gaussian process dynamical system. Our work builds on recent variational approximations for Gaussian process latent variable models to allow for nonlinear dimensionality reduction simultaneously with learning a dynamical prior in the latent space. The approach also allows for the appropriate dimensionality of the latent space to be automatically determined. We demonstrate the model on a human motion capture data set and a series of high resolution video sequences. 1
6 0.68215948 80 nips-2011-Efficient Online Learning via Randomized Rounding
7 0.67995656 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
8 0.67923671 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
9 0.67801833 29 nips-2011-Algorithms and hardness results for parallel large margin learning
10 0.67774439 64 nips-2011-Convergent Bounds on the Euclidean Distance
11 0.67676824 231 nips-2011-Randomized Algorithms for Comparison-based Search
12 0.67554206 22 nips-2011-Active Ranking using Pairwise Comparisons
13 0.67520165 186 nips-2011-Noise Thresholds for Spectral Clustering
14 0.6728887 127 nips-2011-Image Parsing with Stochastic Scene Grammar
15 0.67120981 150 nips-2011-Learning a Distance Metric from a Network
16 0.67009503 139 nips-2011-Kernel Bayes' Rule
17 0.6694417 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
18 0.66765094 265 nips-2011-Sparse recovery by thresholded non-negative least squares
19 0.66748279 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
20 0.66494548 263 nips-2011-Sparse Manifold Clustering and Embedding