jmlr jmlr2011 jmlr2011-14 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Elad Hazan, Satyen Kale
Abstract: The online multi-armed bandit problem and its generalizations are repeated decision making problems, where the goal is to select one of several possible decisions in every round, and incur a cost associated with the decision, in such a way that the total cost incurred over all iterations is close to the cost of the best fixed decision in hindsight. The difference in these costs is known as the regret of the algorithm. The term bandit refers to the setting where one only obtains the cost of the decision used in a given iteration and no other information. A very general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in some Euclidean space, and the cost functions are linear. Only recently an efficient algorithm attaining Õ(√T) regret was discovered in this setting. In this paper we propose a new algorithm for the bandit linear optimization problem which obtains a tighter regret bound of Õ(√Q), where Q is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment even in the bandit setting. Our algorithm is efficient and applies several new ideas to bandit optimization such as reservoir sampling. Keywords: multi-armed bandit, regret minimization, online learning
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper we propose a new algorithm for the bandit linear optimization problem which obtains a tighter regret bound of Õ(√Q), where Q is the total variation in the cost functions. [sent-11, score-0.547]
2 This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment even in the bandit setting. [sent-12, score-0.648]
3 Our algorithm is efficient and applies several new ideas to bandit optimization such as reservoir sampling. [sent-13, score-0.362]
4 Keywords: multi-armed bandit, regret minimization, online learning. [sent-14, score-0.295]
5 The sublinear (in T) regret bound implies that on average, the algorithm’s cost converges to that of the best fixed action in hindsight. [sent-34, score-0.327]
6 Even though the Õ(√T) dependence on T was a great achievement, this regret bound is weak from the point of view of real-world bandit scenarios. [sent-44, score-0.429]
7 In this paper we present the first such bandit optimization algorithm in the worst-case adversarial setting, with regret bounded by Õ(√Q), where Q is the total observed variation in costs, defined as the sum of squared deviations of the cost vectors from their mean. [sent-54, score-0.51]
8 The conjecture that the regret of online learning algorithms should be bounded in terms of the total variation was put forth by Cesa-Bianchi et al. [sent-64, score-0.352]
9 In this paper, we prove the surprising fact that such a regret bound of Õ(√Q) is possible to obtain even when the only information available to the player is the cost she incurred (in particular, we may not even be able to estimate Q accurately in this model). [sent-69, score-0.359]
10 An additional difficulty which arises is the fact that a learning rate parameter η needs to be set based on the total variation Q to obtain the Õ(√Q) regret bound. [sent-76, score-0.298]
11 We consider the online linear optimization model in which the online player iteratively chooses a point xt ∈ K, where K ⊆ Rn is a convex compact set called the decision set. [sent-87, score-0.844]
12 After her choice, an adversary supplies a linear cost function ft , and the player incurs a cost of ft (xt ). [sent-88, score-1.037]
13 With some abuse of notation, we use ft to also denote the cost vector such that ft(x) = ft⊤x. [sent-90, score-0.939]
14 The only information available to the player is the cost incurred, that is, the scalar ft (xt ). [sent-91, score-0.528]
15 The standard game-theoretic measure of performance is regret, defined as RegretT = ∑_{t=1}^T ft(xt) − min_{x∈K} ∑_{t=1}^T ft(x). [sent-93, score-0.892]
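As a concrete illustration (not taken from the paper), the following sketch computes this regret for the special case where K is the probability simplex, so that the best fixed decision in hindsight is a vertex; the function name and array layout are hypothetical.

```python
# Minimal sketch: hindsight regret for linear costs over the probability simplex.
import numpy as np

def simplex_regret(costs, plays):
    """costs: (T, n) array of cost vectors f_t; plays: (T, n) array of points x_t in the simplex."""
    incurred = np.einsum('ti,ti->', costs, plays)  # sum_t f_t(x_t)
    cumulative = costs.sum(axis=0)                 # sum_t f_t
    best_fixed = cumulative.min()                  # a linear cost over the simplex is minimized at a vertex
    return float(incurred - best_fixed)
```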
16 We assume that the cost vectors are scaled so that their norms are bounded by one, that is, ‖ft‖ ≤ 1. [sent-95, score-0.493]
17 This scaling only changes the regret bound by a constant factor: if G is a known upper bound on the norms of the cost vectors, we can scale down the cost vectors by G and run the algorithm; the actual regret is G times larger than the bound obtained here. [sent-96, score-0.642]
18 We denote by QT the total quadratic variation in cost vectors, that is, QT := ∑_{t=1}^T ‖ft − µ‖², where µ = (1/T) ∑_{t=1}^T ft is the mean of all cost vectors. [sent-105, score-1.043]
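As a small illustration of this definition (the names below are ours, not the paper's), the quantity QT can be computed directly from the sequence of cost vectors:

```python
# Minimal sketch: total quadratic variation Q_T = sum_t ||f_t - mu||^2 of a cost sequence.
import numpy as np

def total_variation(costs):
    """costs: (T, n) array whose rows are the cost vectors f_1, ..., f_T."""
    mu = costs.mean(axis=0)                  # mu = (1/T) sum_t f_t
    return float(((costs - mu) ** 2).sum())  # Q_T
```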
19 For every subsequent element ft, we decide to include it in S with probability k/t. [sent-129, score-0.446]
20 If the decision to include it is made, then a random element of S is replaced by ft. [sent-130, score-0.463]
21 …, ft} \ S, which gets replaced by ft+1 in the (t + 1)-th round. [sent-149, score-0.446]
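A minimal sketch of this select-and-replace rule (the classical reservoir-sampling update; the interface below is illustrative):

```python
# Keep a uniform random sample S of size k from a stream, one element at a time.
import random

def reservoir_update(S, f_t, t, k):
    """S: current reservoir (list of at most k values); f_t: the t-th stream element (t is 1-based)."""
    if len(S) < k:
        S.append(f_t)                 # the first k elements simply fill the reservoir
    elif random.random() < k / t:     # include f_t with probability k/t
        S[random.randrange(k)] = f_t  # ... replacing a uniformly random element of S
    return S
```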
22 Then there exists a polynomial time algorithm for this online linear optimization problem (Algorithm 1 below coupled with the halving procedure of Section 5) whose expected regret is bounded as follows. [sent-203, score-0.309]
23 This theorem can be used with the well-known logarithmic barrier to derive regret bounds for the online-shortest-paths problem and other linearly constrained problems, and is of course applicable much more generally. [sent-206, score-0.352]
24 The additional factor of √n is because our results assume that ‖ft‖ ≤ 1, and so we need to scale the costs down by √n to apply our bounds. [sent-214, score-0.467]
25 Here, f̃t is an estimator for the vector ft, which is carefully constructed to have low variance. [sent-222, score-0.462]
26 In the full-information setting, when we can simply set f̃t = ft, such an algorithm can be shown to achieve low regret (see exposition in Abernethy et al. [sent-223, score-0.687]
27 , 2008) which produce an unbiased estimator f̃t of ft by evaluating ft at just one point. [sent-227, score-0.93]
28 It essentially performs reservoir sampling on all the coordinates with a reservoir of size k. [sent-251, score-0.428]
29 Algorithm 1 (Bandit Online Linear Optimization). 1: Input: η > 0, a ϑ-self-concordant barrier R, reservoir size parameter k. 2: Initialization: for all i ∈ [n], j ∈ [k], set Si,j = 0. [sent-257, score-0.296]
30 … 18: end if. 19: Update xt+1 = arg min_{x∈K} Φt(x), where Φt(x) := η ∑_{τ=1}^t f̃τ⊤x + R(x). 20: end for. The point yt chosen by the algorithm is the corresponding vertex γeit of the (γ-scaled) n-dimensional simplex (which is assumed to be contained inside of K), so as to obtain the coordinate ft(it) as the cost. [sent-276, score-0.479]
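The update in step 19 above is a follow-the-regularized-leader step over K. The sketch below is a rough illustration of such a step (not the paper's implementation), taking K to be the unit Euclidean ball with the self-concordant barrier R(x) = −log(1 − ‖x‖²) and using a crude gradient method; the names and the solver choice are ours.

```python
# Illustrative FTRL step: x_{t+1} = argmin_{x in K} eta * (sum_tau ftilde_tau)^T x + R(x),
# with K the open unit ball and R(x) = -log(1 - ||x||^2).
import numpy as np

def ftrl_update(cum_est, eta, x0, steps=500, lr=0.02):
    """cum_est: sum of estimated cost vectors ftilde_1 + ... + ftilde_t; x0: interior starting point."""
    x = x0.copy()
    for _ in range(steps):
        grad = eta * cum_est + 2.0 * x / (1.0 - x @ x)  # gradient of the barrier-regularized objective
        step = lr
        y = x - step * grad
        while y @ y >= 1.0:          # backtrack so the iterate stays strictly inside the ball
            step /= 2.0
            y = x - step * grad
        x = y
    return x
```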
31 The algorithm then chooses one of Sit,1, . . . , Sit,k uniformly at random and replaces it with the value ft(it), and updates µ̃t. [sent-280, score-0.446]
32 The point yt chosen by the algorithm is drawn uniformly at random from the endpoints of the principal axes of the Dikin ellipsoid W1(xt) centered at xt. [sent-284, score-0.853]
33 7: end if. 8: Update the sample: set Sit,j to the observed value ft(it). [sent-298, score-0.446]
34 4: Observe the cost ft⊤yt. 5: Return f̃t defined as: f̃t = µ̃t + g̃t, where g̃t := n (ft⊤yt − µ̃t⊤yt) εt λit^{1/2} vit. [sent-317, score-0.859]
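A rough sketch of how such an ELLIPSOID SAMPLE step and this estimator could be realized, assuming access to the Hessian ∇²R(xt) and the reservoir mean µ̃t (the function and argument names are illustrative):

```python
# Play an endpoint of a principal axis of the Dikin ellipsoid at x_t and build
# ftilde_t = mutilde_t + n * (f_t^T y_t - mutilde_t^T y_t) * eps * sqrt(lambda_i) * v_i.
import numpy as np

def ellipsoid_sample_and_estimate(x, hessian, mu_est, observe_cost, rng):
    n = x.shape[0]
    lam, V = np.linalg.eigh(hessian)         # eigenpairs of grad^2 R(x_t)
    i = rng.integers(n)                      # uniformly random axis
    eps = rng.choice([-1.0, 1.0])            # uniformly random sign
    y = x + eps * V[:, i] / np.sqrt(lam[i])  # endpoint of the i-th principal axis of W_1(x_t)
    c = observe_cost(y)                      # the only feedback: the scalar cost f_t^T y_t
    g = n * (c - mu_est @ y) * eps * np.sqrt(lam[i]) * V[:, i]
    return y, mu_est + g                     # the played point and the estimator ftilde_t
```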
35 For t > nk the claim follows from the properties of reservoir sampling, as we show now. [sent-333, score-0.304]
36 Then, at time t = nk + 1 and onwards, reservoir sampling performs select-and-replace with probability k/t (i. [sent-336, score-0.32]
37 e., it selects ft(i) with probability k/t and replaces a random element of the previous Si with it). [sent-338, score-0.446]
38 Analysis. In this section, we prove a regret bound in a slightly easier setting where we know an upper bound Q on the total variation QT. [sent-342, score-0.32]
39 We first relate the expected regret of Algorithm 1, which plays the points yt, for t = 1, 2, . [sent-349, score-0.34]
40 . . with the ft cost vectors, to the expected regret of another algorithm that plays the points xt with the f̃t cost vectors. [sent-352, score-1.434]
41 Intuitively, this bound holds since in every ELLIPSOID SAMPLE step the expectations of f̃t and yt (conditioned on all previous randomization) are ft and xt respectively, so the expected costs for both algorithms are the same in such rounds. [sent-354, score-1.687]
42 , f̃T ∈ Rn, the FTRL algorithm with a ϑ-self-concordant barrier R has the following regret guarantee: for any u ∈ K, we have ∑_{t=1}^T f̃t⊤(xt − u) ≤ ∑_{t=1}^T f̃t⊤(xt − xt+1) + (2/η) ϑ log T. [sent-361, score-0.347]
43 Then we have f̃t⊤(xt − xt+1) ≤ 64ηn² ‖ft − µt‖² + 64ηn² ‖µt − µ̃t‖² + 2µt⊤(xt − xt+1). [sent-366, score-0.446]
44 Trivially, since we set f̃t = 0 in such steps, we have xt = xt+1. [sent-368, score-0.653]
45 By adding the non-negative term 64ηn² ‖ft − µt‖², we get that for any SIMPLEX SAMPLE step t, we have f̃t⊤(xt − xt+1) ≤ 64ηn² ‖ft − µt‖² + 2µt⊤(xt − xt+1). [sent-370, score-0.907]
46 Summing up either (4) or (5), as the case may be, over all time periods t we get ∑_{t=1}^T f̃t⊤(xt − xt+1) ≤ 64ηn² ∑_{t=1}^T ‖ft − µt‖² + 64ηn² ∑_{t∈TE} ‖µt − µ̃t‖² + 2 ∑_{t=1}^T µt⊤(xt − xt+1) (6). We bound each term of inequality (6) above separately. [sent-372, score-0.483]
47 Hence, Et[g̃t] = ∑_{i=1}^n (1/n) · n((ft − µ̃t)⊤vi)vi = ft − µ̃t, since the vi form an orthonormal basis. [sent-393, score-0.446]
48 Furthermore, it is easy to see that Et[yt] = xt, since yt is drawn from a symmetric distribution centered at xt (namely, the uniform distribution on the endpoints of the principal axes of the Dikin ellipsoid centered at xt). [sent-395, score-2.175]
49 In this case, we have |ft⊤(yt − u)| ≤ ‖ft‖ ‖yt − u‖ ≤ 2, and f̃t⊤(xt − u) = 0 since f̃t = 0. [sent-398, score-0.545]
50 Lemma 13. For any time period t ≥ nk, the next minimizer xt+1 is “close” to xt: xt+1 ∈ W1/2(xt). [sent-411, score-0.672]
51 Proof. If t is a SIMPLEX SAMPLE step, then xt = xt+1 and the lemma is trivial. [sent-412, score-0.689]
52 Now, recall that xt+1 = arg min_{x∈K} Φt(x) and xt = arg min_{x∈K} Φt−1(x), where Φt(x) = η ∑_{s=1}^t f̃s⊤x + R(x). [sent-414, score-0.689]
53 Since the barrier function R goes to infinity as we get close to the boundary, the points xt and xt+1 are both in the interior of K . [sent-415, score-0.778]
54 First, note that since xt is in the interior of K, the first-order optimality condition gives ∇Φt−1(xt) = 0, and we conclude that ∇Φt(xt) = ηf̃t. [sent-417, score-0.688]
55 Now consider any point y on the boundary of W1/2(xt), that is, y = xt + h for some vector h such that ‖h‖_{xt} = 1/2. [sent-418, score-1.34]
56 Using the multivariate Taylor expansion, Φt(y) = Φt(xt + h) = Φt(xt) + ∇Φt(xt)⊤h + ½ h⊤∇²Φt(ξ)h = Φt(xt) + ηf̃t⊤h + ½ h⊤∇²Φt(ξ)h (7) for some ξ on the line segment between xt and xt + h. [sent-419, score-1.306]
57 This latter fact also implies that ‖ξ − xt‖_{xt} ≤ ‖h‖_{xt} ≤ 1/2. [sent-420, score-1.959]
58 Hence, by (3), ∇²R(ξ) ⪰ (1 − ‖ξ − xt‖_{xt})² ∇²R(xt) ⪰ ¼ ∇²R(xt). [sent-421, score-1.306]
59 Thus h⊤∇²R(ξ)h ≥ ¼ ‖h‖²_{xt} = 1/16. [sent-422, score-0.653]
60 Next, we bound |f̃t⊤h| as follows: |f̃t⊤h| ≤ ‖f̃t‖*_{xt} ‖h‖_{xt} ≤ ½ ‖f̃t‖*_{xt}. Claim 2: ‖f̃t‖*_{xt} ≤ 3n. [sent-423, score-3.08]
61 We have ‖g̃t‖*²_{xt} = ‖n((ft − µ̃t)⊤yt) εt λit^{1/2} vit‖*²_{xt} = n²((ft − µ̃t)⊤yt)² λit · vit⊤[∇²R(xt)]⁻¹vit = n²((ft − µ̃t)⊤yt)², since vit⊤[∇²R(xt)]⁻¹vit = 1/λit. [sent-426, score-1.259]
62 Hence, ‖f̃t‖*_{xt} ≤ ‖µ̃t‖*_{xt} + ‖g̃t‖*_{xt} ≤ ‖µ̃t‖ + n|(ft − µ̃t)⊤yt| ≤ 3n, since ‖µ̃t‖*_{xt} ≤ ‖µ̃t‖ ≤ 1. [sent-427, score-2.75]
63 We also used the facts that ‖yt‖ ≤ 1 and ‖ft − µ̃t‖ ≤ 2. [sent-428, score-0.545]
64 Lemma 14. For any time period t ≥ nk, we have ‖xt − xt+1‖²_{xt} ≤ 4η f̃t⊤(xt − xt+1). [sent-431, score-1.325]
65 Thus, we have ‖xt+1 − xt‖²_{zt} ≤ Φt(xt) − Φt(xt+1) = Φt−1(xt) − Φt−1(xt+1) + η f̃t⊤(xt − xt+1) ≤ η f̃t⊤(xt − xt+1), since xt is the minimizer of Φt−1 in K. [sent-434, score-1.34]
66 It remains to show that ¼ ‖xt+1 − xt‖²_{xt} ≤ ‖xt+1 − xt‖²_{zt}, for which it suffices to show ¼ ∇²R(xt) ⪯ ∇²R(zt). [sent-435, score-1.306]
67 By Lemma 13 we have xt+1 ∈ W1/2 (xt ), and hence zt ∈ W1/2 (xt ) (since zt is on the line segment between xt and xt+1 ). [sent-436, score-0.721]
68 Proof [Lemma 9]. First, we have (f̃t − µt)⊤(xt − xt+1) ≤ ‖f̃t − µt‖*_{xt} · ‖xt − xt+1‖_{xt} (by (1)) ≤ ‖f̃t − µt‖*_{xt} · √(4η f̃t⊤(xt − xt+1)) (Lemma 14) ≤ 2η ‖f̃t − µt‖*²_{xt} + ½ f̃t⊤(xt − xt+1). [sent-438, score-3.299]
69 Simplifying, we get that f̃t⊤(xt − xt+1) ≤ 4η ‖f̃t − µt‖*²_{xt} + 2µt⊤(xt − xt+1) ≤ 8η [ ‖f̃t − µ̃t‖*²_{xt} + ‖µ̃t − µt‖*²_{xt} ] + 2µt⊤(xt − xt+1) ≤ 32η [ ‖g̃t‖*²_{xt} + ‖µt − µ̃t‖² ] + 2µt⊤(xt − xt+1). [sent-440, score-1.36]
70 Using the definition of g̃t from Algorithm 3, we get that ‖g̃t‖*²_{xt} = n²((ft − µ̃t)⊤yt)² λit · vit⊤[∇²R(xt)]⁻¹vit = n²((ft − µ̃t)⊤yt)² ≤ n² ‖ft − µ̃t‖² ≤ 2n² [ ‖ft − µt‖² + ‖µt − µ̃t‖² ]. [sent-442, score-1.926]
71 Plugging this bound into the previous bound we conclude that f̃t⊤(xt − xt+1) ≤ 64ηn² ‖ft − µt‖² + 64ηn² ‖µt − µ̃t‖² + 2µt⊤(xt − xt+1). [sent-444, score-0.49]
72 As a first step, we show that ∑_{t=1}^T ‖ft − µt‖² ≤ ∑_{t=1}^T ‖ft − µT‖². [sent-446, score-0.892]
73 Moving to T, we have ∑_{t=1}^T ‖ft − µt‖² = ∑_{t=1}^{T−1} ‖ft − µt‖² + ‖fT − µT‖² ≤ ∑_{t=1}^{T−1} ‖ft − µT−1‖² + ‖fT − µT‖² (by the inductive hypothesis) ≤ ∑_{t=1}^{T−1} ‖ft − µT‖² + ‖fT − µT‖² (since µT−1 = arg min_x ∑_{t=1}^{T−1} ‖ft − x‖²) = ∑_{t=1}^T ‖ft − µT‖². [sent-450, score-2.723]
74 Hence, ∑_{t=1}^T ‖ft − µt‖² ≤ ∑_{t=1}^T ‖ft − µ‖² = QT. [sent-451, score-0.892]
75 Proof [Lemma 11]. Any ELLIPSOID SAMPLE step t must have t ≥ nk, so by Claim 1 the algorithm exactly implements reservoir sampling with a reservoir of size k for each of the n coordinates. [sent-452, score-0.428]
76 Thus, since ‖xt‖ ≤ 1 and ‖µt‖ ≤ 1, we have ∑_{t=1}^T µt⊤(xt − xt+1) ≤ ∑_{t=1}^T ‖µt+1 − µt‖ + 4. [sent-465, score-0.653]
77 Arguing as in Lemma 10, we have ∑_{t=1}^T xt² = ∑_{t=1}^T ‖ft − µt‖² ≤ ∑_{t=1}^T ‖ft − µ‖² ≤ QT (where, overloading notation, xt here stands for ‖ft − µt‖). [sent-469, score-0.892]
78 Notice that µt − µt−1 = (1/t) ∑_{τ=1}^t fτ − (1/(t−1)) ∑_{τ=1}^{t−1} fτ = (1/t) ft + (1/t − 1/(t−1)) ∑_{τ=1}^{t−1} fτ = (1/t)(ft − µt−1). [sent-470, score-0.446]
79 Hence, ‖µt − µt−1‖ = (1/t) ‖ft − µt−1‖ ≤ (1/t) xt + (1/t) ‖µt − µt−1‖, from which we conclude that for all t ≥ 2 we have ‖µt − µt−1‖ ≤ t⁻¹ xt / (1 − t⁻¹) ≤ (2/t) xt, and we then apply Lemma 16 to conclude the bound on ∑_t (2/t) xt. [sent-471, score-1.752]
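As a small sanity check of this identity (ours, not from the paper), the running mean can be maintained incrementally:

```python
# Quick numerical check of mu_t - mu_{t-1} = (f_t - mu_{t-1}) / t.
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal((10, 3))                  # a short, made-up stream of cost vectors
mu = f[0].copy()
for t in range(2, f.shape[0] + 1):
    mu_prev = mu.copy()
    mu = mu_prev + (f[t - 1] - mu_prev) / t       # incremental update of the running mean
    assert np.allclose(mu, f[:t].mean(axis=0))    # matches the batch mean of the first t vectors
```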
80 The first lemma gives a general regret bound for any follow-the-regularized-leader style algorithm. [sent-476, score-0.299]
81 Lemma 15. Consider an online linear optimization instance over a convex set K, with a regularization function R and a sequence {xt} defined by xt = arg min_{x∈K} [ ∑_{τ=1}^{t−1} fτ⊤x + R(x) ]. [sent-478, score-0.756]
82 For every u ∈ K, the sequence {xt} satisfies the following regret guarantee: ∑_{t=1}^T ft⊤(xt − u) ≤ ∑_{t=1}^T ft⊤(xt − xt+1) + (1/η) [R(u) − R(x1)]. [sent-479, score-0.687]
83 Now assume that for some T ≥ 1, we have ∑_{t=0}^T [ ft(xt) − ft(u) ] ≤ ∑_{t=0}^T [ ft(xt) − ft(xt+1) ]. [sent-483, score-1.784]
84 We conclude that ∑_{t=1}^T [ ft(xt) − ft(u) ] ≤ ∑_{t=1}^T [ ft(xt) − ft(xt+1) ] + [ −f0(x0) + f0(u) + f0(x0) − f0(x1) ] = ∑_{t=1}^T [ ft(xt) − ft(xt+1) ] + (1/η) [R(u) − R(x1)]. [sent-487, score-2.676]
85 Proof. By Lemma 17 below, the values of xt that maximize ∑_{t=1}^T (1/t) xt must have the following structure: there is a k such that for all t ≤ k, we have xt = 1, and for any index t > k, we have xk+1/xt ≥ (1/k)/(1/t), which implies that xt ≤ k/t. [sent-493, score-2.612]
86 Now we can bound the value as follows: ∑_{t=1}^T (1/t) xt ≤ ∑_{t=1}^k (1/t) + ∑_{t=k+1}^T k/t² ≤ log(k + 1) + k · (1/k) ≤ log(Q + 1) + 1. [sent-495, score-0.675]
87 Typically, in online learning scenarios where a regret bound of O(√AT) for some quantity AT which grows with T is desired, one first gives an online learning algorithm L(η), where η ≤ 1 is a learning rate parameter, which obtains a regret bound of RegretT ≤ ηAT + O(1/η). [sent-520, score-0.649]
88 Then, we can obtain a master online learning algorithm whose regret grows like O(√AT) as follows. [sent-521, score-0.328]
89 Thus, phase i ends at time period t − 1, and the point xt computed by L(ηi) is discarded by the master algorithm, since L(ηi+1) starts at this point and xt is reset to the initial point of L(ηi+1). [sent-542, score-1.394]
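A generic sketch of such an η-halving master scheme (the paper's precise restart test is not reproduced here; `make_learner` and `phase_should_end` are placeholder names for a factory that builds a fresh copy of L(η) and for the condition that ends a phase):

```python
# Run L(eta) in phases; when the current phase's budget is exhausted,
# halve eta and restart a fresh copy of the learner from its initial point.
def master(make_learner, eta0, T, phase_should_end, get_cost):
    eta = eta0
    learner = make_learner(eta)
    for t in range(1, T + 1):
        x = learner.play()
        cost = get_cost(x)               # bandit feedback for round t
        learner.update(cost)
        if phase_should_end(learner, eta):
            eta /= 2.0                   # halve the learning rate ...
            learner = make_learner(eta)  # ... and restart L with the new eta
    return learner
```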
90 Applying the bound (6), and using the fact that ηi⋆−1 = 2ηi⋆, we get ∑_{t∈J} f̃t⊤(xt − xt+1) ≤ 128ηi⋆n² ∑_{t∈J} ‖ft − µt‖² + 128ηi⋆n² ∑_{t∈JE} ‖µt − µ̃t‖² + 2 ∑_{t∈J} µt⊤(xt − xt+1). [sent-560, score-0.483]
91 Putting these together, and dividing by ηi⋆, we get (1/η²i⋆) ϑ log(T) ≤ 128n² ∑_{t∈J} ‖ft − µt‖² + 128n² ∑_{t∈JE} ‖µt − µ̃t‖² + (2/ηi⋆) ∑_{t∈J} µt⊤(xt − xt+1). (9) [sent-561, score-0.461]
92 Lemmas 10 and 12 give us the following upper bounds: ∑_{t∈J} ‖ft − µt‖² ≤ QT and ∑_{t∈J} µt⊤(xt − xt+1) ≤ 2 log(QT + 1) + 4. [sent-562, score-0.446]
93 Conclusions and Open Problems. In this paper, we gave the first bandit online linear optimization algorithm whose regret is bounded by the square root of the total quadratic variation of the cost vectors. [sent-572, score-0.564]
94 This algorithm continues a line of work which aims to prove variation-based regret bounds for any online learning framework. [sent-574, score-0.317]
95 So far, such bounds have been obtained for four major online learning scenarios: expert prediction, online linear optimization, portfolio selection (and exp-concave cost functions), and bandit online linear optimization in this paper. [sent-575, score-0.41]
96 The concept of variational regret bounds in the setting of the ubiquitous multi-armed bandit problem opens many interesting directions for further research and open questions: 1. [sent-576, score-0.414]
97 In the stochastic multi-armed bandit setting, the regret is known to be bounded by a logarithm in the number of iterations rather than square root (Auer et al. [sent-579, score-0.392]
98 For the special case of the classic non-stochastic MAB problem, obtain regret bounds which depend on the variation of the best action in hindsight (vs. [sent-584, score-0.321]
99 Is it possible to improve regret for the classic non-stochastic multi-armed bandit problem without using the self-concordance methodology (perhaps by extending the algorithm in Hazan and Kale (2008) to the bandit setting)? [sent-587, score-0.543]
100 Robbing the bandit: less regret in online geometric optimization against an adaptive adversary. [sent-665, score-0.309]
wordName wordTfidf (topN-words)
[('xt', 0.653), ('ft', 0.446), ('regret', 0.241), ('qt', 0.201), ('reservoir', 0.197), ('ample', 0.167), ('bandit', 0.151), ('implex', 0.115), ('llipsoid', 0.115), ('yt', 0.099), ('ale', 0.098), ('enign', 0.098), ('vit', 0.09), ('nk', 0.089), ('azan', 0.084), ('etter', 0.084), ('bandits', 0.075), ('barrier', 0.075), ('dikin', 0.074), ('ellipsoid', 0.07), ('hazan', 0.063), ('traf', 0.063), ('abernethy', 0.056), ('online', 0.054), ('cost', 0.047), ('lgorithms', 0.041), ('regrett', 0.041), ('variation', 0.041), ('gt', 0.039), ('phase', 0.036), ('lemma', 0.036), ('player', 0.035), ('interior', 0.035), ('kale', 0.035), ('sampling', 0.034), ('zt', 0.034), ('master', 0.033), ('replacement', 0.033), ('log', 0.031), ('stream', 0.029), ('minx', 0.029), ('te', 0.029), ('kt', 0.029), ('dani', 0.028), ('auer', 0.027), ('lemmas', 0.027), ('je', 0.025), ('round', 0.025), ('ellipsoidal', 0.025), ('var', 0.023), ('hide', 0.023), ('day', 0.023), ('sit', 0.023), ('bounds', 0.022), ('unbiased', 0.022), ('bound', 0.022), ('kalai', 0.021), ('nemirovskii', 0.021), ('flaxman', 0.021), ('streaming', 0.021), ('costs', 0.021), ('period', 0.019), ('barriers', 0.019), ('routes', 0.019), ('boundary', 0.019), ('arg', 0.018), ('claim', 0.018), ('decision', 0.017), ('trick', 0.017), ('action', 0.017), ('convex', 0.017), ('awerbuch', 0.016), ('commuting', 0.016), ('eit', 0.016), ('minkowsky', 0.016), ('satyen', 0.016), ('vitter', 0.016), ('estimator', 0.016), ('adversary', 0.016), ('robbins', 0.016), ('route', 0.016), ('proof', 0.016), ('total', 0.016), ('centered', 0.016), ('dependence', 0.015), ('randomness', 0.015), ('get', 0.015), ('doesn', 0.015), ('invoked', 0.015), ('incur', 0.015), ('axes', 0.015), ('obtains', 0.015), ('ti', 0.015), ('inside', 0.015), ('incurred', 0.014), ('philadelphia', 0.014), ('spent', 0.014), ('optimization', 0.014), ('logarithmic', 0.014), ('portfolio', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000024 14 jmlr-2011-Better Algorithms for Benign Bandits
Author: Elad Hazan, Satyen Kale
Abstract: The online multi-armed bandit problem and its generalizations are repeated decision making problems, where the goal is to select one of several possible decisions in every round, and incur a cost associated with the decision, in such a way that the total cost incurred over all iterations is close to the cost of the best fixed decision in hindsight. The difference in these costs is known as the regret of the algorithm. The term bandit refers to the setting where one only obtains the cost of the decision used in a given iteration and no other information. A very general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in some Euclidean space, and the cost functions are linear. Only recently an efficient algorithm attaining Õ(√T) regret was discovered in this setting. In this paper we propose a new algorithm for the bandit linear optimization problem which obtains a tighter regret bound of Õ(√Q), where Q is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment even in the bandit setting. Our algorithm is efficient and applies several new ideas to bandit optimization such as reservoir sampling. Keywords: multi-armed bandit, regret minimization, online learning
2 0.47005874 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Author: John Duchi, Elad Hazan, Yoram Singer
Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms. Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization
3 0.17368273 104 jmlr-2011-X-Armed Bandits
Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári
Abstract: We consider a generalization of stochastic bandits where the set of arms, X , is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by √n, that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches. Keywords: bandits with infinitely many arms, optimistic online optimization, regret bounds, minimax rates
4 0.15784511 28 jmlr-2011-Double Updating Online Learning
Author: Peilin Zhao, Steven C.H. Hoi, Rong Jin
Abstract: In most kernel based online learning algorithms, when an incoming instance is misclassified, it will be added into the pool of support vectors and assigned with a weight, which often remains unchanged during the rest of the learning process. This is clearly insufficient since when a new support vector is added, we generally expect the weights of the other existing support vectors to be updated in order to reflect the influence of the added support vector. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short, that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified example received at the current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be improved by the proposed online learning method. We conduct an extensive set of empirical evaluations for both binary and multi-class online learning tasks. The experimental results show that the proposed technique is considerably more effective than the state-of-the-art online learning algorithms. The source code is available to public at http://www.cais.ntu.edu.sg/˜chhoi/DUOL/. Keywords: online learning, kernel method, support vector machines, maximum margin learning, classification
5 0.13948146 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir
Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, when the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that reveal that our algorithm has a good performance against both partial information and full information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory
6 0.087120853 36 jmlr-2011-Generalized TD Learning
7 0.0738234 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
8 0.069722287 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
9 0.068967715 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
10 0.066660851 35 jmlr-2011-Forest Density Estimation
11 0.059048586 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
12 0.056199875 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
13 0.052602787 101 jmlr-2011-Variable Sparsity Kernel Learning
14 0.052596617 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
15 0.045657393 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
16 0.042885128 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
17 0.034409668 91 jmlr-2011-The Sample Complexity of Dictionary Learning
18 0.032945022 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
19 0.0291839 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
20 0.028568555 22 jmlr-2011-Differentially Private Empirical Risk Minimization
topicId topicWeight
[(0, 0.224), (1, 0.419), (2, -0.159), (3, 0.011), (4, 0.276), (5, 0.157), (6, 0.496), (7, 0.057), (8, 0.003), (9, -0.172), (10, 0.025), (11, 0.047), (12, -0.057), (13, -0.062), (14, -0.112), (15, 0.034), (16, 0.019), (17, -0.037), (18, -0.015), (19, 0.084), (20, -0.092), (21, 0.027), (22, 0.019), (23, 0.036), (24, 0.008), (25, -0.048), (26, 0.028), (27, 0.012), (28, -0.021), (29, -0.024), (30, 0.013), (31, -0.026), (32, 0.065), (33, 0.044), (34, 0.034), (35, 0.027), (36, -0.006), (37, 0.002), (38, 0.012), (39, 0.017), (40, -0.023), (41, 0.017), (42, 0.011), (43, -0.032), (44, -0.007), (45, 0.01), (46, -0.011), (47, 0.002), (48, -0.026), (49, 0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.99168861 14 jmlr-2011-Better Algorithms for Benign Bandits
Author: Elad Hazan, Satyen Kale
Abstract: The online multi-armed bandit problem and its generalizations are repeated decision making problems, where the goal is to select one of several possible decisions in every round, and incur a cost associated with the decision, in such a way that the total cost incurred over all iterations is close to the cost of the best fixed decision in hindsight. The difference in these costs is known as the regret of the algorithm. The term bandit refers to the setting where one only obtains the cost of the decision used in a given iteration and no other information. A very general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in some Euclidean space, and the cost functions are linear. Only recently an efficient algorithm attaining Õ(√T) regret was discovered in this setting. In this paper we propose a new algorithm for the bandit linear optimization problem which obtains a tighter regret bound of Õ(√Q), where Q is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment even in the bandit setting. Our algorithm is efficient and applies several new ideas to bandit optimization such as reservoir sampling. Keywords: multi-armed bandit, regret minimization, online learning
2 0.94090194 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Author: John Duchi, Elad Hazan, Yoram Singer
Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms. Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization
3 0.51746196 28 jmlr-2011-Double Updating Online Learning
Author: Peilin Zhao, Steven C.H. Hoi, Rong Jin
Abstract: In most kernel based online learning algorithms, when an incoming instance is misclassified, it will be added into the pool of support vectors and assigned with a weight, which often remains unchanged during the rest of the learning process. This is clearly insufficient since when a new support vector is added, we generally expect the weights of the other existing support vectors to be updated in order to reflect the influence of the added support vector. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short, that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified example received at the current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be improved by the proposed online learning method. We conduct an extensive set of empirical evaluations for both binary and multi-class online learning tasks. The experimental results show that the proposed technique is considerably more effective than the state-of-the-art online learning algorithms. The source code is available to public at http://www.cais.ntu.edu.sg/˜chhoi/DUOL/. Keywords: online learning, kernel method, support vector machines, maximum margin learning, classification
4 0.43616641 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir
Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, when the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that reveal that our algorithm has a good performance against both partial information and full information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory
5 0.42600226 104 jmlr-2011-X-Armed Bandits
Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári
Abstract: We consider a generalization of stochastic bandits where the set of arms, X , is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by √n, that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches. Keywords: bandits with infinitely many arms, optimistic online optimization, regret bounds, minimax rates
6 0.2824679 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
7 0.24960361 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
8 0.2371911 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
9 0.23404831 36 jmlr-2011-Generalized TD Learning
10 0.19644314 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
11 0.16394785 35 jmlr-2011-Forest Density Estimation
12 0.15190437 101 jmlr-2011-Variable Sparsity Kernel Learning
13 0.14431497 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
14 0.14361565 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
15 0.12996688 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
16 0.12542784 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
17 0.12420117 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
18 0.12258469 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
19 0.12153323 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling
20 0.11414259 22 jmlr-2011-Differentially Private Empirical Risk Minimization
topicId topicWeight
[(4, 0.037), (9, 0.031), (10, 0.028), (24, 0.04), (31, 0.078), (32, 0.013), (41, 0.025), (60, 0.014), (63, 0.016), (65, 0.015), (67, 0.459), (70, 0.014), (73, 0.024), (78, 0.076), (90, 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.80227518 14 jmlr-2011-Better Algorithms for Benign Bandits
Author: Elad Hazan, Satyen Kale
Abstract: The online multi-armed bandit problem and its generalizations are repeated decision making problems, where the goal is to select one of several possible decisions in every round, and incur a cost associated with the decision, in such a way that the total cost incurred over all iterations is close to the cost of the best fixed decision in hindsight. The difference in these costs is known as the regret of the algorithm. The term bandit refers to the setting where one only obtains the cost of the decision used in a given iteration and no other information. A very general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in some Euclidean space, and the cost functions are linear. Only recently an efficient algorithm attaining Õ(√T) regret was discovered in this setting. In this paper we propose a new algorithm for the bandit linear optimization problem which obtains a tighter regret bound of Õ(√Q), where Q is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment even in the bandit setting. Our algorithm is efficient and applies several new ideas to bandit optimization such as reservoir sampling. Keywords: multi-armed bandit, regret minimization, online learning
2 0.68802088 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
Author: Lauren A. Hannah, David M. Blei, Warren B. Powell
Abstract: We propose Dirichlet Process mixtures of Generalized Linear Models (DP-GLM), a new class of methods for nonparametric regression. Given a data set of input-response pairs, the DP-GLM produces a global model of the joint distribution through a mixture of local generalized linear models. DP-GLMs allow both continuous and categorical inputs, and can model the same class of responses that can be modeled with a generalized linear model. We study the properties of the DP-GLM, and show why it provides better predictions and density estimates than existing Dirichlet process mixture regression models. We give conditions for weak consistency of the joint distribution and pointwise consistency of the regression estimate. Keywords: Bayesian nonparametrics, generalized linear models, posterior consistency
3 0.32821074 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Author: John Duchi, Elad Hazan, Yoram Singer
Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms. Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization
4 0.32109353 104 jmlr-2011-X-Armed Bandits
Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári
Abstract: We consider a generalization of stochastic bandits where the set of arms, X , is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by √n, that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches. Keywords: bandits with infinitely many arms, optimistic online optimization, regret bounds, minimax rates
5 0.3011308 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir
Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, when the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that reveal that our algorithm has a good performance against both partial information and full information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory
6 0.2973218 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling
7 0.29423603 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments
8 0.29004875 36 jmlr-2011-Generalized TD Learning
9 0.2822516 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
10 0.28178328 28 jmlr-2011-Double Updating Online Learning
11 0.27915651 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
12 0.26849529 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
13 0.26732552 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
14 0.26522589 91 jmlr-2011-The Sample Complexity of Dictionary Learning
15 0.26363054 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
16 0.26329544 12 jmlr-2011-Bayesian Co-Training
17 0.26277485 16 jmlr-2011-Clustering Algorithms for Chains
18 0.2625806 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
19 0.26123208 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
20 0.26061136 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates