jmlr jmlr2010 jmlr2010-97 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jean-Yves Audibert, Sébastien Bubeck
Abstract: This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient, as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster), based on an arbitrary function ψ for which we propose a unified analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + γ/K, INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results, while for the expected regret we slightly tighten the bounds. On the other hand, with ψ(x) = (η/(−x))^q + γ/K, which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandit games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.
Reference: text
sentIndex sentText sentNum sentScore
1 In particular, for ψ(x) = exp(ηx) + γ/K, INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results, while for the expected regret we slightly tighten the bounds. [sent-7, score-0.528]
2 On the other hand, with ψ(x) = (η/(−x))^q + γ/K, which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandit games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. [sent-8, score-0.492]
3 Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. [sent-10, score-0.435]
4 Introduction This section starts by defining the prediction tasks, the different regret notions that we will consider, and the different adversaries of the forecaster. [sent-14, score-0.38]
5 1 The Four Prediction Tasks We consider a general prediction game where at each stage, a forecaster (or decision maker) chooses one action (or arm), and receives a reward from it. [sent-17, score-0.687]
6 Then the forecaster receives feedback about the rewards, which he can use to make his choice at the next stage. [sent-18, score-0.367]
7 In the simplest version, after choosing an arm the forecaster observes the rewards of all arms. [sent-20, score-0.534]
8 This prediction game is the label efficient game; – only g_{I_t,t} is observed in the bandit game; – only his obtained reward g_{I_t,t} is observed, and only if he asks for it, with the global constraint that he is not allowed to ask for it more than m times for some fixed integer 1 ≤ m ≤ n. [sent-43, score-0.833]
9 This prediction game is the label efficient bandit game. [sent-44, score-0.72]
10 In the label efficient game, originally proposed by Helmbold and Panizza (1997), after choosing its action at a stage, the forecaster decides whether to ask for the rewards of the different actions at this stage, knowing that he is allowed to do it a limited number of times. [sent-48, score-0.398]
11 Another classical setting is the bandit game where the forecaster only observes the reward of the arm he has chosen. [sent-49, score-1.209]
12 In its original version (Robbins, 1952), this game was considered in a stochastic setting, that is, nature draws the rewards from a fixed product-distribution. [sent-50, score-0.461]
13 A combination of the two previous settings is the label efficient bandit game (György and Ottucsák, 2006), in which the only observed rewards are the ones obtained and asked by the forecaster, with again a limitation on the number of possible queries. [sent-53, score-0.866]
14 Much attention has been devoted to the characterization of the minimax expected regret in the different games we have described. [sent-60, score-0.392]
15 More precisely for a given game, let us write sup for the supremum over all allowed adversaries and inf for the infimum over all forecaster strategies for this game. [sent-61, score-0.556]
16 We are interested in the quantity: inf sup ERn , where the expectation is with respect to the possible randomization of the forecaster and the adversary. [sent-62, score-0.406]
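For reference, the two regret notions used throughout can be written as follows (a sketch of the standard definitions, consistent with the notation g_{i,t} for the reward of arm i at round t and I_t for the arm chosen by the forecaster; the pseudo-regret is often written simply Rn in the theorem statements below):

```latex
\bar{R}_n \;=\; \max_{1\le i\le K}\,\mathbb{E}\Big[\sum_{t=1}^{n} g_{i,t}-\sum_{t=1}^{n} g_{I_t,t}\Big]
\quad\text{(pseudo-regret)},
\qquad
\mathbb{E}R_n \;=\; \mathbb{E}\Big[\max_{1\le i\le K}\sum_{t=1}^{n} g_{i,t}-\sum_{t=1}^{n} g_{I_t,t}\Big]
\quad\text{(expected regret)}.
```

Since a maximum of expectations is never larger than the expectation of the maximum, one always has R̄n ≤ ERn, which is why upper bounds on the expected regret are the stronger statements.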
17 An even more general adversary is the oblivious one, in which the only constraint on the adversary is that the reward vectors are independent of the past decisions of the forecaster. [sent-80, score-0.504]
18 4 Known Regret Bounds Table 1 recaps existing lower and upper bounds on the minimax pseudo-regret and the minimax expected regret for general adversaries (i. [sent-86, score-0.511]
19 (2002b) for the bandit game and György and Ottucsák (2006) for the label efficient bandit game. [sent-92, score-1.094]
20 Table 1 exhibits several logarithmic gaps between upper and lower bounds on the minimax rate, namely: • log(K) gap for the minimax pseudo-regret in the bandit game as well as the label efficient bandit game. [sent-105, score-1.23]
21 • log(n) gap for the minimax expected regret in the bandit game as well as the label efficient bandit game. [sent-106, score-1.344]
22 The analysis is original (it avoids the traditional but scope-limiting argument based on the simplification of a sum of logarithms of ratios), and allows us to fill in the long open gap in the bandit problems with oblivious adversaries (and with general adversaries for the pseudo-regret notion). [sent-111, score-0.752]
23 Bounds on the expected regret that are deduced from these PAC (“probably approximately correct”) regret bounds are better than previous bounds by a logarithmic factor in the games with limited information (see columns on inf sup ERn in Tables 1 and 2). [sent-115, score-0.837]
Table 2: New regret upper bounds proposed in this work (the flattened rows cover the label efficient and bandit games against deterministic, oblivious and general adversaries; the entries involve quantities of order n√(K/m), √(nK) and √(nK log K), with additional log(Kδ−1) factors for the high probability bounds).
27 Another “orthogonal” contribution is the proposal of a new biased estimate of the rewards in bandit games, which allows us to achieve high probability regret bounds depending on the performance of the optimal arm: in this new bound, the factor n is replaced by Gmax = max_{i=1,…,K} G_{i,n}. [sent-126, score-0.753]
28 First it gives the first lower bound for the label efficient bandit game. [sent-140, score-0.413]
29 Secondly, in the case of the label efficient (full information) game, it gives a simpler proof than the one proposed in Cesa-Bianchi et al. [sent-141, score-0.397]
30 The last contribution of this work is also independent of the previous ones and concerns the stochastic bandit game (that is, the bandit game with a “stochastic” adversary). [sent-148, score-1.378]
31 The interest of Poly INF appears in the bandit games where it satisfies a regret bound without a logarithmic factor, unlike exponentially weighted average forecasters. [sent-159, score-0.817]
32 Section 6 provides high probability bounds in the bandit games that depend on the cumulative reward of the optimal arm: the factor n is replaced by max_{1≤i≤K} ∑_{t=1}^n g_{i,t}. [sent-160, score-0.673]
33 1 of Cesa-Bianchi and Lugosi (2006), where the probabilities are proportional to a function of the difference between the (estimated) cumulative reward of arm i and the cumulative reward of the policy, which should be, for a well-performing policy, of order C(Vt ). [sent-213, score-0.479]
34 The interesting feature of the implicit normalization is the following argument, which allows us to recover the results concerning the exponentially weighted average forecasters, and more interestingly to propose a policy having a regret of order √(nK) in the bandit game with an oblivious adversary. [sent-214, score-1.113]
35 In fact, it is exactly the cumulative gain in the bandit game when v_{i,t} = (g_{i,t}/p_{i,t}) 1_{I_t=i}, and its expectation is exactly the expected cumulative reward in the full information game when v_{i,t} = g_{i,t}. [sent-216, score-1.222]
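A one-line check, using only the notation just introduced, explains why this estimate is the right one for the bandit game: conditionally on the probabilities p_t used at round t, and for rewards that do not depend on the forecaster's randomization at that round,

```latex
\mathbb{E}\big[v_{i,t}\,\big|\,p_t\big]
  \;=\; \mathbb{E}\Big[\tfrac{g_{i,t}}{p_{i,t}}\,\mathbf{1}_{I_t=i}\,\Big|\,p_t\Big]
  \;=\; \tfrac{g_{i,t}}{p_{i,t}}\,\mathbb{P}(I_t=i\mid p_t)
  \;=\; g_{i,t},
```

so V_{i,t} = ∑_{s≤t} v_{i,s} is an unbiased estimate of the cumulative reward G_{i,t} even though only the chosen arm's reward is observed.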
36 The equality is interesting since, from (2), Cn approximates the maximum estimated cumulative reward max_{1≤i≤K} V_{i,n}, which should be close to the cumulative reward of the optimal arm max_{1≤i≤K} G_{i,n}, where G_{i,n} = ∑_{t=1}^n g_{i,t}. [sent-221, score-0.479]
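Operationally, the implicit normalization amounts to setting p_{i,t+1} = ψ(V_{i,t} − C_t), where C_t is the unique constant making the probabilities sum to one. The following is a minimal Python sketch of that computation — not the authors' implementation; the function names, the bisection search for C_t and the bracket-growing step are illustrative choices, and the two ψ factories only mirror the two parametrizations quoted above (γ < 1 is assumed).

```python
import math

def inf_probabilities(V, psi, tol=1e-10, max_iter=200):
    """Return p_i = psi(V_i - C), with C chosen so that the p_i sum to 1.

    V   : estimated cumulative rewards V_{i,t} (one entry per arm)
    psi : increasing function; psi(x) -> gamma/K as x -> -infinity, so the
          total mass sum_i psi(V_i - C) decreases from above 1 down to
          gamma < 1 as C grows, and the normalizing C is found by bisection.
    """
    lo = max(V) + 1e-12                 # psi may only be defined for negative arguments
    hi = max(V) + 1.0
    while sum(psi(v - hi) for v in V) > 1.0:
        hi = max(V) + 2.0 * (hi - max(V))   # grow the bracket until the mass drops below 1
    C = hi
    for _ in range(max_iter):
        C = 0.5 * (lo + hi)
        mass = sum(psi(v - C) for v in V)
        if abs(mass - 1.0) < tol:
            break
        if mass > 1.0:
            lo = C
        else:
            hi = C
    return [psi(v - C) for v in V]

def make_exp_psi(eta, gamma, K):
    # psi(x) = exp(eta*x) + gamma/K : the exponentially weighted average choice
    return lambda x: math.exp(eta * x) + gamma / K

def make_poly_psi(eta, q, gamma, K):
    # psi(x) = (eta/(-x))**q + gamma/K : the "Poly INF" choice (x < 0 only)
    return lambda x: (eta / (-x)) ** q + gamma / K
```

For instance, inf_probabilities(V, make_poly_psi(eta, q, gamma, len(V))) gives Poly INF probabilities, while with make_exp_psi and γ = 0 the output is p_i ∝ exp(ηV_i), i.e., the exponentially weighted average forecaster discussed next.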
37 We will prove that the Poly INF forecaster generates nicer probability updates than the exponentially weighted average forecaster, as for bandit games (label efficient or not) it allows us to remove the extraneous log K factor in the pseudo-regret bounds and some regret bounds. [sent-236, score-0.877]
38 In other words, for the full information case (label efficient or not), we recover the exponentially weighted average forecaster (with γ = 0), while for the bandit game we recover EXP3. [sent-256, score-0.985]
39 For the label efficient bandit game, it does not give us the GREEN policy proposed in Allenberg et al. [sent-257, score-0.421]
40 (2006) but rather the straightforward modification of the exponentially weighted average forecaster to this game (György and Ottucsák, 2006). [sent-258, score-0.649]
41 The Full Information (FI) Game The purpose of this section is to illustrate the general regret bounds given in Theorems 2 and 3 in the simplest case, when we set vi,t = gi,t , which is possible since the rewards for all arms are observed in the full information setting. [sent-278, score-0.449]
42 There exists η > 0 such that max_{1≤i≤K} ∑_{t=1}^n g_{i,t} − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} g_{i,t} satisfies (14); in particular, with η = √(8 log(K)/n) we get ERn ≤ √((n/2) log K) and ERn ≤ √(2 E Gmax log K). [sent-284, score-0.534]
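Assuming the right-hand side of (14) has the usual exponentially-weighted-average form (log K)/η + nη/8 — an assumption, since the extracted display above is incomplete — the quoted tuning is just the optimization of that expression over η:

```latex
\min_{\eta>0}\Big(\frac{\log K}{\eta}+\frac{n\eta}{8}\Big)
\ \text{is attained at}\
\eta=\sqrt{\frac{8\log K}{n}},
\quad\text{giving}\quad
\frac{\log K}{\eta}+\frac{n\eta}{8}=\sqrt{\frac{n\log K}{2}},
```

which matches the first expected regret bound stated above.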
43 Indeed, by taking the expectation in (14), we get E ∑_{t=1}^n ∑_{i=1}^K p_{i,t} g_{i,t} ≥ … ≥ E Gmax − √(2 E Gmax log K), where we use log(1 + x) ≥ x − x²/2 for any x ≥ 0 in the last inequality. [sent-288, score-0.9]
44 The Limited Feedback Games This section provides regret bounds for three limited feedback games: the label efficient game, the bandit game, and the mixed game, that is the label efficient bandit game. [sent-301, score-1.043]
45 1 Label Efficient Game (LE) The variants of the LE game consider that the number of queried reward vectors is constrained either strictly or just in expectation. [sent-303, score-0.447]
46 Note that we do not fulfill exactly the requirement of the LE game as we might ask a bit more than m reward vectors, but we fulfill the one of the simple LE game. [sent-310, score-0.447]
47 A similar result can be proved for the INF forecaster with ψ(x) = (η/(−x))^q, η > 0 and q of order log K. [sent-322, score-0.39]
48 These results are interesting for oblivious opponents, that is when the adversary’s choices of the rewards do not depend on the past draws and obtained rewards, since in this case Proposition 33 in Appendix D shows that one can extend bounds on the pseudo-regret Rn to the expected regret ERn . [sent-328, score-0.499]
49 Theorem (Exponentially weighted average forecaster in the LE game) Let ψ(x) = exp(ηx) with η = √(8m log K)/n. [sent-342, score-0.41]
50 v_{i,t} = (g_{i,t}/p_{i,t}) 1_{I_t=i}, which is observable since the reward g_{I_t,t} is observed. Theorem 10 (Exponentially weighted average forecaster in the bandit game) Let ψ(x) = exp(ηx) + γ/K with 1 > γ ≥ 4ηK/5 > 0. [sent-363, score-0.728]
51 Then in the bandit game, INF satisfies: Rn ≤ (log K)/η + γ max_{1≤i≤K} E G_{i,n}. [sent-365, score-0.547]
52 In particular, for γ = min(1/2, √(4K log K/(5n))) and η = √(5 log K/(4nK)), we have Rn ≤ √((16/5) nK log K). [sent-366, score-0.45]
53 Theorem 12 (Exponentially weighted average forecaster in the bandit game) Consider ψ(x) = exp(ηx) + γ/K with γ = min(2/3, 2√(K log(3K)/n)) and η = √(2 log(3K)/(Kn)). [sent-390, score-0.615]
54 Then in the bandit game, against any adversary (possibly a non-oblivious one), for any δ > 0, with probability at least 1 − δ, INF satisfies: Rn ≤ 3√(nK log(3K)) + √(2nK log(Kδ−1)), and also ERn ≤ (3 + √2)√(nK log(3K)). [sent-392, score-0.5]
55 Theorem 13 (Poly INF in the bandit game) Let ψ(x) = (η/(−x))^q + γ/K with γ = min(1/2, …). [sent-401, score-0.71]
56 Then, let k denote the arm for which the estimate Vi,n/2 = ∑1≤t≤n/2 vi,t of the cumulative reward of arm i is the smallest. [sent-408, score-0.51]
57 After time n/2, all rewards are chosen to be equal to zero except for arm k̂, for which the rewards are still chosen to be equal to 1. [sent-409, score-0.421]
58 When V_{j,n/2} ≥ κ√(nK log K) for the arms j ∈ {1, …, K} (for some small enough κ > 0), it seems that the INF algorithm achieving a bound of order √(nK) on ERn in the oblivious setting will suffer an expected regret of order at least √(nK log K). [sent-413, score-0.66]
59 3 Label Efficient Bandit Game (LE Bandit) The following theorems concern the simple LE bandit game, in which the requirement is that the expected number of queried rewards should be less than or equal to m. [sent-416, score-0.482]
60 Note that this policy does not fulfil exactly the requirement of the LE bandit game as we might ask a bit more than m rewards, but, as was argued in Section 5. [sent-419, score-0.724]
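The reason only the expected number of queries is controlled is that the simple LE policies decide whether to query by an independent coin flip each round; a quick calculation shows the constraint is met in expectation even though the realized number of queries can exceed m (a sketch; Z_t for the query indicator is an assumed notation):

```latex
Z_t \sim \mathrm{Bernoulli}(\varepsilon)\ \text{i.i.d. with}\ \varepsilon = m/n
\quad\Longrightarrow\quad
\mathbb{E}\Big[\sum_{t=1}^{n} Z_t\Big] = n\varepsilon = m,
\qquad\text{while}\quad \mathbb{P}\Big(\sum_{t=1}^{n} Z_t > m\Big) > 0 .
```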
61 Theorem 15 (Exponentially weighted average forecaster in the simple LE bandit game) Let ψ(x) = exp(ηx) + γ/K with γ = min(1/2, √((4K log K)/(5m))), η = …, and ε = m/n. [sent-423, score-0.765]
62 Then in the simple LE bandit game, INF satisfies: n Rn ≤ (1 − γ) 5m log K 4K . [sent-424, score-0.505]
63 Then in the simple LE bandit game, against a deterministic adversary, for any δ > 0, with probability at least 1 − δ, INF satisfies: Rn ≤ 10. [sent-439, score-0.391]
64 Regret Bounds Scaling with the Optimal Arm Rewards In this section, we provide regret bounds for bandit games depending on the performance of the optimal arm: in these bounds, the factor n is essentially replaced by Gmax = max_{i=1,…,K} G_{i,n}. [sent-448, score-0.791]
65 Such a bound has been proved on the expected regret for deterministic adversaries in the seminal work of Auer et al. [sent-452, score-0.443]
66 For deterministic adversaries, as in the bandit game, the log K factor appearing in the exponentially weighted average forecaster regret bound disappears in the Poly INF regret bound as follows. [sent-456, score-1.355]
67 Then in the bandit game, against a deterministic adversary, for any δ > 0, with probability at least 1 − δ, INF satisfies: Rn ≤ 4. [sent-460, score-0.391]
68 Then in the bandit game, against a deterministic adversary, for any δ > 0, with probability at least 1 − δ, INF satisfies: Rn ≤ 9√(K Gmax) + √(2K Gmax log(δ−1)) (20), and ERn ≤ 10√(K Gmax) (21). [sent-470, score-0.433]
69 For more general adversaries than fully oblivious ones, we have the following result, in which the log K factor reappears. [sent-471, score-0.399]
70 Then in the bandit game, against any adversary (possibly a non-oblivious one), for any δ > 0, with probability at least 1 − δ, INF satisfies: Rn ≤ 9 2 G2 max K log(3K) + 4 G0 KG0 + log(3K) 2KG0 log(Kδ−1 ). [sent-475, score-0.542]
71 Then in the bandit game, against any adversary 2KG0 (possibly a non-oblivious one), for any δ > 0, with probability at least 1 − δ, INF satisfies: Rn ≤ 5 2 G2 1 max K log(3K) + G0 2 KG0 log(3K) + 2803 2KG0 log(Kδ−1 ). [sent-480, score-0.542]
72 The regret of a forecaster with respect to the best switching strategy with S switches is then given by: R_n^S = max_{(i_1,…,i_n) with at most S switches} ∑_{t=1}^n g_{i_t,t} − ∑_{t=1}^n g_{I_t,t}. [sent-501, score-0.514]
73 t=1 Theorem 22 (INF for tracking the best expert in the bandit game) Let s = S log 3nK + 2 log K with the natural convention S log(3nK/S) = 0 for S = 0. [sent-511, score-0.655]
74 Note that for S = 0, we have R_n^S = Rn, and we recover an expected regret bound of order √(nK log K) similar to the one of Theorem 12. [sent-515, score-0.409]
75 In Section 6, we provide regret bounds scaling with the cumulative reward of the optimal arm. [sent-536, score-0.427]
76 For this kind of result, renormalizing will not lead to regret bounds scaling with the cumulative reward (before renormalization) of the optimal arm, and consequently the study of INF directly applied to the observed rewards is necessary. [sent-537, score-0.554]
77 In particular, obtaining low regret bounds when the optimal arm has small cumulative loss would require appropriate modifications in the proof. [sent-538, score-0.481]
78 Stochastic Bandit Game By considering the deterministic case when the rewards are g_{i,t} = 1 if i = 1 and g_{i,t} = 0 otherwise, it can be proved that the INF policies considered in Theorem 10 and Theorem 11 have a pseudo-regret lower bounded by √(nK). [sent-540, score-0.416]
79 We recall that in the stochastic bandit considered in this section, the rewards g_{i,1}, g_{i,2}, … drawn from a given arm i form an i.i.d. sequence. [sent-542, score-0.482]
80 We now provide a strategy achieving a √(nK) regret in the worst case, and a much smaller regret as soon as the gaps ∆i of the suboptimal arms are much larger than √(K/n). [sent-550, score-0.515]
81 Figure 3: The proposed policy for the stochastic bandit game. [sent-568, score-0.39]
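As a concrete illustration of the kind of policy meant here, below is a minimal Python sketch of a MOSS-style index rule (a modification of UCB1 in which the exploration bonus uses log(n/(K·s)) truncated at zero, so that it vanishes for heavily played arms). This is a sketch under assumptions — rewards in [0, 1], known horizon n, hypothetical function names — and the exact index and constants of the policy in Figure 3 may differ.

```python
import math
import random

def moss_style_policy(pull, K, n):
    """Play n rounds over K arms with a MOSS-style index (a UCB1 variant).

    pull(i) -> reward in [0, 1] drawn from the (unknown) distribution of arm i.
    Index of arm j after s pulls with empirical mean mu:
        mu + sqrt(max(log(n / (K * s)), 0) / s)
    """
    counts = [0] * K
    sums = [0.0] * K
    total = 0.0
    for t in range(n):
        if t < K:
            i = t                                   # initialize: play each arm once
        else:
            def index(j):
                s = counts[j]
                bonus = math.sqrt(max(math.log(n / (K * s)), 0.0) / s)
                return sums[j] / s + bonus
            i = max(range(K), key=index)
        r = pull(i)
        counts[i] += 1
        sums[i] += r
        total += r
    return total

# toy usage with Bernoulli arms (hypothetical means)
means = [0.40, 0.50, 0.45]
total_reward = moss_style_policy(lambda i: float(random.random() < means[i]),
                                 K=len(means), n=10_000)
```

The truncation max(·, 0) replaces UCB1's log(t)/s bonus and is what is meant to remove the extra logarithmic factor from the distribution-free bound, while the per-arm logarithmic term keeps the distribution-dependent rate logarithmic in the number of plays, as stated in the abstract.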
82 This can be easily seen by considering a two-armed bandit in which both reward distributions are identical (and non-degenerate). [sent-577, score-0.36]
83 Now with the calculus done in the third step of the proof of Theorem 27 and using the notations hi := hi (Vt−1 + ξ(Vt −Vt−1 )), mi := mi (Vt−1 + ξ(Vt −Vt−1 )) we obtain ϕ′ (ξ) = K hj j=1 ∑K hk k=1 ∑ (V j,t −V j,t−1 ) 1Ii= j − 2810 mi = hi K (vi,t − v j,t )h j mi . [sent-635, score-0.405]
84 Let sup represent the supremum taken over all oblivious adversaries and inf the infimum taken over all forecasters; then the following holds true in the label efficient game: inf sup Rn ≥ 0. [sent-653, score-0.576]
85 and in the label efficient bandit game we have: inf sup Rn ≥ 0. [sent-655, score-0.72]
86 ∑_{i=1}^K P_i(J_n = i) ≤ … log(K − 1) (28). We will use (28) for the full information games when K > 3, and (27) for the bandit games and the full information games with K ∈ {2, 3}. [sent-678, score-0.406]
87 Denote by Ereward,i the expectation with respect to the reward generation process of the ith adversary, Erand the expectation with respect to the randomization of the forecaster and Ei the expectation with respect to both processes. [sent-700, score-0.371]
88 + βK log 1 − βK γε Then we have K − ∑ pi,t vi,t i=1 = Zt βgIt ,t pIt ,t log 1 − β εpIt ,t ≥ Zt − ≥ −gIt ,t βgIt ,t gIt ,t νgIt ,t + log 1 − ε ε εpIt ,t Zt νβ K − ∑ vi,t . [sent-799, score-0.45]
89 Thus, with probability at least 1 − δ, we have against deterministic adversaries max_{1≤i≤K} V_{i,n} ≥ Gmax − log(δ−1)/β, and against general adversaries max_{1≤i≤K} V_{i,n} ≥ Gmax − log(Kδ−1)/β. [sent-812, score-0.566]
90 For deterministic adversaries, the arm achieving a cumulative reward equal to Gmax is deterministic (it does not depend on the randomization of the forecaster). [sent-864, score-0.413]
91 From (12) and as in the proof of Theorem 13, by using (40) with ν = (1 − γ) n γ βK K t=1 (45) i=1 max Vi,n − ∑ gIt ,t − νβ ∑ Vi,n ≤ (1 − γ) 1≤i≤K hence (1 − γ − νβK) n max Vi,t − ∑ gIt ,t ≤ (1 − γ) 1≤i≤K t=1 1 + log(1−βK/γ) , we obtain log K , η log K . [sent-905, score-0.416]
92 We have thus proved that for any 0 < γ < 1, η > 0 and β > 0 such that (46) holds, we have RS ≤ (γ + βK)n + n log K β γ 1 ˜ − + S ρ − log ˜ ρ K η η + log(Kδ−1 ) S log(Ken/S) + , 2β 2β ˜ with ρ = 1+Kβ exp (1+β) Kη . [sent-1028, score-0.412]
93 Therefore, by using Hoeffding's inequality, Eτ_k − ℓ_0 ≤ ∑_{ℓ=ℓ_0}^{+∞} P(μ_{k,ℓ} − μ_k ≥ c∆_k) ≤ ∑_{ℓ=ℓ_0}^{+∞} exp(−2ℓ(c∆_k)²) = exp(−2ℓ_0(c∆_k)²)/(1 − exp(−2(c∆_k)²)) ≤ exp(−14c² log(75))/(1 − exp(−2c²∆_k²)), where the last inequality uses ℓ_0∆_k² ≥ 7 log(75). [sent-1067, score-0.52]
94 We have s P ∃1 ≤ s ≤ f (u) : ∑ (µ1 − Xt ) > s log t=1 +∞ n Ks s f (u) 1 n2ℓ f (u) : ∑ (µ1 − Xt ) > log 2ℓ+1 2ℓ 2ℓ+1 K f (u) t=1 ℓ=0 ℓ 1 +∞ +∞ f (u) 2ℓ+1 log Kn2 f (u) n = ∑ K f (u) 1 = 16K log u . [sent-1087, score-0.6]
95 In particular, this means that the worst oblivious adversary for a forecaster cannot lead to a larger regret than the worst deterministic adversary. [sent-1101, score-0.938]
96 While the previous proposition is useful for upper bounding the regret of a forecaster against the worst oblivious adversary, it does not say anything about the difference between the expected regret and the pseudo-regret for a given adversary. [sent-1105, score-1.024]
97 The next proposition gives an upper bound on this difference for fully oblivious adversaries, which are (oblivious) adversaries that generate the reward vectors independently. [sent-1106, score-0.424]
98 Proposition 34 For fully oblivious adversaries, we have ERn − Rn ≤ √((n log K)/2), and ERn − Rn ≤ √(2 log(K) max_i E ∑_{t=1}^n g_{i,t}) + log K. [sent-1107, score-0.443]
99 We can strengthen the previous result on the difference between the expected regret and the pseudo-regret when we consider the stochastic bandit game, in which the rewards coming from a given arm form an i.i.d. sequence. [sent-1119, score-0.881]
100 In the stochastic bandit game, we have ERn − Rn ≤ √((n log|I|)/2) + ∑_{i∈J} (1/∆_i) exp(−n∆_i²/2), and also ERn − Rn ≤ √((n log|I|)/2) + ∑_{i∈J} [((8σ_i² + 4∆_i/3)/∆_i) exp(−n∆_i²/(8σ_i² + 4∆_i/3)) + ((8σ_{i*}² + 4∆_i/3)/∆_i) exp(−n∆_i²/(8σ_{i*}² + 4∆_i/3))], where for any j ∈ {1, … [sent-1131, score-0.991]
wordName wordTfidf (topN-words)
[('bandit', 0.355), ('game', 0.334), ('git', 0.323), ('ern', 0.252), ('forecaster', 0.24), ('regret', 0.232), ('gmax', 0.208), ('arm', 0.167), ('log', 0.15), ('adversaries', 0.148), ('adversary', 0.145), ('vt', 0.144), ('ubeck', 0.142), ('onitoring', 0.137), ('nk', 0.128), ('rewards', 0.127), ('inf', 0.124), ('games', 0.123), ('egret', 0.122), ('nder', 0.117), ('reward', 0.113), ('audibert', 0.112), ('exp', 0.112), ('oblivious', 0.101), ('zt', 0.098), ('ounds', 0.096), ('rn', 0.092), ('hi', 0.084), ('ct', 0.081), ('auer', 0.07), ('forecasters', 0.065), ('du', 0.055), ('poly', 0.054), ('cn', 0.053), ('pt', 0.053), ('jn', 0.051), ('arms', 0.051), ('kl', 0.05), ('theorem', 0.049), ('hk', 0.045), ('egmax', 0.044), ('cumulative', 0.043), ('pit', 0.042), ('max', 0.042), ('bounds', 0.039), ('ei', 0.039), ('eit', 0.038), ('ks', 0.038), ('vit', 0.037), ('bandits', 0.037), ('minimax', 0.037), ('deterministic', 0.036), ('exponentially', 0.036), ('inequality', 0.036), ('fi', 0.036), ('le', 0.035), ('policy', 0.035), ('page', 0.034), ('wt', 0.034), ('pi', 0.033), ('ewn', 0.033), ('proof', 0.032), ('partial', 0.031), ('label', 0.031), ('fit', 0.029), ('kgmax', 0.027), ('moss', 0.027), ('nks', 0.027), ('nu', 0.027), ('bound', 0.027), ('zk', 0.026), ('lugosi', 0.026), ('wn', 0.025), ('eg', 0.025), ('tk', 0.024), ('sup', 0.024), ('logarithmic', 0.024), ('qt', 0.024), ('ottucs', 0.023), ('remark', 0.023), ('abel', 0.022), ('bt', 0.022), ('satis', 0.021), ('policies', 0.021), ('ek', 0.021), ('weighted', 0.02), ('lemma', 0.02), ('dence', 0.02), ('xt', 0.02), ('let', 0.02), ('rs', 0.019), ('mi', 0.019), ('fano', 0.019), ('rgy', 0.019), ('upper', 0.018), ('pn', 0.018), ('randomization', 0.018), ('hoeffding', 0.018), ('proposition', 0.017), ('round', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
Author: Jean-Yves Audibert, Sébastien Bubeck
Abstract: This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient, as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster), based on an arbitrary function ψ for which we propose a unified analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + γ/K, INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results, while for the expected regret we slightly tighten the bounds. On the other hand, with ψ(x) = (η/(−x))^q + γ/K, which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandit games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.
2 0.20563069 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
Author: Thomas Jaksch, Ronald Ortner, Peter Auer
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). √ ˜ We present a reinforcement learning algorithm with total regret O(DS AT ) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of √ Ω( DSAT ) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T . Finally, we also consider a setting where the MDP is allowed to change a fixed number of ℓ times. We present a modification of our algorithm that is able to deal with this setting and show a √ ˜ regret bound of O(ℓ1/3 T 2/3 DS A). Keywords: undiscounted reinforcement learning, Markov decision process, regret, online learning, sample complexity
3 0.11719915 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
Author: Lin Xiao
Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1 -norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1 -regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method. Keywords: stochastic learning, online optimization, ℓ1 -regularization, structural convex optimization, dual averaging methods, accelerated gradient methods
4 0.08822725 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
Author: Vladimir Koltchinskii
Abstract: Sequential algorithms of active learning based on the estimation of the level sets of the empirical risk are discussed in the paper. Localized Rademacher complexities are used in the algorithms to estimate the sample sizes needed to achieve the required accuracy of learning in an adaptive way. Probabilistic bounds on the number of active examples have been proved and several applications to binary classification problems are considered. Keywords: active learning, excess risk, Rademacher complexities, capacity function, disagreement coefficient
Author: Dapo Omidiran, Martin J. Wainwright
Abstract: We consider the problem of high-dimensional variable selection: given n noisy observations of a k-sparse vector β∗ ∈ R p , estimate the subset of non-zero entries of β∗ . A significant body of work has studied behavior of ℓ1 -relaxations when applied to random measurement matrices that are dense (e.g., Gaussian, Bernoulli). In this paper, we analyze sparsified measurement ensembles, and consider the trade-off between measurement sparsity, as measured by the fraction γ of nonzero entries, and the statistical efficiency, as measured by the minimal number of observations n required for correct variable selection with probability converging to one. Our main result is to prove that it is possible to let the fraction on non-zero entries γ → 0 at some rate, yielding measurement matrices with a vanishing fraction of non-zeros per row, while retaining the same statistical efficiency as dense ensembles. A variety of simulation results confirm the sharpness of our theoretical predictions. Keywords: variable selection, sparse random projections, high-dimensional statistics, Lasso, consistency, ℓ1 -regularization
6 0.083458818 80 jmlr-2010-On-Line Sequential Bin Packing
7 0.080508441 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory
8 0.072722539 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models
9 0.058571644 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
10 0.055234924 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
11 0.051303484 25 jmlr-2010-Composite Binary Losses
12 0.050901439 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
13 0.050586648 72 jmlr-2010-Matrix Completion from Noisy Entries
14 0.045750439 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
15 0.044371773 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
16 0.039409697 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
17 0.036882207 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
18 0.036766239 69 jmlr-2010-Lp-Nested Symmetric Distributions
19 0.036129035 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
20 0.035548285 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
topicId topicWeight
[(0, -0.167), (1, -0.15), (2, 0.139), (3, -0.098), (4, 0.047), (5, -0.091), (6, -0.002), (7, 0.154), (8, 0.115), (9, 0.184), (10, 0.321), (11, 0.041), (12, 0.03), (13, -0.141), (14, 0.103), (15, 0.175), (16, -0.075), (17, -0.046), (18, 0.043), (19, 0.066), (20, -0.014), (21, -0.104), (22, -0.015), (23, 0.025), (24, 0.013), (25, 0.015), (26, -0.02), (27, 0.171), (28, -0.03), (29, -0.141), (30, 0.062), (31, -0.141), (32, -0.109), (33, 0.112), (34, -0.034), (35, 0.215), (36, -0.141), (37, -0.072), (38, 0.028), (39, -0.004), (40, -0.004), (41, -0.008), (42, 0.049), (43, -0.213), (44, -0.019), (45, 0.014), (46, -0.026), (47, 0.014), (48, 0.042), (49, -0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.93024814 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
Author: Jean-Yves Audibert, Sébastien Bubeck
Abstract: This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient, as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster), based on an arbitrary function ψ for which we propose a unified analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + γ/K, INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results, while for the expected regret we slightly tighten the bounds. On the other hand, with ψ(x) = (η/(−x))^q + γ/K, which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandit games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.
2 0.74057692 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
Author: Thomas Jaksch, Ronald Ortner, Peter Auer
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). √ ˜ We present a reinforcement learning algorithm with total regret O(DS AT ) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of √ Ω( DSAT ) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T . Finally, we also consider a setting where the MDP is allowed to change a fixed number of ℓ times. We present a modification of our algorithm that is able to deal with this setting and show a √ ˜ regret bound of O(ℓ1/3 T 2/3 DS A). Keywords: undiscounted reinforcement learning, Markov decision process, regret, online learning, sample complexity
3 0.36979106 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
Author: Vladimir Koltchinskii
Abstract: Sequential algorithms of active learning based on the estimation of the level sets of the empirical risk are discussed in the paper. Localized Rademacher complexities are used in the algorithms to estimate the sample sizes needed to achieve the required accuracy of learning in an adaptive way. Probabilistic bounds on the number of active examples have been proved and several applications to binary classification problems are considered. Keywords: active learning, excess risk, Rademacher complexities, capacity function, disagreement coefficient
4 0.34511483 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
Author: Lin Xiao
Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1 -norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1 -regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method. Keywords: stochastic learning, online optimization, ℓ1 -regularization, structural convex optimization, dual averaging methods, accelerated gradient methods
5 0.32676342 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models
Author: Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin
Abstract: Maximum entropy (Maxent) is useful in natural language processing and many other areas. Iterative scaling (IS) methods are one of the most popular approaches to solve Maxent. With many variants of IS methods, it is difficult to understand them and see the differences. In this paper, we create a general and unified framework for iterative scaling methods. This framework also connects iterative scaling and coordinate descent methods. We prove general convergence results for IS methods and analyze their computational complexity. Based on the proposed framework, we extend a coordinate descent method for linear SVM to Maxent. Results show that it is faster than existing iterative scaling methods. Keywords: maximum entropy, iterative scaling, coordinate descent, natural language processing, optimization
7 0.2853286 80 jmlr-2010-On-Line Sequential Bin Packing
8 0.28388488 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
9 0.26383778 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
10 0.2495634 72 jmlr-2010-Matrix Completion from Noisy Entries
11 0.24905375 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
12 0.24521795 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
13 0.24472085 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls
14 0.24100657 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
15 0.19663168 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
16 0.19318916 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
17 0.18348762 25 jmlr-2010-Composite Binary Losses
18 0.17804873 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory
19 0.17593876 34 jmlr-2010-Erratum: SGDQN is Less Careful than Expected
20 0.16778617 102 jmlr-2010-Semi-Supervised Novelty Detection
topicId topicWeight
[(1, 0.014), (3, 0.044), (4, 0.012), (8, 0.048), (15, 0.044), (21, 0.02), (32, 0.035), (36, 0.021), (37, 0.065), (38, 0.076), (59, 0.313), (75, 0.107), (82, 0.018), (85, 0.04), (93, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.70547348 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
Author: Jean-Yves Audibert, Sébastien Bubeck
Abstract: This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient, as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster), based on an arbitrary function ψ for which we propose a unified analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + γ/K, INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results, while for the expected regret we slightly tighten the bounds. On the other hand, with ψ(x) = (η/(−x))^q + γ/K, which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandit games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.
2 0.44573411 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
Author: Thomas Jaksch, Ronald Ortner, Peter Auer
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). √ ˜ We present a reinforcement learning algorithm with total regret O(DS AT ) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of √ Ω( DSAT ) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T . Finally, we also consider a setting where the MDP is allowed to change a fixed number of ℓ times. We present a modification of our algorithm that is able to deal with this setting and show a √ ˜ regret bound of O(ℓ1/3 T 2/3 DS A). Keywords: undiscounted reinforcement learning, Markov decision process, regret, online learning, sample complexity
3 0.40613189 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
4 0.40046486 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
Author: Lin Xiao
Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1 -norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1 -regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method. Keywords: stochastic learning, online optimization, ℓ1 -regularization, structural convex optimization, dual averaging methods, accelerated gradient methods
5 0.39186093 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
Author: Shiliang Sun, John Shawe-Taylor
Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory
7 0.38596871 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
8 0.38456827 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
10 0.38320354 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
11 0.38236561 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
12 0.38200295 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
13 0.38063365 109 jmlr-2010-Stochastic Composite Likelihood
14 0.38022763 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
15 0.37929425 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
16 0.37926158 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression
17 0.37768826 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models
18 0.37730649 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
19 0.37728509 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
20 0.37710381 82 jmlr-2010-On Learning with Integral Operators